EDirect Perl library dependency problem
From Bioinformatics.Org Wiki
Background
My goal is to include EDirect [1] as part of the BIRCH bioinformatics package [2].
BIRCH ties together numerous third-party tools using BioLegato, a programmable GUI that we have developed. BioLegato interfaces are available for many sorts of data, including DNA and protein sequences, alignments, molecular markers, and phylogenetic trees [3]. A new BioLegato interface, called blncbi, provides a graphical interface for using the EDirect tools [4].
Summary
NCBI EDirect is a set of wrappers for calling NCBI Eutils. The wrappers are useful because they allow Eutils API functions to be called as command-line scripts. Because the scripts read and write through Unix pipes, they can be chained together, allowing complex use of Eutils functions within shell scripts.
EDirect has several platform dependency issues that need to be resolved:
- Requires CPAN. CPAN will always be present on systems used by Perl programmers, but it is not part of the standard "out of the box" installation on most MacOSX and Linux releases.
- Setup script - Actually two problems:
  - On some platforms, the setup.sh script has failed to find modules.
  - Where Perl uses object libraries, a gcc compiler must be present. Not all systems can be expected to have gcc. I know for a fact that gcc is NOT standard on any of the RedHat and Fedora releases, and presumably not on CentOS, which is a RedHat clone.
- Run time dependencies - Perl modules and object libraries (i.e., .so files) are not found.
Special considerations for BIRCH - In order to include the Edirect tools as part of a BIRCH distribution, it would be best if users didn't have to build Edirect on their systems. Many users will not have the sophistication to do that, and even sophisticated users will not necessarily have sysadmin privileges. The big problem seems to be that some .so files are needed, which would somehow have to be included, presumably, in the aux directory tree.
The dependency problem could be a special case on every host on which EDirect is installed. It will differ not just by platform and release, but also by whatever packages the user or sysadmin has installed on a particular machine.
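Because the missing pieces vary from host to host, a small probe run on the target machine can at least report what is absent before any build is attempted. The following is a minimal Python sketch, not part of EDirect or BIRCH: the function names and the default tool list are hypothetical, and the Perl module to test is whatever the caller passes in.

```python
import shutil
import subprocess


def missing_tools(tools=("perl", "cpan", "gcc")):
    """Return the names in `tools` that are not found on the PATH.

    The default list is an assumption based on the problems described
    above (Perl itself, CPAN, and a gcc compiler).
    """
    return [tool for tool in tools if shutil.which(tool) is None]


def perl_module_available(module):
    """Return True if `perl -M<module> -e 1` exits successfully.

    Returns False when perl itself is missing or the module cannot be
    loaded (for example, because a required .so file is not found).
    """
    perl = shutil.which("perl")
    if perl is None:
        return False
    result = subprocess.run(
        [perl, "-M" + module, "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("Missing tools:", missing_tools())
```

A probe like this only diagnoses the problem; it does not remove the need to ship or build the modules and .so files that turn out to be missing.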
If this were Python or Java, the solution would be to distribute a .pyc or .jar file. Unfortunately, Perl has no such mechanism.
Another approach is to rewrite ncbiquery.py as a Java application, using the EUtils Java API. This might actually be more reliable and less work to maintain in the future. In the long run, it might even be worthwhile to rewrite EDirect itself in Java using the same API. These ideas might have been good, but as of summer 2015 NCBI will no longer support its SOAP web services, which include the Java API.
Can we use the .fcgi http utilities? This seems to be the approach most commonly used in the programming world. So let's see what's available in the Java and Python worlds.
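To make the .fcgi option concrete, here is a minimal sketch of an esearch call over plain HTTP using only the Python standard library. It assumes the public Eutils base URL and the db, term and retmax parameters of esearch.fcgi; the function names are hypothetical, and error handling and NCBI's usage guidelines (such as rate limits) are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the NCBI Eutils .fcgi services.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Build an esearch.fcgi URL for the given database and query term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_BASE + "esearch.fcgi?" + query


def parse_id_list(xml_text):
    """Extract the record IDs from an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall("./IdList/Id")]


def esearch_ids(db, term, retmax=20):
    """Run the search over HTTP and return the matching IDs (needs network)."""
    with urlopen(build_esearch_url(db, term, retmax)) as response:
        return parse_id_list(response.read())
```

Nothing here depends on Perl, CPAN or compiled libraries, which is the main attraction of this route for BIRCH.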
Java
- BioJava - I am less optimistic about BioJava. Searches of the API don't seem to turn up API methods with words like Entrez or NCBI. The best I have found so far is a class NCBISequenceDB as a legacy class in BioJava 1.8.2. The current BioJava is 3.1. All it seems to have are some BLAST-specific classes.
Python
- BioPython - Bio.Entrez - This has methods for all the Eutils functions, and includes an XML parser. The PyDoc documentation seems easy to understand. One advantage is that it might be possible to use it directly in the existing ncbiquery.py code.
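As a rough illustration of how Bio.Entrez might slot into code such as ncbiquery.py, the sketch below runs an esearch and returns the matching IDs. It assumes BioPython is installed; the function name, the default database and the placeholder email address are hypothetical, and the optional entrez argument exists only so the function can be exercised without network access.

```python
def search_ids(term, db="nucleotide", retmax=5,
               email="you@example.org", entrez=None):
    """Return the IDs matching `term` in the Entrez database `db`.

    `email` is a placeholder; NCBI asks callers to identify themselves.
    `entrez` defaults to the Bio.Entrez module from BioPython.
    """
    if entrez is None:
        from Bio import Entrez as entrez  # requires BioPython
    entrez.email = email
    handle = entrez.esearch(db=db, term=term, retmax=retmax)
    try:
        # Entrez.read parses the XML reply into dict-like Python objects.
        record = entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])
```

A call such as search_ids("BRCA1[gene]") would then return a list of ID strings, which could be passed on to a fetch step in the same way.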
Tests
Test of build and execution on Linux and MacOSX systems: EDirect Test 12Dec14A
Test of portability to new systems: EDirect Portability Test 17Dec14A