For any questions, contact Alberto Anguita at aanguita@infomed.dia.fi.upm.es
Downloads
The source code of NCBI2RDF is available from the following FTP server:
http://www.bioinformatics.org/ftp/pub/ncbi2rdf/
The FTP site contains a README.TXT file which explains how to use the library, plus several examples, precompiled JARs, documentation, and the configuration files needed for installation.
Please read the README.TXT file to learn the purpose of each available file, or read the next subsection.
Library instructions
Introduction
The NCBI2RDF tool is a Java-based API for enabling RDF-compliant access to the NCBI databases. It offers a programmatic interface for posing
queries in SPARQL and receiving the results in SPARQL Results format. The API is straightforward to use, and its functionality can be
easily understood by looking at the provided examples.
Tool installation
The API can be used in a standalone Java application. All its functionality is bundled in the JAR that can be downloaded at
the following web page: http://www.bioinformatics.org/ftp/pub/ncbi2rdf/.
The tool installation includes the following files:
- README.txt: this file
- NCBI2RDF.jar: the Java library containing all the tool code (including third-party libraries)
- JavaDoc.rar: the Javadoc documentation of the API
- examples.rar: a set of three examples in Java
- RDFSchema.rdf: the RDF schema that NCBI2RDF generates and that represents the available data in NCBI
- ConfigFiles.rar: this archive file contains a set of XML configuration files that NCBI2RDF needs in order to work correctly
To use the API in a Java project:
i) Download and decompress ConfigFiles.rar in the root directory of your Java project. This will create a directory called
EutolsWrapper, containing three more directories with the XML configuration files. These files must be in place whenever the NCBI2RDF
API is invoked.
ii) Download and import the NCBI2RDF jar library and use the public class es.upm.gib.eutilsrdfwrapper.Controller. This class
offers a series of static methods for performing RDF-compliant queries over the NCBI databases, described below.
- public static String launchQueryGetPath(String query);
  Performs a query and retrieves the results as a SPARQL Results file.
  query: a SPARQL query.
  Returns the path to the generated SPARQL Results file. This file will contain as many results as indicated in the
  LIMIT element of the SPARQL query, or 100 if no limit was indicated in the query.
- public static Results launchQueryGetResults(String query);
  Performs a query and retrieves the results as a Results object which allows iterating over them.
  query: a SPARQL query.
  Returns a Results object for reading the query results.
- public static String launchQueryGetPath(ConceptsQuery query);
  Performs a query and retrieves the results as a SPARQL Results file.
  query: a ConceptsQuery object containing the query to perform.
  Returns the path to the generated SPARQL Results file. This file will contain 100 results.
- public static Results launchQueryGetResults(ConceptsQuery query);
  Performs a query and retrieves the results as a Results object which allows iterating over them.
  query: a ConceptsQuery object containing the query to perform.
  Returns a Results object for reading the query results.
As can be seen, the first method in the list accepts a String parameter which must be a SPARQL-compliant query. This query should conform to the
provided RDF schema in order to generate results. The method generates a file in SPARQL Results format and returns its path.
The other methods offer different formats for specifying the query or obtaining the results. The ConceptsQuery class offers a programmatic way
of defining queries to the system. The Results class offers a programmatic way to retrieve the results of a posed query (it offers the methods
hasNext and nextRow to iterate through the query results).
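The file produced by the launchQueryGetPath methods follows the standard W3C SPARQL Results XML format, so it can be read with the JDK alone. Below is a minimal parsing sketch; the sample document and the firstBinding helper are illustrative only, not part of the NCBI2RDF API:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SparqlResultsReader {

    // Returns the first variable binding of a SPARQL Results XML document
    // as a "name=value" string. Uses only the JDK's namespace-aware DOM API.
    public static String firstBinding(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList bindings = doc.getElementsByTagNameNS(
                "http://www.w3.org/2005/sparql-results#", "binding");
        Element first = (Element) bindings.item(0);
        return first.getAttribute("name") + "=" + first.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical result document, shaped like the files the API generates.
        String sample =
            "<?xml version=\"1.0\"?>"
          + "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
          + "<head><variable name=\"id\"/></head>"
          + "<results><result>"
          + "<binding name=\"id\"><literal>12345</literal></binding>"
          + "</result></results></sparql>";
        System.out.println(firstBinding(sample)); // prints "id=12345"
    }
}
```

In practice the XML would be read from the path returned by launchQueryGetPath rather than from an inline string.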
It is recommended to check the attached examples to see how the API is invoked with some sample queries.
String query = "PREFIX base: <http://aewrapper#>\n" +
               "SELECT ?id ?name ?desc_text\n" +
               "WHERE {\n" +
               "?exp base:identifier_string ?id .\n" +
               "?exp base:name_string ?name .\n" +
               "?exp a base:Experiment.Experiment .\n" +
               "?exp base:descriptions ?desc .\n" +
               "?desc base:text_string ?desc_text .\n" +
               "}";
String experimentId = "E-GEOD-1509";
// resultFile will contain the path to the file containing the results in SPARQL format
String resultFile = QueryProcessor.processQuery(query, experimentId);
In this code, the API is invoked with a SPARQL query (String) and a single experiment id (String). RDFbuilder translates
the data of the specified experiment into RDF and performs the given SPARQL query, producing a file with SPARQL results
format. There are some more options available when performing queries, such as the possibility of specifying a set
of keywords instead of a single experiment. Please refer to the JavaDoc to get further details.
The API makes use of the disk drive of the computer where it executes to cache data from ArrayExpress. The directory
for this cache is configurable through an XML configuration file. This file must be named
aewrapperConfig.xml and must be placed inside a directory named /AEWRAPPER_CONFIG, which must itself be inside the
base execution directory. For example, if the base execution directory is C:/executionDir/, then the XML configuration
file should be in C:/executionDir/AEWRAPPER_CONFIG/aewrapperConfig.xml. The config file root tag is <aewrapper-config>.
Inside this tag there is one mandatory tag named <base-dir>, and two optional tags named <limit-experiment-count> and <limit-vector-ranges>.
The value in the base-dir tag indicates the directory where the cache will be placed. This must be a valid
directory and it is necessary for the proper functioning of the library. Inside this directory we must also place
the "mage-rdf-model-empty.obm" file that comes bundled with this library.
Example of a config file (for a Windows system with the cache base dir in C:\ArrayExpressWrapper):
<?xml version="1.0" encoding="UTF-8"?>
<aewrapper-config>
<base-dir>C:\ArrayExpressWrapper</base-dir>
</aewrapper-config>
In this example, the file mage-rdf-model-empty.obm should be placed inside C:\ArrayExpressWrapper\
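The placement rules above can also be reproduced programmatically. The sketch below builds the AEWRAPPER_CONFIG/aewrapperConfig.xml layout under a given execution directory; the ConfigWriter class and writeConfig method are hypothetical helpers, not part of the library:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfigWriter {

    // Creates <executionDir>/AEWRAPPER_CONFIG/aewrapperConfig.xml with the
    // given cache directory in its <base-dir> tag, and returns the file path.
    public static Path writeConfig(Path executionDir, String cacheDir) throws IOException {
        Path configDir = executionDir.resolve("AEWRAPPER_CONFIG");
        Files.createDirectories(configDir);
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<aewrapper-config>\n"
          + "  <base-dir>" + cacheDir + "</base-dir>\n"
          + "</aewrapper-config>\n";
        return Files.write(configDir.resolve("aewrapperConfig.xml"), xml.getBytes("UTF-8"));
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary directory as a stand-in for the base execution directory.
        Path executionDir = Files.createTempDirectory("executionDir");
        Path config = writeConfig(executionDir, "C:\\ArrayExpressWrapper");
        System.out.println(config.getFileName()); // prints "aewrapperConfig.xml"
    }
}
```

Remember that mage-rdf-model-empty.obm must still be copied into the cache directory by hand, as described above.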
Another example of a config file, including the two optional tags:
<?xml version="1.0" encoding="UTF-8"?>
<aewrapper-config>
<base-dir>C:\ArrayExpressWrapper</base-dir>
<limit-experiment-count>5</limit-experiment-count>
<limit-vector-ranges>100</limit-vector-ranges>
</aewrapper-config>
In this example, with the optional tags we add two restrictions:
- We limit the number of experiments that are retrieved from the ArrayExpress database to the configured value (5 in this
example). This prevents the retrieval of too many experiments when resolving keyword queries. For example, if a query is
submitted with the keyword "organism", more than 23000 related experiments are found, and downloading that amount of
data could take several days. This value limits the number of experiments downloaded to answer a single query.
- We limit the number of instances that are loaded from each MAGE-ML model (discarding the rest). This
is useful when running the software on machines with fairly limited RAM. With a value of 10000, for instance,
the data should fit on a machine with 4 GB of RAM. NOTE: adjust your Java configuration to accept this amount
of memory; see for example http://www.caucho.com/resin-3.0/performance/jvm-tuning.xtp
Once the configuration file is properly set and placed, we can invoke the Java methods contained in the API. The
software will create a directory called localExps inside the cache directory for storing downloaded data. This
directory can be erased at any moment, thus clearing the cache. In addition, for each submitted query the API
creates a session directory inside the cache dir, with a name such as query_session__2011-06-02--12-36-53__0.
These directories store the files created to answer submitted queries, along with the result files for those queries.
They can be erased once the results have been retrieved.
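Since these directories can be deleted safely, cache cleanup can be scripted. Below is a sketch; the CacheCleaner class is a hypothetical helper (the directory names are those described above, and only cached or session data is removed):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CacheCleaner {

    // Deletes the localExps cache directory and any query_session__*
    // session directories found directly under the cache base directory.
    public static void clearCache(Path baseDir) throws IOException {
        try (Stream<Path> entries = Files.list(baseDir)) {
            entries.filter(p -> {
                        String name = p.getFileName().toString();
                        return name.equals("localExps") || name.startsWith("query_session__");
                    })
                   .forEach(CacheCleaner::deleteRecursively);
        }
    }

    // Removes a directory tree, deleting children before their parents.
    private static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try { Files.delete(p); } catch (IOException e) { throw new RuntimeException(e); }
            });
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate on a temporary stand-in for the cache directory.
        Path base = Files.createTempDirectory("aeCache");
        Files.createDirectories(base.resolve("localExps"));
        Files.createDirectories(base.resolve("query_session__2011-06-02--12-36-53__0"));
        clearCache(base);
        System.out.println(Files.exists(base.resolve("localExps"))); // prints "false"
    }
}
```

Do not point this at directories other than the configured cache base dir; in particular, the AEWRAPPER_CONFIG directory and mage-rdf-model-empty.obm file must be left in place.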
Contact
For any comments, questions or suggestions, please write an email to aanguita@infomed.dia.fi.upm.es