Home Page

Introduction

RDFBuilder is a Java library designed to enable RDF-based access to the ArrayExpress public microarray repositories. The library and its source code are freely available through the Bioinformatics.org website.

Through the methods contained in this library, SPARQL queries can be submitted against the ArrayExpress repositories, producing results in SPARQL results format. Queries are written against an RDF schema that resembles the MAGE-OM object model.

We also developed a web application for easily testing the offered functionality. This application can be accessed at the following URL:

http://servet.dia.fi.upm.es:8080/ArrayExpressWrapperWeb/

Instructions for using the application can be found at the same address.

Downloads

The source code of RDFBuilder is available in the following Subversion (SVN) repository, which accepts anonymous access:

svn://bioinformatics.org/svnroot/rdfbuilder/trunk

There is a README.TXT file in the project root which explains how to use the library, plus several examples. Precompiled jars, documentation and files needed for installation can also be downloaded from the project's download directory, available at:

http://www.bioinformatics.org/ftp/pub/RDFBuilder/

(If this link is down please use http://achucarro.dia.fi.upm.es:8080/RDFBuilder/)

Please read the README.TXT file in this download directory to learn the purpose of the available files, or read the subsection "Standalone installation" below.

Library instructions

Introduction

This document explains how to use the RDFBuilder API to enable RDF-based access to the ArrayExpress microarray database. The API is written in Java 6, so any operating system with a compatible Java Virtual Machine is supported. Please ensure that Java is installed on the target machine before using this API. You can download the latest Java version from http://www.java.com/.

The goal of RDFBuilder is to enable RDF-based access to the ArrayExpress databases. These public databases store microarray data in MAGE-ML, a language for representing microarray experiment data developed by the ArrayExpress consortium. Without RDFBuilder, the only way to query these data is through a simple form-based web interface. Using RDFBuilder, researchers can submit SPARQL queries to ArrayExpress and receive results in SPARQL results format. Enabling RDF-based access to ArrayExpress will facilitate its integration with other biomedical data (e.g. clinical data).

Web-based access to the tool

The functionality of this API can be used and tested through a web-based graphical interface. Please go to the following address to access the tool:

http://servet.dia.fi.upm.es:8080/ArrayExpressWrapperWeb/

The website includes the necessary instructions for using the tool online.

Standalone installation

The API can be used in a standalone application. All its functionality is bundled in a JAR that can be downloaded from the following location: ftp://bioinformatics.org/pub/RDFBuilder/.

The download includes the following items:

 - A README.TXT file (this document) with use and installation instructions
 - A JAR file with the Java code of the RDFBuilder API
 - The JavaDoc documentation of the API, inside the javadoc.rar file
 - An example configuration file, named aewrapperConfig.xml (see instructions below for editing and placing this file)
 - The file mage-rdf-model-empty.obm, for internal use of the library. Once a directory has been configured as the cache directory for the library, this file must be placed inside that directory.
 - The mage-rdf-model-empty.owl file, containing the RDF-based schema of ArrayExpress (this schema defines the valid SPARQL queries)

The JAR file contains all the precompiled classes that make up RDFBuilder. To learn how to use these classes, please refer to the attached JavaDoc files. In addition, the class es.upm.gib.aewrapper.queryprocessing.test.SimpleTest contains a main method with the code needed to launch a query using RDFBuilder. The following code snippet shows how the API can be invoked:

    String query = "PREFIX base: <http://aewrapper#>\n" +
            "SELECT ?id ?name ?desc_text\n" +
            "WHERE {\n" +
            "?exp base:identifier_string ?id .\n" +
            "?exp base:name_string ?name .\n" +
            "?exp a base:Experiment.Experiment .\n" +
            "?exp base:descriptions ?desc .\n" +
            "?desc base:text_string ?desc_text .\n" +
            "}";
    String experimentId = "E-GEOD-1509";
    // resultFile will contain the path to the file containing the results in SPARQL format
    String resultFile = QueryProcessor.processQuery(query, experimentId);

In this code, the API is invoked with a SPARQL query (String) and a single experiment id (String). RDFBuilder translates the data of the specified experiment into RDF and runs the given SPARQL query against it, producing a file in SPARQL results format. More options are available when performing queries, such as specifying a set of keywords instead of a single experiment. Please refer to the JavaDoc for further details.
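Once a query has finished, the result file still has to be read. The following is a minimal sketch of one way to do that, assuming the file follows the W3C SPARQL Query Results XML Format (the text above only says "SPARQL results format", so this is an assumption). The class SparqlResultReader and its method are hypothetical helpers, not part of the RDFBuilder API, and the sketch targets Java 8 or later rather than Java 6. The sample document in main is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: not part of RDFBuilder.
class SparqlResultReader {

    // Namespace of the W3C SPARQL Query Results XML Format (assumed output format).
    private static final String NS = "http://www.w3.org/2005/sparql-results#";

    /** Returns one map per <result> element, keyed by variable name. */
    static List<Map<String, String>> readBindings(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(in);

        List<Map<String, String>> rows = new ArrayList<>();
        NodeList results = doc.getElementsByTagNameNS(NS, "result");
        for (int i = 0; i < results.getLength(); i++) {
            Element result = (Element) results.item(i);
            Map<String, String> row = new LinkedHashMap<>();
            NodeList bindings = result.getElementsByTagNameNS(NS, "binding");
            for (int j = 0; j < bindings.getLength(); j++) {
                Element binding = (Element) bindings.item(j);
                // The text content is the value of the nested <literal>, <uri> or <bnode>.
                row.put(binding.getAttribute("name"), binding.getTextContent().trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample standing in for a real result file.
        String sample =
                "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
                + "<head><variable name=\"id\"/><variable name=\"name\"/></head>"
                + "<results><result>"
                + "<binding name=\"id\"><literal>E-GEOD-1509</literal></binding>"
                + "<binding name=\"name\"><literal>Sample experiment</literal></binding>"
                + "</result></results></sparql>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (Map<String, String> row : readBindings(in)) {
            System.out.println(row);
        }
    }
}
```

To read a real result, the sample stream could be replaced with a java.io.FileInputStream opened on the path returned by the query call.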

The API uses the disk of the computer where it runs to cache data from ArrayExpress. The cache directory is set in an XML configuration file. This file must be named aewrapperConfig.xml and must be placed inside a directory named AEWRAPPER_CONFIG, which in turn must be inside the base execution directory. For example, if the base execution directory is C:/executionDir/, then the XML configuration file should be C:/executionDir/AEWRAPPER_CONFIG/aewrapperConfig.xml. The root tag of the config file is <aewrapper-config>.

Inside this tag there is one mandatory tag named <base-dir>, and two optional tags named <limit-experiment-count> and <limit-vector-ranges>.

The value in the base-dir tag indicates the directory where the cache will be placed. This must be a valid directory and it is necessary for the proper functioning of the library. Inside this directory we must also place the "mage-rdf-model-empty.obm" file that comes bundled with this library.

Example of a config file (for a Windows system with the cache base directory in C:\ArrayExpressWrapper):

    <?xml version="1.0" encoding="UTF-8"?>
    <aewrapper-config>
        <base-dir>C:\ArrayExpressWrapper</base-dir>
    </aewrapper-config>

In this example, the file mage-rdf-model-empty.obm should be placed inside C:\ArrayExpressWrapper\.
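The layout described above can also be prepared programmatically. The following is a minimal sketch, assuming the base execution directory is the current working directory and using a hypothetical cache directory named ArrayExpressWrapper inside it. The class AeWrapperSetup is illustrative and not part of the RDFBuilder API. It writes only the mandatory <base-dir> tag and does not escape XML special characters in the path.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of RDFBuilder.
class AeWrapperSetup {

    /** Writes AEWRAPPER_CONFIG/aewrapperConfig.xml under executionDir and returns its path. */
    static Path writeConfig(Path executionDir, Path cacheDir) throws IOException {
        Path configDir = executionDir.resolve("AEWRAPPER_CONFIG");
        Files.createDirectories(configDir);
        Files.createDirectories(cacheDir);

        // Only the mandatory <base-dir> tag is written; the path is not XML-escaped.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<aewrapper-config>\n"
                + "    <base-dir>" + cacheDir.toAbsolutePath() + "</base-dir>\n"
                + "</aewrapper-config>\n";
        Path configFile = configDir.resolve("aewrapperConfig.xml");
        Files.write(configFile, xml.getBytes(StandardCharsets.UTF_8));
        return configFile;
    }

    /** Returns true if mage-rdf-model-empty.obm is present in the cache directory. */
    static boolean hasModelFile(Path cacheDir) {
        return Files.isRegularFile(cacheDir.resolve("mage-rdf-model-empty.obm"));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the working directory is the base execution directory.
        Path executionDir = Paths.get("").toAbsolutePath();
        Path cacheDir = executionDir.resolve("ArrayExpressWrapper"); // hypothetical location
        Path config = writeConfig(executionDir, cacheDir);
        System.out.println("Wrote " + config);
        if (!hasModelFile(cacheDir)) {
            System.out.println("Copy mage-rdf-model-empty.obm into " + cacheDir);
        }
    }
}
```

Checking for the model file before the first query gives an early, readable message instead of a failure inside the library.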

Another example of config file, including the two optional tags:

    <?xml version="1.0" encoding="UTF-8"?>
    <aewrapper-config>
        <base-dir>C:\ArrayExpressWrapper</base-dir>
        <limit-experiment-count>5</limit-experiment-count>
        <limit-vector-ranges>100</limit-vector-ranges>
    </aewrapper-config>

In this example, with the optional tags we add two restrictions:

- We limit the number of experiments retrieved from the ArrayExpress database to 5 (the value of <limit-experiment-count>). This prevents the library from downloading too many experiments when solving keyword queries. For example, a query submitted with the keyword "organism" matches more than 23000 experiments, and downloading that much data alone could take several days. This value limits the number of experiments downloaded to answer a single query.

- We limit the number of instances loaded from each MAGE-ML model (here to 100, the value of <limit-vector-ranges>), discarding the rest. This is useful when running the software on machines with limited RAM. With a value of 10000, the data should fit in a machine with 4 GB of RAM. NOTE: adjust your Java configuration to allow this amount of memory. See, for example, http://www.caucho.com/resin-3.0/performance/jvm-tuning.xtp
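The maximum heap size of a JVM is usually raised with the standard -Xmx option (for example, java -Xmx4g ...). The following is a small sketch of a pre-flight check that reports how much heap the running JVM may use, so that a misconfigured launch is noticed before a large query is submitted. The class HeapCheck and the 3072 MiB threshold are illustrative assumptions, not values taken from RDFBuilder.

```java
// Hypothetical helper: not part of RDFBuilder.
class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Returns true if the JVM may grow its heap to at least requiredMiB. */
    static boolean heapAtLeast(long requiredMiB) {
        return toMiB(Runtime.getRuntime().maxMemory()) >= requiredMiB;
    }

    public static void main(String[] args) {
        long requiredMiB = 3072; // assumed threshold, adjust to your own limits
        long availableMiB = toMiB(Runtime.getRuntime().maxMemory());
        System.out.println("Maximum heap: " + availableMiB + " MiB");
        if (!heapAtLeast(requiredMiB)) {
            System.out.println("Consider restarting the JVM with a larger -Xmx value.");
        }
    }
}
```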

Once the configuration file is properly set and placed, we can invoke the Java methods contained in the API. The software will create a directory called localExps inside the cache directory for storing downloaded data. This directory can be erased at any moment, thus clearing the cache. In addition, for each submitted query, the API will create a session directory inside the cache directory, with a name such as query_session__2011-06-02--12-36-53__0. These directories store the files created to answer the query, including its result file. They can be erased once the results have been retrieved.
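Cleaning up these directories can be automated. The following is a minimal sketch, assuming that session directories always start with the prefix query_session__ as in the example name above. The class CacheCleaner is a hypothetical helper, not part of the RDFBuilder API. Result files should be copied elsewhere before running it, since they live inside the session directories.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of RDFBuilder.
class CacheCleaner {

    /** Recursively deletes a directory tree; does nothing if it does not exist. */
    static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir); // the directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /** Deletes every query_session__* directory in cacheDir and returns how many were removed. */
    static int deleteQuerySessions(Path cacheDir) throws IOException {
        List<Path> sessions = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(cacheDir, "query_session__*")) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    sessions.add(entry);
                }
            }
        }
        for (Path session : sessions) {
            deleteTree(session);
        }
        return sessions.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java CacheCleaner <cache-directory>");
            return;
        }
        Path cacheDir = Paths.get(args[0]);
        System.out.println("Removed " + deleteQuerySessions(cacheDir) + " session directories");
        // To clear downloaded experiments as well: deleteTree(cacheDir.resolve("localExps"));
    }
}
```

The sketch leaves mage-rdf-model-empty.obm and localExps untouched unless the last line is enabled, so the library can keep using the same cache directory afterwards.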

Contact

For any comments, questions or suggestions, please write an email to aanguita@infomed.dia.fi.upm.es