[Biodevelopers] Doc-object search/retrieval and RDBMS in bioinformatics

Don Gilbert gilbertd at bio.indiana.edu
Thu Apr 22 14:33:11 EDT 2004

Dear biodevs,

Simon Twigger mentioned that organism genome databases in the GMOD group
are moving toward an RDBMS schema, with Postgres as the example (but not
the only) management database.  And Marc Dumontier talked about BIND's use
of Lucene for object search/retrieval.

In the FlyBase project we are now using this Chado database schema with
Pg for management, but we are also moving to Lucene with a read-only
document/object database for serving much of the public web access.  It
supports a close match between the "gene document objects" that customers
see and the underlying data representation, and it is much speedier than
generating such objects from the RDBMS.  Basically it involves building
the objects once, as a document dump of the database, rather than for
each web request, and using Lucene to index and retrieve the gene objects
on demand, reformatting to HTML (or other formats) as needed.
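
The dump-once idea can be illustrated with a minimal sketch.  This is not
LuceGene or Lucene code: a plain Python dict stands in for the Lucene
index, and the record fields, the sample genes, and the helper names
(render_gene_document, build_index, search) are all hypothetical.

```python
# Illustrative only: a dict stands in for the Lucene index, and the
# gene records below are made-up stand-ins for rows from the RDBMS.

def render_gene_document(record):
    """Assemble one self-contained 'gene document' from a database record."""
    lines = [f"ID: {record['id']}", f"Symbol: {record['symbol']}"]
    lines += [f"Synonym: {s}" for s in record["synonyms"]]
    return "\n".join(lines)

def build_index(records):
    """Done once per database dump: render every document, index its words."""
    documents = {}   # document id -> rendered text
    postings = {}    # lower-cased word -> set of document ids
    for record in records:
        text = render_gene_document(record)
        documents[record["id"]] = text
        for word in text.replace(":", " ").split():
            postings.setdefault(word.lower(), set()).add(record["id"])
    return documents, postings

def search(documents, postings, word):
    """Done per web request: a lookup, with no joins or object assembly."""
    ids = sorted(postings.get(word.lower(), ()))
    return [documents[doc_id] for doc_id in ids]

records = [
    {"id": "G0001", "symbol": "abc", "synonyms": ["alpha"]},
    {"id": "G0002", "symbol": "xyz", "synonyms": ["alpha", "omega"]},
]
documents, postings = build_index(records)
hits = search(documents, postings, "omega")
```

The expensive step (build_index) runs once per dump; each request only
performs the cheap lookup in search.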

Also with Lucene it is easy to provide datamining support for
bulk retrieval of sequences, genome attributes, etc., around the
native format of these objects.  I'm appending a note about this
GMOD project LuceGene for adapting Lucene to bio-data.

-- Don Gilbert

GMOD: LuceGene
Document/Object Search and Retrieval for Genome Databases
20 April 2004


This is an open-source document/object search and retrieval system
specially tuned for bioinformatics text databases and documents. It is
part of the GMOD (Generic Model Organism Database) project,
http://www.gmod.org/lucegene/, and also

LuceGene is similar in concept to the widely used, commercially
successful bioinformatics program SRS (Sequence Retrieval System).
It is built on top of the open-source Lucene package.
Though written in Java, it can be used from command-line shells,
and performs well that way (current uses include Perl CGIs calling
lucegene). LuceGene uses Lucene unchanged, but adds Lucene class
overrides for biology data.

It includes common text search features: Boolean operators, phrases,
word stemming, fuzzy and field range searches, and relevance ranking.
Lucene is comparable to the index/search methods used by web-indexing
systems such as Glimpse, Excite, AltaVista, and Google.
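
To make two of these features concrete, here is a toy sketch of Boolean
AND and phrase matching over a positional word index.  It only shows the
general idea; it is not how Lucene is implemented, and the function names
and sample documents are made up for illustration.

```python
# Illustrative only: a toy positional index, not Lucene's data structures.

def build_positions(docs):
    """Map each lower-cased word to {doc_id: [word positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def match_all(index, words):
    """Boolean AND: ids of documents containing every word."""
    sets = [set(index.get(w.lower(), {})) for w in words]
    return set.intersection(*sets) if sets else set()

def match_phrase(index, phrase):
    """Phrase search: the words must occur at consecutive positions."""
    words = phrase.lower().split()
    hits = set()
    for doc_id in match_all(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words))
               for p in starts):
            hits.add(doc_id)
    return hits

docs = {
    "d1": "heat shock protein gene",
    "d2": "protein induced by heat and osmotic shock",
}
index = build_positions(docs)
both = match_all(index, ["heat", "shock"])     # both documents qualify
phrase = match_phrase(index, "heat shock")     # only d1 has the exact phrase
```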

LuceGene additions include data input adaptors for HTML, XML (e.g.
MedLine), FlyBase flatfile, and biosequences (GenBank, EMBL, etc.);
basic output formats for XML, HTML via XSLT, text, and spreadsheet;
and a numeric range search primitive (added April 2004).
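
This note does not say how the numeric range primitive is implemented.
One general reason a text index needs one is that terms compared as
strings do not sort numerically ("9" sorts after "10").  The sketch below
shows a common workaround, zero-padding numbers to a fixed width so that
string order matches numeric order.  It is an assumption made for
illustration, not LuceGene's actual method; encode, range_search and
WIDTH are hypothetical names.

```python
import bisect

WIDTH = 10  # assumed fixed width; values must be non-negative and fit in it

def encode(n):
    """Zero-pad so that string comparison agrees with numeric comparison."""
    if n < 0 or len(str(n)) > WIDTH:
        raise ValueError("value out of range for this encoding")
    return str(n).zfill(WIDTH)

def range_search(sorted_terms, low, high):
    """Return the encoded terms with low <= value <= high (inclusive)."""
    start = bisect.bisect_left(sorted_terms, encode(low))
    end = bisect.bisect_right(sorted_terms, encode(high))
    return sorted_terms[start:end]

positions = [9, 10, 250, 1200, 99999]
naive = sorted(str(p) for p in positions)       # string order, not numeric
terms = sorted(encode(p) for p in positions)    # padded: numeric order
found = [int(t) for t in range_search(terms, 10, 1200)]
```

With padded terms, a range lookup is a pair of binary searches over the
sorted term list.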

It is being tested and used to search and retrieve from hundreds of
thousands of data and document objects in the FlyBase and euGenes
collections: genes, references, sequences and XML annotations, Medline
abstracts, and HTML, PDF and text documents.

Public services using LuceGene (Apr 2004)

euGenes multi-organism gene search/retrieval

Daphnia/wFleaBase search for sequences, Medline abstracts, web documents

FlyBase Annotated sequence bulk-retrieval service 

FlyBase Apollo annotation data web service 


LuceGene requires Java 1.4 or later to compile and run.
The Java Ant build system is supported for compiling sources.
The Jakarta Lucene project library is included with this package, as
are other required Java libraries.  Lucene may also be obtained
from http://jakarta.apache.org/lucene/

Currently these alpha distribution files are available -
 lucegene-1.2-src.jar : sources, documents, configuration for base
lucegene software with indexing methods for biology data
 lucegene.war : binary distribution, for webapp (Tomcat) uses

See the cvs.sourceforge.net repository for gmod/lucegene. 
It is also available as part of the ARGOS genome database
replication system at

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/
