Dear biodevs,

Simon Twigger mentioned that organism genome databases in the GMOD group are moving toward an RDBMS schema, with Postgres as the example (but not the only) database management system. And Marc Dumontier talked about BIND's use of Lucene for object search/retrieval.

In the FlyBase project we are now using this Chado database schema, with Pg for management, but we are also moving to Lucene with a read-only document/object database for serving much of the public web access. It supports the close match between the "gene document objects" that customers see and the underlying data representation, and it is much speedier than generating such objects from the RDBMS. Basically it involves doing the same thing once, as a document dump of the database, rather than for each web request, and using Lucene to index and retrieve the gene objects on demand, reformatting to HTML (or other) as needed. Also, with Lucene it is easy to provide datamining support for bulk retrieval of sequences, genome attributes, etc., around the native format of these objects.

I'm appending a note about this GMOD project, LuceGene, for adapting Lucene to bio-data.

-- Don Gilbert

GMOD: LuceGene
Document/Object Search and Retrieval for Genome Databases
20 April 2004

Description

This is an open-source document/object search and retrieval system specially tuned for bioinformatics text databases and documents. It is part of the GMOD (Generic Model Organism Database) project, http://www.gmod.org/lucegene/, and also http://eugenes.org:8081/gmod/lucegene/

LuceGene is similar in concept to the widely used, commercially successful bioinformatics program SRS (Sequence Retrieval System). It is built on top of the open-source Lucene package, http://jakarta.apache.org/lucene/

Though written in Java, it can be used from command-line shells, and performs well that way (current uses include Perl CGIs calling lucegene). LuceGene uses Lucene unchanged, but adds Lucene class overrides for biology data.
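As a rough illustration of the dump-once, index, and retrieve-on-demand approach described in the cover note, here is a minimal sketch in plain Python. It does not use Lucene or LuceGene code: the gene records, field names, and function names (dump_documents, build_index, search, render_html) are all hypothetical, and the toy inverted index only stands in for what Lucene does far more capably.

```python
import html
import re
from collections import defaultdict

# Hypothetical records, standing in for rows joined out of a relational schema.
GENE_ROWS = [
    {"id": "GENE0001", "symbol": "exa1", "name": "example kinase one", "organism": "fly"},
    {"id": "GENE0002", "symbol": "exa2", "name": "example kinase two", "organism": "daphnia"},
    {"id": "GENE0003", "symbol": "exb1", "name": "example transporter", "organism": "fly"},
]


def dump_documents(rows):
    """Build one self-contained document per gene, once, ahead of any web request."""
    return {row["id"]: dict(row) for row in rows}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token, and each 'field:token' pair, to the ids of matching documents."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for field, value in doc.items():
            for token in tokenize(str(value)):
                index[token].add(doc_id)
                index[f"{field}:{token}"].add(doc_id)
    return index


def search(index, query):
    """AND together all query terms; each term is 'word' or 'field:word'."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = None
    for term in terms:
        ids = index.get(term, set())
        hits = ids if hits is None else hits & ids
    return sorted(hits)


def render_html(doc):
    """Reformat a stored document to HTML at request time."""
    rows = "".join(
        f"<dt>{html.escape(key)}</dt><dd>{html.escape(str(value))}</dd>"
        for key, value in doc.items()
    )
    return f"<dl>{rows}</dl>"


# The dump and the index are built once; each request only searches and reformats.
DOCUMENTS = dump_documents(GENE_ROWS)
INDEX = build_index(DOCUMENTS)

for gene_id in search(INDEX, "kinase organism:fly"):
    print(render_html(DOCUMENTS[gene_id]))
```

The point of the sketch is only the division of labor: the expensive step (assembling gene documents) happens once at dump time, while each request is a cheap index lookup plus reformatting.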
LuceGene includes common text search features: Booleans, phrases, word stemming, fuzzy and field range searches, relevance ranking. Lucene is comparable to the index/search methods used by web-indexing systems such as Glimpse, Excite, AltaVista, and Google.

LuceGene additions include
  Data input adaptors for HTML; XML (e.g. MedLine); FlyBase flatfile; Biosequences (GenBank, EMBL, etc.)
  Basic output formats for XML, HTML via XSLT, Text, Spreadsheet.
  Numeric Range search primitive (added April 2004).

It is being tested and used to search/retrieve from hundreds of thousands of data and document objects in the FlyBase and euGenes collections: genes, references, sequences and XML annotations, Medline abstracts, and HTML, PDF and text documents.

Public services using LuceGene (Apr 2004)
  euGenes multi-organism gene search/retrieval
    http://eugenes.org:7072/search/
  Daphnia/wFleaBase search for sequences, Medline abstracts, web documents
    http://eugenes.org:7182/search/
  FlyBase Annotated sequence bulk-retrieval service
    http://flybase.net/cgi-bin/gnoseqbatch
  FlyBase Apollo annotation data web service
    http://flybase.net/apollo/

Requirements

LuceGene requires Java 1.4 or later to compile and run. The Java Ant build system is supported for compiling sources. The Jakarta Lucene project library is included with this package, as are other required Java libraries. Lucene may also be obtained from http://jakarta.apache.org/lucene/

Downloads

Currently these alpha distribution files are available:
  lucegene-1.2-src.jar : sources, documents, configuration for base lucegene software with indexing methods for biology data
  lucegene.war : binary distribution, for webapp (Tomcat) use

See the cvs.sourceforge.net repository for gmod/lucegene. It is also available as part of the ARGOS genome database replication system at
  rsync://eugenes.org/argos/common/java/lucegene/
  http://eugenes.org:8081/gmod/lucegene/

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/