[Bioclusters] Gridlet test of BLAST using datagrid directories.

Mon, 25 Nov 2002 23:26:27 -0500 (EST)

Dear Bio-cluster folks,

This test (a quick hack) shows how one can use multiple
computers, including spare MacOSX, Windows, and Linux
workstations, to distribute and speed up large biosequence
analyses, BLAST in this example.  If you can split a large data
set into small subsets distributed to many computers, analyze each
subset, and reassemble the subset results into a whole, you should
be able to trade time for compute nodes.

Note that the data directory at bio-mirror.net is fragile - please
don't fetch more than a few thousand sequences from it in your tests
(though I've tested millions, it is still subject to
occasional faults and hasn't been tested under a large load).
Feel free to set up and run large tests on your own computers :)
Note also that documentation on these datagrid directories
is still sparse.

See http://iubio.bio.indiana.edu/biogrid/
    http://iubio.bio.indiana.edu/biogrid/directories/
    http://iubio.bio.indiana.edu/biogrid/directories/gridlets/

For each compute node on your test grid, do this:

   1. Install/test/locate NCBI BLAST software

   2. Download Biogridlet .class and .prop files. Edit .prop properties
      to use the biosequence databank you want.

   3. Find a query biosequence somewhere.

   4. Use Biogridlet to copy a databank subset to each node and run blast
      ($bl here stands for the directory holding your BLAST programs):
         1. node1:

java Biogridlet start=0 count=1000 | $bl/formatdb -i stdin -p F -o T -n databank1
$bl/blastall -p blastn -d databank1 -i query -m 8 -o databank1.out

         2. node2:

java Biogridlet start=1000 count=1000 | $bl/formatdb -i stdin -p F -o T -n databank2
$bl/blastall -p blastn -d databank2 -i query -m 8 -o databank2.out

         3. node3 .. n: as above, advancing start by 1000 for each node

   5. Copy blast results from each node and assemble them into a full
      result (yet to do; see NBLAST for how :)
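
As a rough illustration of step 5 (not how NBLAST does it, and not
part of Biogridlet), the tab-delimited -m 8 files can simply be
concatenated and re-sorted once they are copied to one machine.
The function name merge_hits is made up for this sketch.  It sorts
on bit score (column 12) rather than E-value (column 11), because
E-values depend on the size of the databank searched, so values
from subset searches are not directly comparable to a full-databank
run.  blastall's -z option (effective database length) is one way
to address that; it is untested in this sketch.

```shell
# merge_hits: hypothetical helper, not part of Biogridlet or NCBI BLAST.
# Arguments: one or more blastall -m 8 (tab-delimited) result files.
# Output: all hits, grouped by query id (column 1), with the highest
# bit score (column 12) first within each query.
# Assumes bit scores are written as plain decimal numbers.
merge_hits() {
    tab=$(printf '\t')
    sort -t "$tab" -k1,1 -k12,12nr "$@"
}
```

Usage, once the per-node files are in the current directory:
merge_hits databank*.out > full.out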

The runtime cost for this grid example, from a few quick tests,
is approximately the time it takes to run on one computer with a
full databank, divided by the number of nodes and subset
databanks you use.
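
To make that arithmetic concrete, here is a small sketch.  Both
function names are made up for illustration, and the estimate
ignores the time spent fetching sequences, running formatdb and
merging results, so treat it as a best case.

```shell
# node_range: hypothetical helper. Prints the start= and count= arguments
# for node number $1 (counting from 1) when each node takes $2 sequences.
node_range() {
    node=$1
    per_node=$2
    echo "start=$(( (node - 1) * per_node )) count=$per_node"
}

# estimate_seconds: best-case grid runtime in whole seconds, given the
# single-machine runtime $1 (seconds) and the number of nodes $2.
estimate_seconds() {
    echo $(( $1 / $2 ))
}
```

For example, "node_range 2 1000" gives the arguments used for node2
above, and a search that takes an hour on one machine comes out at
about 15 minutes on four nodes (estimate_seconds 3600 4).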

This test bypasses the sophistication of grid infrastructure like
Globus, GridEngine, etc. for the sake of simplicity.  It could
eventually sit somewhere between SETI@HOME and Globus
in the trade-off between simplicity and control.

-- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd@bio.indiana.edu