[Bioclusters] http://www.sistina.com/products_gfs.htm

Ivo Grosse bioclusters@bioinformatics.org
Mon, 13 May 2002 13:32:10 -0400


Hi Joe,

Thanks for your great *general* answer.


Hi Joe and Chris and others,

Let me try to make my question more *specific*:


0. We often use BLAST, and we often blast two large sets against 
each other, e.g. the human genome against the mouse genome.  In 
that example, one genome (e.g. mouse) will be the database, and we 
will chop up the human genome into, say, 101-kb pieces overlapping 
by 1 kb (as sketched below), and then throw those 30,000 101-kb 
pieces against the mouse database using SGE.  We (in our group) do 
NOT need or want Mosix.
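
For concreteness, a minimal sketch of that chopping step (Python; 
the file names and the plain-text genome format are hypothetical, 
only the 101-kb window and the 1-kb overlap are the numbers from 
above):

    # Split one long sequence into 101-kb windows overlapping by
    # 1 kb, i.e. the window start advances by 100 kb each time.
    WINDOW = 101000            # 101 kb per piece
    OVERLAP = 1000             # adjacent pieces share 1 kb
    STEP = WINDOW - OVERLAP    # 100 kb

    def chop(sequence):
        """Yield (start, piece) for each overlapping window."""
        for start in range(0, len(sequence), STEP):
            yield start, sequence[start:start + WINDOW]

    if __name__ == "__main__":
        # Hypothetical input: the genome as one plain-text sequence.
        genome = open("human_genome.txt").read().replace("\n", "")
        for i, (start, piece) in enumerate(chop(genome)):
            f = open("query_%05d.fa" % i, "w")
            f.write(">human_%d_offset_%d\n%s\n" % (i, start, piece))
            f.close()

With a ~3-Gb human genome and a 100-kb step this gives the ~30,000 
query pieces mentioned above (the last piece may be shorter).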

1. The (mouse) database will live in RAM (on each slave node), and 
the way in which we feed the database into RAM for each of the 
30,000 jobs is as follows:

- cp the database to /tmp/ of ALL of the slave nodes.

- start the 30,000 jobs through SGE, where the database is READ from 
/tmp/ (on the local node) and the output is WRITTEN to the central file 
server.

This is, of course, much faster than reading a GB-size database from 
the central file server 30,000 times.
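
As a sketch, that protocol could look like the following (Python; 
the paths and the blastall/qsub options are assumptions based on 
our setup, not anything GFS- or Mosix-specific):

    # The database is assumed to have been copied to /tmp/ on every
    # slave node beforehand (e.g. with rcp/scp in a loop over nodes).
    # Each SGE job then reads the database locally and writes its
    # output to the central (NFS-mounted) file server, so only the
    # small query and the output cross the network.
    import os

    N_JOBS = 30000
    DB = "/tmp/mouse_db"              # node-local database copy
    OUTDIR = "/central/blast_out"     # central file server

    for i in range(N_JOBS):
        query = "/central/queries/query_%05d.fa" % i
        out = "%s/hits_%05d.out" % (OUTDIR, i)
        cmd = "blastall -p blastn -d %s -i %s -o %s" % (DB, query, out)
        # SGE's qsub accepts the job script on stdin; -cwd runs the
        # job in the submission directory.
        os.system('echo "%s" | qsub -cwd' % cmd)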

2. Another group here at CSHL is currently preparing the 
installation of a new cluster, and they have some good reasons for 
choosing Mosix.  But once in a while they also need to run BLAST 
jobs of sizes similar to ours.  The question is: can Mosix + GFS + 
DFSA support a protocol similar to the one in 1.?

Best regards, Ivo


P.S.

Instead of writing N identical replicas of the database to the N 
slave nodes, one could keep just one copy of the database on 
/pvfs/, which is accessible from all of the slave nodes.  Then, 
however, the GB-size database would need to be read over the 
network 30,000 times.  Is this correct?
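
A back-of-envelope estimate of what that would cost (all numbers 
are assumptions for illustration: a 1-GB database, 100 slave 
nodes, and 100-Mbit/s Ethernet):

    # Network cost of reading one shared database copy 30,000 times
    # (the /pvfs/ scheme) vs. copying it once per node (scheme 1.).
    DB_BYTES = 1e9                  # 1-GB database (assumed)
    N_JOBS = 30000
    N_NODES = 100                   # assumed cluster size
    NET = 100e6 / 8                 # 100 Mbit/s -> 12.5 MB/s (assumed)

    shared = N_JOBS * DB_BYTES / NET    # read over the network per job
    local = N_NODES * DB_BYTES / NET    # one copy per node, local reads

    print("shared /pvfs/ copy: %.0f hours of transfer" % (shared / 3600))
    print("per-node copies:    %.1f hours of transfer" % (local / 3600))

Under these assumptions the single-copy scheme moves 30,000 x 1 GB 
= 30 TB in total, vs. 100 GB for the per-node copies.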


P.P.S.

Do you know a smarter way (than 1.) of running the BLAST jobs?