[Bioclusters] topbiocluster.org

Fri Jun 24 12:48:15 EDT 2005

On Fri, 2005-06-24 at 17:02 +0100, Tim Cutts wrote:

> The dream of a 1000+ node cluster entirely without NFS takes a step  
> closer to reality...

whoop!

> I'd be happy to run one of James' mini pipelines on Sanger's cluster,  
> if I could actually persuade Ensembl to give me a couple of hours of  
> completely clear air to actually get the benchmark done.  :-)

Grand.  

More random thoughts.  See what happens when our cluster is loaded and
it's hot outside: I start thinking, which is really dangerous, I know...

Here's the next instalment of brain dumps:

So let's assume a really, really simple 'pipeline' to test things.  It
takes a protein, searches the known protein database, builds a multiple
sequence alignment, makes a model, and searches that against the
database once more to identify further sequences and hint at the domain
structure.

A kinda classic annotation problem.

Here we go, this is on a 1 cpu sun:

How big a problem do we have here?

bench/run> grep '>' testdb | wc
   2395   16644  153607

Ok, how big a sequence set?

bench/run> cat test.fa 
>AAN03382  
KVRFADLKRRILISEEQGSAGSSRHLLKKIQAKVLKTDQEFDGLYNDLLLEMARNQIFLI
NERQVSENQQIWLRQYFKQHLRQHITPILINHDTNLVQFLKDDYTYLAVEIIRGARTDYA
LLEIPSDKVPRFVNLPPEAPRRRKPMILLDNILRYCLDDIFKGFFDYDALNAYSMKMTRD
AEYDLVTEMESSLLELMSSSLKQRLTAEPVRFVYQRDMPNEMVELLRGKLGISNYDSVIA
GGRYHNFKDFISFPNVGKANLVNKPLPRLRHIWFDGFRNGFDAIREKDVLLYYPYHTFEH
VLELLRQASFDPSVLAIKINIYRVAKDSRIIESMIHAAHNGKKVTVVVELQARFDEEANI
HWAKRLTEAGVHVIFSAPGLKIHAKLFLISRREGDDIVRYAHIG

Start the clock:

bench/run> date
Fri Jun 24 12:09:56 EDT 2005

bench/run> time ./bin/formatdb -oT -pT -i testdb
0.160u 0.027s 0:00.18 100.0%    0+0k 0+0io 0pf+0w

bench/run> time ./bin/blastall -m8 -e1e-3 -p blastp -i test.fa -d testdb | awk '{print $2}' | xargs -i ./bin/fastacmd -d testdb -t F -s {} >> test.seqs
2.798u 0.002s 0:02.80 99.6%     0+0k 0+0io 0pf+0w

bench/run> time ./bin/clustalw -infile=test.seqs -outfile=test.aln >& /dev/null
16.569u 0.000s 0:16.56 100.0%   0+0k 0+0io 0pf+0w

bench/run> time ./bin/hmmbuild test.hmm test.aln
2.798u 0.002s 0:02.80 99.6%     0+0k 0+0io 0pf+0w

bench/run> time ./bin/hmmcalibrate test.hmm 
43.411u 0.018s 0:43.42 100.0%   0+0k 0+0io 0pf+0w

bench/run> time ./bin/hmmsearch test.hmm testdb > results.out
23.887u 0.054s 0:23.94 99.9%    0+0k 0+0io 0pf+0w

bench/run> date
Fri Jun 24 12:12:55 EDT 2005

Ok, so about 3 mins for this silly example, and we have a timing for
each step.  I had to type things in too, which is a sort of
distribution overhead :-)  So we end up with a few vectors:

numcpu=1
cpu=sparcv9 at 360 MHz
os=Solaris
storage=idedisk
network=10MB/s ethernet
memory=256MB

qsize=1
dbsize=2395

wallclock=3mins

format=0s
blast=2.8s
clustal=16s
hmmbuild=2.8s
hmmcalibrate=43s
hmmsearch=23s

runtime=format+blast+clustal+hmmbuild+hmmcalibrate+hmmsearch
overhead=wallclock-runtime

etc. etc.
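
As a rough sketch of that bookkeeping, here are the numbers from the run
above plugged into a few lines of Python.  The step times are the
elapsed times reported by `time`, and the wall clock is the gap between
the two `date` calls.  Taking overhead as wallclock minus runtime is an
assumption on my part; the runtime/wallclock ratio is reported
separately under the made-up name `utilisation`.

```python
from datetime import datetime

# Elapsed seconds per step, copied from the `time` output in the session.
steps = {
    "format": 0.18,
    "blast": 2.80,
    "clustal": 16.56,
    "hmmbuild": 2.80,
    "hmmcalibrate": 43.42,
    "hmmsearch": 23.94,
}

# Wall clock is the gap between the two `date` calls (same day).
start = datetime.strptime("12:09:56", "%H:%M:%S")
end = datetime.strptime("12:12:55", "%H:%M:%S")
wallclock = (end - start).total_seconds()

runtime = sum(steps.values())        # time spent inside the tools
overhead = wallclock - runtime       # assumption: time spent outside them
utilisation = runtime / wallclock    # fraction of wall clock doing work

print(f"wallclock={wallclock:.0f}s runtime={runtime:.1f}s "
      f"overhead={overhead:.1f}s utilisation={utilisation:.2f}")
```

On these numbers roughly half of the three minutes went on typing rather
than computing, which is the sort of thing a cluster submission would
want to report as well.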

Great, but hardly a difficult problem.

But now let's instead assume that testdb is GenBank nr: a bit harder.
In fact, a bit of a nightmare.  As of yesterday it held 2.5M sequences,
and test.fa could also be faked up to have 2, 200, 2000, or 20k
sequences.
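
One way to fake up a query set of a given size, sketched in Python, is
to sample sequences at random from the database itself.  That sampling
approach is an assumption, not something settled here, and the function
names (`parse_fasta`, `make_query_set`, `to_fasta`) and the tiny demo
database are made up for illustration.  A fixed seed keeps the query set
identical between sites.

```python
import random

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def make_query_set(records, n, seed=0):
    """Pick n distinct records at random; the seed makes it reproducible."""
    if n > len(records):
        raise ValueError("asked for more sequences than the database holds")
    return random.Random(seed).sample(records, n)

def to_fasta(records, width=60):
    """Format records as FASTA text with fixed-width sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Tiny made-up database, only to show the calls.
demo_db = ">s1\nMKV\n>s2\nLLE\n>s3\nGAR\n>s4\nPQR\n"
queries = make_query_set(parse_fasta(demo_db), 2, seed=42)
query_fasta = to_fasta(queries)
```

For the real thing the same calls would be made with n set to 2, 200,
2000, and 20000 against the full database.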

All of a sudden you need a cluster; that puppy will fail really badly on
one node.  We all remember why we bought the clusters in the first
place :-)

This sort of wall-clock test gets harder at that scale.  Lines like:

bench/run> time ./bin/formatdb -oT -pT -i testdb
0.160u 0.027s 0:00.18 100.0%    0+0k 0+0io 0pf+0w

would probably have to be replaced with something akin to:

bench/run> splitdb testdb
bench/run> makexml
bench/run> distribute testdb
bench/run> bsub formatdb
bench/run> loadidsinmysqldatabase (whatever)

etc. etc.  All of which takes time, but needs to be done.  Likewise,
mpiBLAST or parallel (//) versions of clustal can be used, or the
hmmsearch could be done on hardware boxes, etc. etc.
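
To make the splitting step concrete, here is a minimal Python sketch of
what something like the made-up `splitdb` command above might do: deal
the sequences out into a fixed number of chunks so that each chunk holds
a similar number of residues.  The greedy longest-first balancing and
the name `split_records` are assumptions for illustration; a real site
might split by sequence count or by file size instead.

```python
import heapq

def split_records(records, n_chunks):
    """Split (header, sequence) records into n_chunks balanced by residues.

    Greedy scheme: take sequences longest first and always add the next
    one to whichever chunk currently holds the fewest residues.
    """
    if n_chunks < 1:
        raise ValueError("need at least one chunk")
    chunks = [[] for _ in range(n_chunks)]
    # Heap of (residues so far, chunk index); the lightest chunk pops first.
    heap = [(0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        load, idx = heapq.heappop(heap)
        chunks[idx].append((header, seq))
        heapq.heappush(heap, (load + len(seq), idx))
    return chunks

# Made-up records of 10, 8, 6, 4 and 2 residues, split across two nodes.
demo = [("a", "A" * 10), ("b", "A" * 8), ("c", "A" * 6),
        ("d", "A" * 4), ("e", "A" * 2)]
parts = split_records(demo, 2)
loads = [sum(len(seq) for _, seq in part) for part in parts]
```

Whatever scheme is used, the time it takes belongs in the wall clock
along with the distribution and formatting steps.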

It should not matter how you do it: just get to the finish line, don't
miss out any steps, and show how many sequences were used in each step.
Seems fair?
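
A minimal sketch of what checking a submission against that rule could
look like, in Python.  The record layout (step name mapped to `seconds`
and `nseq`) and the function name `check_submission` are made up for
illustration; the step names are the ones from the recipe above.  The
example record is deliberately incomplete and uses only the two steps
whose sequence counts are known from the run above (dbsize=2395 for
format, qsize=1 for blast).

```python
# Every step of the recipe has to be reported.
REQUIRED_STEPS = ["format", "blast", "clustal",
                  "hmmbuild", "hmmcalibrate", "hmmsearch"]

def check_submission(results):
    """Return a list of problems; an empty list means the record is complete.

    `results` maps step name -> {"seconds": float, "nseq": int}.
    """
    problems = []
    for step in REQUIRED_STEPS:
        entry = results.get(step)
        if entry is None:
            problems.append("missing step: " + step)
            continue
        if entry.get("seconds") is None or entry["seconds"] < 0:
            problems.append("no timing for step: " + step)
        if entry.get("nseq") is None or entry["nseq"] < 1:
            problems.append("no sequence count for step: " + step)
    return problems

# Incomplete on purpose: only the first two steps are filled in.
partial = {
    "format": {"seconds": 0.18, "nseq": 2395},
    "blast": {"seconds": 2.80, "nseq": 1},
}
problems = check_submission(partial)
for p in problems:
    print(p)
```

How each site gets its numbers is left open; the check only cares that
every step is present with a time and a sequence count.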

The more I think about this the more it makes sense.  We should probably
keep to a fairly simple set of recipes akin to the one above.  It also
makes some sort of scientific sense, as it is a real experiment, which
is kinda nice...

I'm not sure we even need to get all that fancy with toolkits and the
like, just a reference set of data and the instructions.  Run this, that
and the other etc.  We can even evaluate the final models / results that
are produced in the last line of the search to make sure folk don't
cheat...  

I think, as Joe said yesterday, if we can distill some other problems
into the same form, I'll be happy to collate them, write them up, supply
data, collect results, etc.  I just picked the protein example because
I'm obsessed with them; I know there are other crunch|awk|sed|save|etc
pipelines out there.

It's already starting to be better than Linpack for representing
workloads on clusters...  We just need some decent recipes, and I'm a
bad cook.