On Fri, 2005-06-24 at 17:02 +0100, Tim Cutts wrote:
> The dream of a 1000+ node cluster entirely without NFS takes a step
> closer to reality...

whoop!

> I'd be happy to run one of James' mini pipelines on Sanger's cluster,
> if I could actually persuade Ensembl to give me a couple of hours of
> completely clear air to actually get the benchmark done. :-)

Grand.

More random thoughts. See what happens when our cluster is loaded and
it's hot outside: I start thinking, which is really dangerous, I know...
Here's the next instalment of brain dumps:

So let's assume this really, really simple 'pipeline' to test things.
It takes a protein, searches the known protein database, builds a
multiple sequence alignment, makes a model, and searches that against
the database once more to identify further sequences and hint at the
domain structure. A kinda classic annotation problem.

Here we go, this is on a 1 cpu sun:

How big a problem do we have here?

bench/run> grep '>' testdb | wc
    2395   16644  153607

Ok, how big a sequence set?

bench/run> cat test.fa
>AAN03382
KVRFADLKRRILISEEQGSAGSSRHLLKKIQAKVLKTDQEFDGLYNDLLLEMARNQIFLI
NERQVSENQQIWLRQYFKQHLRQHITPILINHDTNLVQFLKDDYTYLAVEIIRGARTDYA
LLEIPSDKVPRFVNLPPEAPRRRKPMILLDNILRYCLDDIFKGFFDYDALNAYSMKMTRD
AEYDLVTEMESSLLELMSSSLKQRLTAEPVRFVYQRDMPNEMVELLRGKLGISNYDSVIA
GGRYHNFKDFISFPNVGKANLVNKPLPRLRHIWFDGFRNGFDAIREKDVLLYYPYHTFEH
VLELLRQASFDPSVLAIKINIYRVAKDSRIIESMIHAAHNGKKVTVVVELQARFDEEANI
HWAKRLTEAGVHVIFSAPGLKIHAKLFLISRREGDDIVRYAHIG

Start the clock:

bench/run> date
Fri Jun 24 12:09:56 EDT 2005

bench/run> time ./bin/formatdb -oT -pT -i testdb
0.160u 0.027s 0:00.18 100.0% 0+0k 0+0io 0pf+0w

bench/run> time ./bin/blastall -m8 -e1e-3 -p blastp -i test.fa -d testdb | awk '{print $2}' | xargs -i ./bin/fastacmd -d testdb -t F -s {} >> test.seqs
2.798u 0.002s 0:02.80 99.6% 0+0k 0+0io 0pf+0w

bench/run> time ./bin/clustalw -infile=test.seqs -outfile=test.aln >& /dev/null
16.569u 0.000s 0:16.56 100.0% 0+0k 0+0io 0pf+0w

bench/run> time ./bin/hmmbuild test.hmm test.aln
2.798u 0.002s 0:02.80 99.6% 0+0k 0+0io 0pf+0w

bench/run> time ./bin/hmmcalibrate test.hmm
43.411u 0.018s 0:43.42 100.0% 0+0k 0+0io 0pf+0w

bench/run> time ./bin/hmmsearch test.hmm testdb > results.out
23.887u 0.054s 0:23.94 99.9% 0+0k 0+0io 0pf+0w

bench/run> date
Fri Jun 24 12:12:55 EDT 2005

Ok, so about 3 mins for this silly example, and we have all the vectors
in place for how long each step took. I had to type things in also,
sort of distribution overhead :-)

So we end up with a few vectors:

numcpu=1
cpu=sparcv9 at 360 MHz
os=Solaris
storage=idedisk
network=10MB/s ethernet
memory=256MB
qsize=1
dbsize=2395
wallclock=3mins
format=0s
blast=2.8s
clustal=16s
hmmbuild=2.8s
hmmcalibrate=43s
hmmsearch=23s

runtime=format+blast+clustal+hmmbuild+hmmcalibrate+hmmsearch
overhead=runtime/wallclock

etc. etc.

Great, but hardly a difficult problem. But now instead let's assume
that testdb is GenBank nr, a bit harder. In fact a bit of a nightmare.
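
An aside on the vectors above: the runtime and overhead lines reduce to
a little arithmetic. The sketch below is only illustrative. The step
times are the elapsed values from the `time` output in the transcript
(the vector list rounds them), and the two stamps are the `date` lines.
The names step_seconds, wallclock, runtime, overhead and typing are made
up here, and typing is an extra quantity the post does not define. Note
that with overhead=runtime/wallclock as written, the number is the
fraction of wall clock spent computing rather than the time lost.

```python
from datetime import datetime

# Elapsed seconds per step, copied from the `time` output in the transcript.
step_seconds = {
    "format": 0.18,
    "blast": 2.80,
    "clustal": 16.56,
    "hmmbuild": 2.80,
    "hmmcalibrate": 43.42,
    "hmmsearch": 23.94,
}

# The two `date` stamps from the transcript (same day, so time of day suffices).
start = datetime.strptime("12:09:56", "%H:%M:%S")
finish = datetime.strptime("12:12:55", "%H:%M:%S")

wallclock = (finish - start).total_seconds()
runtime = sum(step_seconds.values())
overhead = runtime / wallclock   # the definition given in the post
typing = wallclock - runtime     # hypothetical extra vector: time not spent computing

print(f"wallclock={wallclock:.0f}s runtime={runtime:.2f}s "
      f"overhead={overhead:.2f} typing={typing:.1f}s")
```

On these numbers the six steps account for about 89.7s of the 179s
between the two `date` calls, so roughly half of the three minutes went
on typing.
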

As of yesterday there were 2.5M sequences, and test.fa could also be
faked up to have either 2, 200, 2000, or 20k sequences. All of a sudden
you need a cluster; that puppy will fail really badly on one node. We
all remember why we bought the clusters in the first place :-)

This sort of test with wall clock gets harder. Lines like:

bench/run> time ./bin/formatdb -oT -pT -i testdb
0.160u 0.027s 0:00.18 100.0% 0+0k 0+0io 0pf+0w

would probably have to be replaced with something akin to:

bench/run> splitdb testdb
bench/run> makexml
bench/run> distribute testdb
bench/run> bsub formatdb
bench/run> loadidsinmysqldatabase (whatever)

etc. etc.

All of which takes time, but needs to be done. Again, mpiblast or
parallel versions of clustal can be used, or the hmmsearch could be
done on hardware boxes, etc. etc. It should not matter: just get to the
finish line, don't miss out any steps, and show how many sequences were
used in each step. Seems fair?

The more I think about this the more it makes sense. We should probably
keep to a fairly simple set of recipes akin to the one above. It also
makes some sort of scientific sense, as it is a real experiment, which
is kinda nice... I'm not sure we even need to get all that fancy with
toolkits and the like, just a reference set of data and the
instructions. Run this, that and the other etc. We can even evaluate
the final models / results that are produced in the last line of the
search to make sure folk don't cheat...

I think, as Joe said yesterday, if we can distill some other problems
into the same form, I'll be happy to collate them, write them up,
supply data, collect results etc. I just picked the protein example
because I'm obsessed with them; I know there are other
crunch|awk|sed|save|etc pipelines out there.

It's already starting to be better than Linpack for representing
workloads on clusters... We just need some decent recipes, and I'm a
bad cook.
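
The rule "don't miss out any steps and show how many sequences were
used in each step" could be checked mechanically. The following is a
minimal sketch in Python, not an agreed format: the step names are taken
from the vectors in the post, while StepResult, REQUIRED_STEPS,
count_fasta_records, missing_steps and summarise are hypothetical names
invented for illustration. How each step is actually run (bsub,
mpiblast, hardware boxes) is deliberately left out.

```python
from dataclasses import dataclass

# Step names follow the vectors in the post; the tuple itself is an assumption.
REQUIRED_STEPS = ("format", "blast", "clustal",
                  "hmmbuild", "hmmcalibrate", "hmmsearch")


@dataclass
class StepResult:
    seconds: float    # time taken by the step, however it was run
    sequences: int    # number of sequences the step consumed


def count_fasta_records(fasta_text: str) -> int:
    """Count FASTA records by their '>' header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def missing_steps(report: dict) -> list:
    """Return the required steps that the report does not mention."""
    return [step for step in REQUIRED_STEPS if step not in report]


def summarise(report: dict) -> dict:
    """Refuse incomplete runs, then total the time and list sequence counts."""
    missing = missing_steps(report)
    if missing:
        raise ValueError("recipe incomplete, missing: " + ", ".join(missing))
    return {
        "runtime": sum(report[step].seconds for step in REQUIRED_STEPS),
        "sequences": {step: report[step].sequences for step in REQUIRED_STEPS},
    }
```

A submitted result would then be one StepResult per step plus the wall
clock, and a run that skipped, say, hmmcalibrate would be rejected
before anyone compares timings.
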