[Bioclusters] topbiocluster.org

James Cuff jcuff at broad.mit.edu
Fri Jun 24 10:46:17 EDT 2005

On Fri, 24 Jun 2005, Tim Cutts wrote:

> Ultimately, of course, that final reduction of the vector to a single  
> number has to be a site-specific formula with weightings for each  
> element of the vector determined by the requirements of the  
> organisation.

(bit long this post, but might be worth a read)

This is true.  So after sending that post, I was looking at two different
things yesterday:

- the connections to the 'website'
- my low coverage genome sequence analysis running on our farm

I saw two things that are vectors, for want of a better word.  One: there
is interest in this concept.  There were 99 unique hosts that looked at the
page, which is ca 16% of the list so far; I think we are about 600 on the
list:
# egrep '24\/Jun|23\/Jun' access_log | awk '{print $1}' | sort -u | wc
      99      99    1362

The second one was waiting for ca 500,000 jobs to complete in my real
job (sigh).

We have a moderate sized cluster of ca 200 cpus, and I'm thinking to
myself, "you know this would probably run a whole lot faster if we were to
run it on the Sanger uber compute".

Then I also thought, you know, it might not.  It's a fairly I/O bound
blastz application, and it hits networks and file systems pretty hard, has
a bit of hairy parsing and a bit of database action later on.  So it's not
always as clear as it would seem off the bat.  

This made me think of 'clusters', rather than anything else.  It is
great, but the focus is on a single instance running many codes.  In real
life we split our workloads up into bits and then run.
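That splitting step can be as simple as carving a FASTA file into per-job chunks. A toy sketch, where the file names and the chunk size of 2 are made up (in real life the chunk size would match the farm's scheduling sweet spot):

```shell
# Toy stand-in for a real query set; in practice this would be the
# full 50k-sequence FASTA file.
printf '>seq%d\nACGTACGTACGT\n' 1 2 3 4 5 > queries.fa

# Carve it into chunks of 2 sequences each, one file per farm job, so
# each job reads one small file rather than the whole set over NFS.
awk '/^>/ { n++ } { print > sprintf("chunk_%02d.fa", int((n-1)/2)) }' queries.fa
```

The resulting chunk_00.fa, chunk_01.fa, chunk_02.fa would then get submitted as separate jobs.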

We could between us come up with a few mini pipelines that each of us
could run, and show the results as a sortable table/list with a final
vector, i.e. how long it takes to complete all the tasks.  Here's an
example:
- start clock
- take 20k refseqs against chr1
- find top hits, compare against nr.genbank
- extract out all significant matches
- make 20k multiple sequence alignments of matches
- hmm calibrate all 20k
- re-search nr.genbank with models from previous step
- stop clock
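The steps above might be driven by a shell harness something like this. Every tool and file name here is a stand-in (blastall, clustalw, hmmcalibrate, hmmsearch, and the inputs are assumptions, since each site would plug in its own); only the timing scaffold matters:

```shell
#!/bin/sh
# Sketch of a race harness for the steps above.  All tool and file
# names are placeholders -- substitute your own blast/align/hmm tools.

run_stage () {
    # time one stage and append "<name> <elapsed-seconds>" to clock.log
    name=$1; shift
    start=$(date +%s)
    "$@" 2>/dev/null || true    # a missing tool just records ~0 seconds
    end=$(date +%s)
    echo "$name $((end - start))" >> clock.log
}

: > clock.log                                                 # start clock
run_stage blast    blastall -p blastn -i refseq_20k.fa -d chr1
run_stage tophits  sh -c 'sort -rn -k12 hits.out | head -n 20000 > top.out'
run_stage align    clustalw -infile=matches.fa    # 20k MSAs in real life
run_stage hmmcal   hmmcalibrate models.hmm
run_stage research hmmsearch models.hmm nr.genbank
awk '{s += $2} END {print "total:", s, "seconds"}' clock.log  # stop clock
```

The per-stage lines in clock.log are what would feed the table columns below, and the awk total is the final vector.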

Now, one can see that there are various algorithms one could use at each
stage, blat, blastz, blast, clustal, msa, hmm etc.  This sort of micro
pipeline is what a lot of folk run, and there are always faster ways to do
certain steps.  

Now here's the bit I like.  One can 'scale' this approach depending on the
size of one's farm.  For instance I'd pick various classes of clusters, a
bit like Formula 1 racing (I guess I could also say NASCAR, but the word
makes me shiver).  Each class could then be represented:

- A Class - 50k sequences
- B Class - 2k sequences
- C Class - 100 sequences

I think the 'race' concept could work, as long as all the steps are
included.  I know that 50k jobs over NFS to 2000 nodes is going to crap
out and die, so one would have to include distribution steps in the
process, formatdbs etc.  I'd then propose I build a set of tables that
can be sorted depending on what you care about:
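The distribution step itself might be sketched like this. The node names, paths, and the rsync/ssh approach are all assumptions (formatdb is NCBI's old BLAST database indexer), and unreachable nodes are simply reported rather than killing the run:

```shell
# Hypothetical pre-run staging: push the database to each node's local
# scratch disk and index it there, so 2000 jobs don't all hammer one
# NFS server.  Node names and paths are made up.
: > stage.log
for node in node001 node002 node003; do
    if rsync -a nr.genbank "$node":/scratch/db/ 2>/dev/null &&
       ssh "$node" 'formatdb -i /scratch/db/nr.genbank -p F' 2>/dev/null
    then
        echo "staged $node"  >> stage.log
    else
        echo "skipped $node" >> stage.log
    fi
done
cat stage.log
```

Whatever time this takes belongs in the clock too, which is why the tables below carry a formatting column.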

200k sequence set:

name   | formatting | blasting | aligns  | parsing | Num CPU | wallclock
       | (seconds)  | (hours)  | (hours) | (hours) |         | (hours)
broad  |          3 |     1.23 |       1 |       6 |     200 |       400
sanger |    400,000 |        2 |      70 |       1 | 2.6e+40 |        75
ucsc   |          1 |      0.4 |     0.2 |       1 |       1 |       0.6

2k sequence set:

name   | formatting | blasting | aligns  | parsing | Num CPU | wallclock
       | (seconds)  | (hours)  | (hours) | (hours) |         | (hours)
broad  |          1 |     1.23 |       1 |       6 |     200 |         4
sanger |    400,000 |      0.2 |     0.3 |       1 | 2.6e+40 |       0.3
ucsc   |        0.1 |      0.1 |     0.2 |       1 |       1 |       0.6
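A first cut at the "sortable" part: keep the results as a flat pipe-delimited file and let sort(1) do the work. The numbers are the made-up 200k-set figures from the table above:

```shell
# Results as name|format|blast|align|parse|cpus|wallclock, one row per site
cat > results.txt <<'EOF'
broad|3|1.23|1|6|200|400
sanger|400000|2|70|1|2.6e+40|75
ucsc|1|0.4|0.2|1|1|0.6
EOF

# sort by wallclock (field 7), fastest first -- ucsc comes out on top
sort -t'|' -k7 -n results.txt
```

Sorting on a different column (CPUs, blast time) is just a different -k, which is exactly the "weightings depend on what you care about" point.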


The data is fake, but the basic idea is that Jim Kent's group would use
blat, fast C parsers, and his own custom DRM software to distribute the
jobs.  Tim would have to distribute data and code to stop the cluster from
melting...  I'd use a dodgy shell script etc. etc.

I guess you get the picture.

So in theory, all things being equal, we would need a set of data, define
the number of sequences for each class, and say what to search and how
much of it.  Then start the engines and grab the results for each step.

This type of mini pipeline approach should capture I/O, DRM and compute,
with the 'smart' factor of selecting the right algorithms, distribution
code etc. etc.

I remember taking part in the CASP structure prediction experiment
(contest), and the community worked well there trying to get the best
prediction in a short space of time.  

So if we were to come up with some decent 'pipelines', would this be a
good idea?  We could then have a pseudo 'contest' to see who can get to
the finish line fastest.
