[Bioclusters] topbiocluster.org

Fri Jun 24 18:43:48 EDT 2005

James Cuff wrote:

> So let's assume this really, really simple 'pipeline' to test things, it
> takes a protein, searches the known protein database, builds a multiple
> sequence alignment, makes a model, and searches that against the
> database once more to identify further sequence and hint at the domain
> structure.  
> 
> A kinda classic annotation problem.

Curiously, this scenario is in part what BBS was designed to allow you 
to measure.  Each experiment step (or protocol step if you prefer) gets 
a section of a structured document which has a very simple flow model at 
the moment.  If your pipeline is representable as a linear graph, then 
the simple structure of the input allows you to sequentially run all 
these steps, and benchmark them.  We are thinking about how to represent 
cyclical graphs (loops), decision points etc.
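
As a rough illustration of the linear flow model described above, here is a
minimal Python sketch of running steps in sequence and timing each one.  It
is not how BBS itself is implemented; the function name run_linear_pipeline
and the toy steps are hypothetical.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical sketch: a linear pipeline is an ordered list of (label, step)
# pairs, where each step takes the previous step's output as its input.
Step = Tuple[str, Callable[[object], object]]


def run_linear_pipeline(steps: List[Step], data: object):
    """Run steps in order; return the final output and per-step timings."""
    timings = []
    for label, step in steps:
        start = time.perf_counter()
        data = step(data)
        timings.append((label, time.perf_counter() - start))
    return data, timings


if __name__ == "__main__":
    # Toy stand-ins for real stages such as a search or an alignment.
    steps = [
        ("upper", lambda seqs: [s.upper() for s in seqs]),
        ("filter", lambda seqs: [s for s in seqs if len(s) > 3]),
    ]
    result, timings = run_linear_pipeline(steps, ["mkv", "mktayiak"])
    for label, seconds in timings:
        print(f"{label}: {seconds:.6f} s")
```

Loops and decision points would need more than a flat list, which is why
only the linear case is sketched here.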

We aimed the original design to allow end users to easily model their 
workflows, as long as they were linear (think Henry Ford's comments 
about getting any color car you want, as long as it is black).  I think 
this is something we could change if there is interest/demand for it.

[...]

> Great, but hardly a difficult problem.
> 
> But now instead let's assume that testdb is genbank nr, bit harder.  In
> fact a bit of a nightmare. As of yesterday there were 2.5M sequences,
> and test.fa could also be faked up to have either 2, 200, 2000, or 20k
> sequences. 
> 
> All of a sudden you need a cluster, that puppy will fail really badly on
> one node, we all remember why we bought the clusters in the first
> place :-)

I have had situations where people have absolutely insisted that they do 
not need high performance computing systems, though they just bought a 
cluster ...  some sort of cognitive dissonance going on there.

A pipeline is a high performance computing problem.  There are many ways 
to implement it, and a few of them are even good.

[...]

> It should not matter, just get to the finish line, don't miss out any
> steps and show how many sequences were used in each step.  Seems fair?

Yes.  We can largely do this today in BBS.  Will need some tweaking, so 
if you provide something like what you did above, we should be able to 
hammer down the rough edges to get there.
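
For the "how many sequences were used in each step" part, one simple
approach, assuming each step reads and writes FASTA-formatted text, is to
count header lines.  This is only a sketch; count_fasta_records is a
hypothetical helper and not part of BBS.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__":
    # Toy input; a real run would pass an open file handle instead.
    example = [">seq1 toy protein", "MKTAYIAK", ">seq2 toy protein", "MKV"]
    print(count_fasta_records(example))
```

An open file handle can be passed directly, since iterating over it yields
lines.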

> The more I think about this the more it makes sense.  We should probably
> keep to a fairly simple set of recipes akin to the one above, it also
> makes some sort of scientific sense as it is a real experiment which is
> kinda nice...

It is much better in a sense than the baseline tests we set up for BBS.  
BBS was designed to do this, but we were relying on others to generate 
interesting/meaningful tests.

> I'm not sure we even need to get all that fancy with toolkits and the
> like, just a reference set of data and the instructions.  Run this, that
> and the other etc.  We can even evaluate the final models / results that
> are produced in the last line of the search to make sure folk don't
> cheat...  
> 
> I think as Joe said yesterday, if we can distill some other problems
> into the same form, I'll be happy to collate, write them up and supply
> data, collect results etc.  I just picked the protein example because
> I'm obsessed with them, I know there are other crunch|awk|sed|save|etc
> pipelines out there.
> 
> It's already starting to be better than Linpack for representing
> workloads on clusters...  We just need some decent recipes, and I'm a
> bad cook.

Heh... I burnt the water (baseline tests) :)

I might suggest that the groups that are interested in this provide not 
just the pipeline steps, but also the results and their own baseline 
measurement (more in a second).

If we could do something really simple like build a "Pipeline 
construction language" that looked like simple XML ...

	<pipeline name='cooking_with_protein'>
	 <stage order="initialization" nodes='master' >
	  <db_extract ...  />
	  <db_extract ...  />
	  ...

	  <data distribute="files..." nodes="all" />
	 </stage>
	 <stage order="1" nodes='all' label="blast" >
	  <run package="blast" input="input_file" output="..." nodes="all" />
	 </stage>
	 <stage order="2" nodes='all' label="fastacmd" >
	  <run package="fastacmd" input="..." output="...." nodes="all" />
	 </stage>
	 <stage order="finalize" nodes='master' >
	  <data gather="files..." nodes="all" />
	  <db_submit ...  />
	  <db_submit ...  />
	  ...

	 </stage>
	</pipeline>

then we could distribute pipelines a bit more easily.  We can do something 
like this today with the BBS format.  Please let me know if there is 
interest in this.
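
To show how a description like the one sketched above might be consumed, here
is a minimal Python example that parses a filled-in variant and puts the
stages in run order.  The file names (other than test.fa) are made-up
placeholders, the elided db_extract/db_submit elements are left out rather
than guessed, and the ordering rule ("initialization" first, numbered stages
next, "finalize" last) is an assumption about the intended semantics.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the pipeline sketch, with the stages
# deliberately shuffled to show that ordering comes from the 'order' attribute.
PIPELINE_XML = """
<pipeline name="cooking_with_protein">
 <stage order="finalize" nodes="master">
  <data gather="results.out" nodes="all" />
 </stage>
 <stage order="2" nodes="all" label="fastacmd">
  <run package="fastacmd" input="hits.txt" output="hits.fa" nodes="all" />
 </stage>
 <stage order="initialization" nodes="master">
  <data distribute="test.fa" nodes="all" />
 </stage>
 <stage order="1" nodes="all" label="blast">
  <run package="blast" input="test.fa" output="hits.txt" nodes="all" />
 </stage>
</pipeline>
"""


def stage_sort_key(stage):
    """Order stages: 'initialization', then numeric orders, then 'finalize'."""
    order = stage.get("order")
    if order == "initialization":
        return (0, 0)
    if order == "finalize":
        return (2, 0)
    return (1, int(order))


def ordered_stages(xml_text):
    """Return (order, label, [package names]) for each stage in run order."""
    root = ET.fromstring(xml_text)
    stages = sorted(root.findall("stage"), key=stage_sort_key)
    return [
        (s.get("order"), s.get("label"),
         [r.get("package") for r in s.findall("run")])
        for s in stages
    ]


if __name__ == "__main__":
    for order, label, packages in ordered_stages(PIPELINE_XML):
        print(order, label, packages)
```

A runner could then walk this list and hand each stage to whatever actually
launches the packages on the nodes.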

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615