On Fri, 24 Jun 2005, Tim Cutts wrote:

> Ultimately, of course, that final reduction of the vector to a single
> number has to be a site-specific formula with weightings for each
> element of the vector determined by the requirements of the
> organisation.

(Bit long, this post, but it might be worth a read.)

This is true. So I was looking at two different things yesterday after
sending this post in:

- the connections to the 'website'
- my low-coverage genome sequence analysis running on our farm

I saw two things that are vectors, for want of a better word.

One: there is interest in this concept. There were 99 unique hosts that
looked at the page, which is ca. 16% of the list so far; I think we are
about 600 on the list:

# egrep '24\/Jun|23\/Jun' access_log | awk '{print $1}' | sort -u | wc
     99      99    1362

The second: I was waiting for ca. 500,000 jobs to complete in my real job
(sigh). We have a moderately sized cluster of ca. 200 CPUs, and I'm
thinking to myself, "you know, this would probably run a whole lot faster
if we were to run it on the Sanger uber-compute". Then I also thought,
you know, it might not. It's a fairly I/O-bound blastz application: it
hits the network and file systems pretty hard, and has a bit of hairy
parsing and a bit of database action later on. So it's not always as
clear as it would seem off the bat.

This made me think of 'clusters', rather than anything else. ibt is
great, but the focus is on a single instance running many codes. In real
life we split our workloads up into bits and then run. We could, between
us, come up with a few mini pipelines that each of us could run, and show
the results as a sortable table/list that ends in a final vector, i.e.
how long it takes to complete all the tasks.

Here's an example (I've sketched a crude timing harness for it further
down):

- start clock
- take 20k refseqs against chr1
- find top hits, compare against nr.genbank
- extract out all significant matches
- make 20k multiple sequence alignments of the matches
- hmm calibrate all 20k
- re-search nr.genbank with the models from the previous step
- stop clock

Now, one can see that there are various algorithms one could use at each
stage: blat, blastz, blast, clustal, msa, hmm, etc. This sort of micro
pipeline is what a lot of folk run, and there are always faster ways to
do certain steps.

Now here's the bit I like. One can 'scale' this approach depending on the
size of one's farm. For instance, I'd pick various classes of clusters, a
bit like Formula 1 car racing (I guess I could also say NASCAR, but the
word makes me shiver). Each class could then be represented:

- A Class - 50k sequences
- B Class - 2k sequences
- C Class - 100 sequences

I think the 'race' concept could work, as long as all the steps are
included. I know that 50k jobs over NFS to 2000 nodes is going to crap
out and die, so one would have to include the distribution steps in the
process, formatdbs etc.
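To make that concrete, here's a very rough sketch of the sort of dodgy
shell script I'd wrap around such a pipeline to capture per-stage
timings. The stage scripts and file names (format_and_distribute.sh,
refseq_20k.fa and friends) are only placeholders; each site would swap
in its own blat/blastz/blast, aligner, hmm tools and DRM submission:

#!/bin/sh
# crude timing harness for the mini pipeline; every command below is a
# stand-in for whatever a given site actually runs at that stage

QUERY=refseq_20k.fa        # the 20k refseqs
TARGET=chr1.fa             # chromosome 1
DB=nr.genbank              # database to re-search with the models

stage () {
    # run one stage, append its name and wallclock (seconds) to times.txt
    name=$1; shift
    start=`date +%s`
    "$@"
    end=`date +%s`
    echo "$name `expr $end - $start`" >> times.txt
}

: > times.txt                                    # start clock
stage format    ./format_and_distribute.sh $TARGET $DB
stage search    ./search.sh $QUERY $TARGET       # blat/blastz/blast vs chr1
stage tophits   ./top_hits_vs_nr.sh              # compare against nr.genbank
stage extract   ./extract_significant.sh
stage align     ./align_matches.sh               # 20k multiple alignments
stage hmm       ./build_and_calibrate_hmms.sh
stage research  ./search_nr_with_models.sh
awk '{t += $2} END {print "wallclock", t}' times.txt   # stop clock (seconds)

Each site's times.txt then holds exactly the per-stage numbers I'd want
to put side by side.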
I'd then propose I build a set of tables that can be sorted depending on
what you care about:

200k sequence set:

name   | formatting | blasting | aligns  | parsing | Num CPU | wallclock
       | (seconds)  | (hours)  | (hours) | (hours) |         | (hours)
-------+------------+----------+---------+---------+---------+----------
broad  |          3 |     1.23 |       1 |       6 |     200 |       400
sanger |    400,000 |        2 |      70 |       1 | 2.6e+40 |        75
ucsc   |          1 |      0.4 |     0.2 |       1 |       1 |       0.6

2k sequence set:

name   | formatting | blasting | aligns  | parsing | Num CPU | wallclock
       | (seconds)  | (hours)  | (hours) | (hours) |         | (hours)
-------+------------+----------+---------+---------+---------+----------
broad  |          1 |     1.23 |       1 |       6 |     200 |         4
sanger |    400,000 |      0.2 |     0.3 |       1 | 2.6e+40 |       0.3
ucsc   |        0.1 |      0.1 |     0.2 |       1 |       1 |       0.6

etc.

The data is fake, but the basic idea is that Jim Kent's group would use
blat, fast C parsers, and his own custom DRM software to distribute the
jobs; Tim would have to distribute data and code to stop the cluster from
melting; I'd use a dodgy shell script; etc. etc. I guess you get the
picture.

So in theory, all things being equal, we would need a set of data, define
the number of sequences for each class, and say what to search and how
much of it. Then start the engines and grab the results for each step.
This type of mini pipeline approach should capture I/O, DRM and compute,
with the 'smart' factor of selecting the right algorithms, distribution
code, etc. etc.

I remember taking part in the CASP structure prediction experiment
(contest), and the community worked well there, trying to get the best
prediction in a short space of time. So if we were to come up with some
decent 'pipelines', would this be a good idea? We could then have a
pseudo 'contest' to see who can get to the finish line fastest?
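As a strawman for collecting the rows, each site could just dump one
pipe-separated line per run, and anyone can then re-sort the table on
whichever column they care about. The .row files here are hypothetical
stand-ins for whatever each site actually reports:

# one row per site: name|formatting|blasting|aligns|parsing|cpus|wallclock
cat broad.row sanger.row ucsc.row > results.txt
sort -t'|' -k7 -n results.txt    # rank by total wallclock
sort -t'|' -k2 -n results.txt    # or by formatting/distribution time

Best,
j.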