On 24 Jun 2005, at 4:06 pm, Brodie, Kent wrote:

> Taking into account the whole pipeline (including networked I/O,
> formatdb, etc) is both a great idea and will give much more realistic
> results.
>
> I also think that a collection of data would be a catalyst for great
> future discussions and questions, e.g. "how the heck did you get your
> formatdb to run so fast on the 20K data?", and the responses would
> then give the rest of us who may be a bit behind in these things
> great insight and ideas.
>
> I'd be VERY interested to see if anyone has results from using cluster
> filesystems, for example.....

Cluster filesystems have *drastically* cut our data distribution time.
We can distribute a new multi-GB genome data set to all the machines
that use cluster filesystems in a few minutes.  The old RLX blades,
which have to rely on the hierarchy of rsync processes to which James
referred, trail in a dismal few hours later.

They've also increased performance when running jobs; the machines can
suck data over the filesystem's gigabit Ethernet faster than the
individual local spindles could supply it.

We've been using cluster filesystems (specifically GPFS) in production
since October 2003, for the static datasets: blastables and so on.
This is going to continue, and we've been so pleased with it as a
method that it's going to be extended.  The number of nodes per
cluster filesystem (currently 14) will be expanded, hopefully to the
entire cluster.  Scratch filesystems for the cluster will be moved
from NFS, where they currently live, to GPFS or Lustre.  We're not
wedded to GPFS - Lustre looks good too.  LSF is already running off a
GPFS cluster filesystem so that it can fail over without the
performance sucking because of NFS (yay!  No more LSF masters on
Tru64!  Woohoo!)

The dream of a 1000+ node cluster entirely without NFS takes a step
closer to reality...

I'd be happy to run one of James' mini pipelines on Sanger's cluster,
if I could persuade Ensembl to give me a couple of hours of completely
clear air to actually get the benchmark done.  :-)

Tim

--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233
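
P.S.  For the benchmark itself, something along these lines is what I
have in mind - time each stage of a mini pipeline separately, so the
staging and formatdb costs show up in the numbers rather than just the
search.  This is only a sketch, not James' actual harness; the
hostname, paths and database names are invented for illustration.

    #!/usr/bin/env python
    # Sketch of a whole-pipeline benchmark: time data staging, the
    # formatdb index build, and the BLAST search as separate stages,
    # so I/O cost is visible, not just search time.  All hosts and
    # paths below are made up for illustration.
    import subprocess
    import sys
    import time

    DB    = "/scratch/bench/est_human.fa"   # local copy of the dataset
    QUERY = "/scratch/bench/queries.fa"     # query sequences

    stages = [
        # staging cost is where NFS vs rsync vs GPFS really differs
        ("stage",    ["rsync", "-a", "fileserver:/data/est_human.fa", DB]),
        ("formatdb", ["formatdb", "-i", DB, "-p", "F"]),
        ("blast",    ["blastall", "-p", "blastn", "-d", DB,
                      "-i", QUERY, "-o", "/dev/null"]),
    ]

    for label, cmd in stages:
        t0 = time.time()
        if subprocess.call(cmd) != 0:
            sys.exit("%s stage failed" % label)
        print("%-8s %7.1f s" % (label, time.time() - t0))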
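
P.P.S.  For anyone who hasn't seen the rsync hierarchy trick the old
blades depend on, the idea is a fan-out, roughly as below: the master
seeds a handful of first-tier nodes, and each of those then re-serves
the data to a slice of the remaining nodes, so no single machine has
to push to all N blades at once.  Again, hostnames and paths are
invented, and this is not our actual script.

    #!/usr/bin/env python
    # Sketch of a two-tier rsync fan-out for distributing a dataset
    # to many blades.  Hostnames and paths are illustrative only.
    import subprocess

    SRC   = "/data/blast/"                           # dataset to distribute
    TIER1 = ["blade001", "blade002", "blade003", "blade004"]
    REST  = ["blade%03d" % n for n in range(5, 101)]

    def rsync(src_host, dst_host):
        # log in to dst_host and have it pull the data from src_host
        return subprocess.Popen(
            ["ssh", dst_host,
             "rsync", "-a", "%s:%s" % (src_host, SRC), SRC])

    # Tier 1: seed a few nodes from the master copy, in parallel.
    for p in [rsync("master", h) for h in TIER1]:
        p.wait()

    # Tier 2: each seeded node feeds an equal slice of the rest.
    procs = [rsync(TIER1[i % len(TIER1)], host)
             for i, host in enumerate(REST)]
    for p in procs:
        p.wait()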