I'd like to hear more about your experience with cluster filesystems.
I've been here in this role (managing the technology of the bio stuff
here), and am disappointed that it's set up with nothing fancier than a
bunch of compute nodes that get their data off of an NFS file server.

I assume the general concept is that each of the nodes "donates" some
chunk of local disk space to the greater whole, and they all share the
I/O tasks?

For example, we have a bunch of high performance nodes that have
2x73GB drives... I could easily give up one drive from each system for
the cluster file environment...

> -----Original Message-----
> From: bioclusters-bounces+brodie=mcw.edu at bioinformatics.org
> [mailto:bioclusters-bounces+brodie=mcw.edu at bioinformatics.org] On Behalf
> Of Tim Cutts
> Sent: Friday, June 24, 2005 11:03 AM
> To: Clustering, compute farming & distributed computing in life science
> informatics
> Subject: Re: [Bioclusters] topbiocluster.org
>
>
> On 24 Jun 2005, at 4:06 pm, Brodie, Kent wrote:
>
> > Taking into account the whole pipeline (including networked I/O,
> > formatdb, etc) is both a great idea and will give much more realistic
> > results.
> >
> > I also think that a collection of data would be a catalyst for great
> > future discussions and questions, e.g., "how the heck did you get your
> > formatdb to run so fast on the 20K data?", the responses would then
> > give the rest of us who may be a bit behind in these things great
> > insight and ideas.
> >
> > I'd be VERY interested to see if anyone has results from using cluster
> > filesystems, for example.....
>
> Cluster filesystems have *drastically* cut our data distribution
> time. We can distribute a new multi-GB genome data set to all the
> machines that use cluster filesystems in a few minutes. The old RLX
> blades, which have to rely on the hierarchy of rsync processes to
> which James referred, trail in a dismal few hours later.
>
> They've also increased performance when running jobs; the machines
> can suck data over the filesystem's GB ethernet faster than the
> individual spindles could supply the data locally.
>
> We've been using cluster filesystems (specifically, GPFS) in
> production since October 2003, for the static datasets; blastables
> and so on. This is going to continue, and we've been so pleased with
> it as a method, that it's going to be extended. The number of nodes
> per cluster filesystem (currently 14) will be expanded, hopefully to
> the entire cluster. Scratch filesystems for the cluster will be
> moved to GPFS or Lustre, rather than NFS, which is where they are
> currently. We're not wedded to GPFS - Lustre looks good too.
>
> LSF is already running off a GPFS cluster filesystem so that it can
> fail over without the performance sucking because of NFS (yay! No
> more LSF masters on Tru64! Woohoo!)
>
> The dream of a 1000+ node cluster entirely without NFS takes a step
> closer to reality...
>
> I'd be happy to run one of James' mini pipelines on Sanger's cluster,
> if I could actually persuade Ensembl to give me a couple of hours of
> completely clear air to actually get the benchmark done. :-)
>
> Tim
>
> --
> Dr Tim Cutts
> Informatics Systems Group, Wellcome Trust Sanger Institute
> GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters