[Bioclusters] topbiocluster.org

Fri Jun 24 12:47:16 EDT 2005

I'd like to hear more about your experience with clusterfs. I've
been here in this role (managing the technology of the bio stuff here),
and am disappointed that it's set up with nothing fancier than a bunch
of compute nodes that get their data off of an NFS file server.

I assume the general concept is that each of the nodes "donates" some
chunk of local disk space to the greater whole, and they all share the
I/O tasks? For example, we have a bunch of high performance nodes that
have 2x73GB drives. I could easily give up one drive from each
system for the cluster file environment.
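
To put rough numbers on that idea, here is a minimal back-of-the-envelope sketch. It assumes plain pooling, where usable space is just the sum of the donated drives divided by the number of copies kept of each block. The function name, the node count of 14, and the `replicas` parameter are hypothetical illustrations, not how GPFS or Lustre actually account for space.

```python
def pooled_capacity_gb(node_count, donated_gb_per_node, replicas=1):
    """Estimate usable capacity (GB) when every node donates one drive.

    Assumption: simple pooling, with each block stored `replicas` times.
    Real cluster filesystems add metadata and other overheads on top.
    """
    if node_count < 0 or donated_gb_per_node < 0 or replicas < 1:
        raise ValueError("counts must be non-negative and replicas >= 1")
    return node_count * donated_gb_per_node / replicas


if __name__ == "__main__":
    # Hypothetical: 14 nodes, each giving up one of its two 73 GB drives.
    for copies in (1, 2):
        usable = pooled_capacity_gb(14, 73, replicas=copies)
        print(f"14 nodes x 73 GB, {copies} copy/copies: {usable:.0f} GB usable")
```

Keeping two copies of everything halves the usable space, which is the trade-off I'd have to weigh against surviving the loss of a node.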

> -----Original Message-----
> From: bioclusters-bounces+brodie=mcw.edu at bioinformatics.org
> [mailto:bioclusters-bounces+brodie=mcw.edu at bioinformatics.org]
> On Behalf Of Tim Cutts
> Sent: Friday, June 24, 2005 11:03 AM
> To: Clustering, compute farming & distributed computing in life
> science informatics
> Subject: Re: [Bioclusters] topbiocluster.org
> 
> 
> On 24 Jun 2005, at 4:06 pm, Brodie, Kent wrote:
> 
> > Taking into account the whole pipeline (including networked I/O,
> > formatdb, etc) is both a great idea and will give much more
> > realistic results.
> >
> > I also think that a collection of data would be a catalyst for great
> > future discussions and questions, e.g., "how the heck did you get
> > your formatdb to run so fast on the 20K data?" The responses would
> > then give the rest of us who may be a bit behind in these things
> > great insight and ideas.
> >
> > I'd be VERY interested to see if anyone has results from using
> > cluster filesystems, for example.
> 
> Cluster filesystems have *drastically* cut our data distribution
> time.  We can distribute a new multi-GB genome data set to all the
> machines that use cluster filesystems in a few minutes.  The old RLX
> blades, which have to rely on the hierarchy of rsync processes to
> which James referred, trail in a dismal few hours later.
> 
> They've also increased performance when running jobs; the machines
> can suck data over the filesystem's gigabit Ethernet faster than the
> individual spindles could supply the data locally.
> 
> We've been using cluster filesystems (specifically, GPFS) in
> production since October 2003, for the static datasets; blastables
> and so on.  This is going to continue, and we've been so pleased with
> it as a method, that it's going to be extended.  The number of nodes
> per cluster filesystem (currently 14) will be expanded, hopefully to
> the entire cluster.  Scratch filesystems for the cluster will be
> moved to GPFS or Lustre, rather than NFS, which is where they are
> currently.  We're not wedded to GPFS - Lustre looks good too.
> 
> LSF is already running off a GPFS cluster filesystem so that it can
> fail over without the performance sucking because of NFS (yay!  No
> more LSF masters on Tru64! Woohoo!)
> 
> The dream of a 1000+ node cluster entirely without NFS takes a step
> closer to reality...
> 
> I'd be happy to run one of James' mini pipelines on Sanger's cluster,
> if I could actually persuade Ensembl to give me a couple of hours of
> completely clear air to actually get the benchmark done.  :-)
> 
> Tim
> 
> --
> Dr Tim Cutts
> Informatics Systems Group, Wellcome Trust Sanger Institute
> GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233