On Thu, 2003-01-16 at 23:26, Chris Dagdigian wrote:

> The biggest performance bottleneck in 'bioclusters' is usually disk I/O
> throughput. Bio people tend to do lots of things that involve streaming
> massive text and binary files through the CPU and RAM (think running a
> blast search). The speed of your storage becomes the rate-limiting
> performance bottleneck. Often there will be terabytes of this sort of
> data lying around, so the "/data" volume is usually an NFS mount.

Note: moving to gigabit simply moves the pain and stress to another location in the cluster; it doesn't "solve" this problem. You have to be careful in choosing which problems you wish to deal with...

> If disk I/O is not your bottleneck, then memory speed and size will
> likely be the next bottleneck. Some applications like blast and sequence
> clustering algorithms will always be better off with as much physical
> RAM as you can cram into a box. Other applications are rate-limited by
> memory access speeds, which is why Joe recommends fast DDR memory for
> users who need high mem performance.

I tell everyone to think about the highly over-simplified equation for the time it takes to move some data and compute with it:

	T(total) = T(moving-data) + T(computing-with-data)

The T(moving-data) portion may be simply represented by:

	T(moving-data) = memory_latency + (size_of_data_to_move / memory_bandwidth)

If you go walking randomly through memory, the latency will limit you. If you go walking with a cache-friendly memory access pattern, the memory bandwidth will limit you. DDR buys you good latency and good bandwidth. RDRAM buys you OK latency and great bandwidth. I have heard a quote attributed to Hennessy at Stanford: "You can always buy bandwidth, but latency is forever..."

The computing time is also interesting. The net idea is that MHz is only part of the performance picture, and when you work on real problems you quickly discover that the other aspects matter more than MHz.

[...]
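To make the latency-vs-bandwidth point concrete, here is a minimal back-of-the-envelope sketch of the T(moving-data) term. The latency and bandwidth figures are illustrative assumptions for a DDR-era machine, not measurements:

```python
# Toy model of T(moving-data) = latency + size/bandwidth.
# All numbers are illustrative assumptions, not benchmarks.

def t_moving_data(size_bytes, latency_s, bandwidth_bytes_per_s, accesses=1):
    """Time to move data: one latency hit per access, plus transfer time."""
    return accesses * latency_s + size_bytes / bandwidth_bytes_per_s

LATENCY = 100e-9      # assumed ~100 ns per memory access
BANDWIDTH = 2e9       # assumed ~2 GB/s sustained bandwidth

# Streaming 1 GB sequentially: effectively one latency hit, bandwidth-limited.
sequential = t_moving_data(1e9, LATENCY, BANDWIDTH, accesses=1)

# Random walk touching 1 GB one 64-byte cache line at a time: latency-limited.
cache_lines = int(1e9 // 64)
random_walk = t_moving_data(1e9, LATENCY, BANDWIDTH, accesses=cache_lines)

print(f"sequential: {sequential:.2f} s, random: {random_walk:.2f} s")
```

With these made-up numbers the random walk is several times slower for the same data volume, which is the whole point of the "latency is forever" quote.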
> General rule of thumb: no matter how big, fast and expensive the storage
> solution is, you will always be able to drive it to its knees with enough
> cheap cluster nodes hitting it. This is something that has to be lived
> with.

I have a mantra which might be useful here: "Fast storage is local storage." This is not universally true, but it is a good working assumption in many cases. You have costs to pay with this mode of thinking, though. With a striped basic IDE system (e.g. not working too hard on it) I can get 70+ MB/s sustained on large sequential reads (think of reading in these large GB-sized files for analysis). I wrote up an analysis framework a while ago on this. You can pull that from http://scalableinformatics.com/scalable_fs_part_1.html

> There are faster solutions out there but they get expensive (think
> having to run switched fibre channel to every cluster node) and are
> often very proprietary. People in the 'super fast' storage space
> include: DataDirect, BlueArc, Panasas, etc. etc.

Some of these designs are quite compelling, BTW. If you really need scalable global-namespace I/O, you can do it. It has some specific advantages over local I/O. The aforementioned costs are the costs to move data between the compute nodes; with a global namespace, you don't have these problems.

[...]

> A linux box with some SCSI or ATA drives serving as an NFS fileserver
> will cost just a few thousand dollars

A Promise UltraATA card is cheap, and one of the fastest around.

> A linux box with a terabyte of NexSan ATA RAID attached via fibrechannel
> or SCSI will cost about $12-$15,000

[...]

> >> Also very much overlooked is the issue of cluster management. This
> >> tends to guide the choice of Linux distribution. Management gets to be
> >> painful after the 8th compute node; the old models don't work well on
> >> multiple system image machines.
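If you want to sanity-check the "70+ MB/s sustained sequential read" kind of number on your own storage, a crude timing loop is enough to get in the ballpark. This sketch uses a small file so it runs anywhere; for a real disk measurement you would use a multi-GB file (larger than RAM) so the page cache doesn't turn it into a memory benchmark. The path and size are illustrative assumptions:

```python
# Crude sequential-read throughput check. Small SIZE_MB so it runs quickly;
# a real measurement needs a file larger than RAM to defeat the page cache.
import os
import tempfile
import time

PATH = os.path.join(tempfile.gettempdir(), "throughput_test.bin")
SIZE_MB = 64
CHUNK = os.urandom(1024 * 1024)   # 1 MB of incompressible data

# Write the test file.
with open(PATH, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(CHUNK)

# Time a full sequential read in 1 MB blocks.
start = time.time()
read_bytes = 0
with open(PATH, "rb") as f:
    while True:
        block = f.read(1024 * 1024)
        if not block:
            break
        read_bytes += len(block)
elapsed = max(time.time() - start, 1e-9)

os.remove(PATH)
print(f"read {read_bytes / 1e6:.0f} MB in {elapsed:.3f} s "
      f"({read_bytes / 1e6 / elapsed:.1f} MB/s)")
```

Run it on the local stripe and again on the NFS mount, and the "fast storage is local storage" mantra usually shows up in the numbers.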
> I use and love systemimager (www.systemimager.org) for automating the
> process of managing my full-on 'install an OS on that node from scratch'
> as well as my 'update this file or directory across all cluster nodes'
> needs. It's a great product.

SystemImager is one of several products you can use. It is in the more "roll your own solution" category, which is great if that is what you want to do. There are other methods which use slightly different tools for this work, including RedHat kickstart and other related items. Kickstart has its advantages and disadvantages over other solutions, but it works quite nicely.

I use a variety of tools, but right now my favorite is the Linux ROCKS distribution (http://www.rocksclusters.org), which is an enhanced RedHat 7.3 distribution. It is one of the few cluster distributions that does things right from the outset. They use kickstart and a programmatically generated ks.cfg file; no fiddling with it by hand. The downside to kickstart and RH-based distributions is the rather poor disk partitioning decisions made by anaconda. There are simple methods to work around it (within ROCKS it is simple), but for most RH kickstart users, the assumptions made and the partition layouts are ... uh ... interesting.

[...]

> For cluster management you need to treat your compute nodes as anonymous
> and disposable. You cannot afford to be messing with them on an
> individual basis because your admin burden will scale linearly with
> cluster size.

This is the philosophy behind ROCKS.

[...]

> > Joe, I know this could easily turn into a thick book :), but how does one
> > get more educated about these things?

Pay Chris (Bioteam) and me (Scalable Informatics) to come out and have a beer with you ... it will be a rather long beer.... :)

More seriously, attend Chris' talk at O'Reilly, grab his slides. Bug all of us.
If there is enough interest, and people/organizations wouldn't mind paying for this, then Bioteam/Scalable Informatics/others might be persuaded to organize some sort of conference call or seminar on this. It wouldn't be cheap, but it might be helpful to the biocluster community.

> I'm probably biased but I've heard lots of people speak positively about
> reading the archived threads of this mailing list. Many of the questions
> you are asking about have been debated and discussed in the past on this
> very mailing list. The list archives are online at
> http://bioinformatics.org/pipermail/bioclusters/
>
> I'm also going to be rehashing a lot of this stuff in a talk at the
> upcoming O'Reilly Bioinformatics Technology conference. May be of
> interest to some people and really obvious and boring to others.

> > Another question that I am curious about is Itanium 2 and using them in a
> > cluster - any experiences with these? How about bioinformatics software -
> > any benefits in your regular programs like blast, clustalw... when running
> > on an itanium system?

I have to be careful here. Itanium and Itanium2 give you a really nicely sized memory system to play with, and very nice memory bandwidth for sequential access. I am not sure I can discuss the benchmarks I have run in the past. I think it is safe to say that I don't think they are going to push the IA32-based machines off the performance lead (blast, clustalw, hmmer, etc.) for a while.

The G4-based machine performance was both interesting and surprising (in a positive sense). Granted that some compiler assistance for the SIMD features of the chip was used, the results are real, and for similar calculations people seem to be getting good performance.

Interestingly, the most overlooked issues I have seen thus far, pertinent to this discussion, are power and cooling. Athlons generate about 70-160 W/node (singles vs. duals).
Xeons generate 90-200 W/node (please correct me if I am wrong on these; I am getting them from totaling heat dissipation info). Itaniums generate a bit more, G4s somewhat less. The Hammer series will be closer to Xeons in power from what I have seen (in public, no NDA stuff).

You have to plan for cooling and power. This planning exposes how expensive these aspects can be if you are not set up for it. Site planning is critical. If you can plan during the building stages, great. If you have to renovate an existing infrastructure, this can be expensive. If you can't renovate and have to work with what you have, some measure of creativity will be needed to avoid problems. More cluster nodes = more power and cooling needs. Hotter chips = more failures if cooling is inadequate.
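The power-to-cooling arithmetic is simple enough to sketch. Using the per-node wattage ranges quoted above (node count and per-node figure here are illustrative assumptions), and the standard conversions of 1 W = 3.412 BTU/hr and 1 ton of cooling = 12,000 BTU/hr:

```python
# Rough power/cooling estimate for a cluster. Node count and wattage are
# illustrative assumptions drawn from the per-node ranges discussed above.

NODES = 64
WATTS_PER_NODE = 160                  # e.g. a dual-Athlon node, upper end

total_watts = NODES * WATTS_PER_NODE
btu_per_hr = total_watts * 3.412      # 1 W of heat = 3.412 BTU/hr
cooling_tons = btu_per_hr / 12000.0   # 1 ton of cooling = 12,000 BTU/hr

print(f"{total_watts} W total -> {btu_per_hr:.0f} BTU/hr "
      f"-> {cooling_tons:.2f} tons of cooling")
```

Even this modest 64-node example lands around three tons of dedicated cooling, before you add disks, switches, and headroom, which is why site planning belongs at the start of the project rather than the end.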