Hi everybody - Many thanks to those of you who responded to my request for feedback on your Linux cluster experiences a few weeks ago. The BioInform article based on responses from this list and other sources is attached below. Hope you find it useful. The print version of the newsletter has a page of snazzy graphs that detail some of the results, but since that won't work here, I added a breakdown after the text of the article. Apologies for the length. I'll be happy to mail out hard copies to anyone who requests one.

Thanks again to everyone - Bernadette

......................................
Bernadette Toner
Editor, BioInform
GenomeWeb LLC
212-651-5636
firstname.lastname@example.org
www.bioinform.com

Bioinformatics Linux Clusters Gain Ground (Literally): Users Report Rapid Expansion

Bioclusters are big. And most of them are getting bigger, according to a recent user snapshot compiled by BioInform.

A relative rarity just three years ago, Linux clusters have quickly gained popularity in the bioinformatics community as an effective, low-cost, high-performance computing option. No longer limited to small, underfunded academic groups seeking compute power on the cheap, clusters have also taken root within biotechs and pharmaceutical firms looking for a scalable complement to other supercomputing resources. An entire sub-industry has sprouted as a result, with everyone from IBM to small, independent consulting firms making their services available to the biocluster community.

BioInform recently polled 20 members of this growing population to get a better sense of how well Linux clusters are delivering on their promise. Users from 11 academic groups and nine biopharmas responded to an informal survey on how well the technology has lived up to their expectations so far and where it fits into their future infrastructure plans. More a pulse-taking exercise than a statistically valid portrait of the user landscape, our efforts did reveal some interesting trends.
Most significantly, respondents indicated that they are taking full advantage of one of the primary selling points of the approach — its scalability — by regularly adding new CPUs to their existing clusters. Of the 20 groups surveyed, 17 have had a cluster in place for three years or less, but 14 have already added new CPUs. The six groups that have not yet expanded their clusters said they plan to do so in the next year, as do seven other groups (see full results below).

The average cluster size for the group increased from 81 CPUs at initial installation to a current size of 426 CPUs. The starting size for academic clusters averaged 49 CPUs, vs. 126 for biotechs and pharmas. The current average size for the two sectors has grown to 167 CPUs for academic groups and 783 for biopharmas.

Almost half of the clusters in our survey (nine) were originally home-grown systems. Three of these groups opted for a vendor or consultant when it came time to upgrade the system. One academic group that started out with a homemade system and then turned to a vendor for an upgrade at the two-year mark said it is going back to a homemade approach for round three. IBM, VA Linux, and Rackable were the most common choices for vendor-built systems, although it should be noted that firms like Linux Networx, Blackstone Computing, Microway, and others have sold a number of Linux clusters in the life science market; their customers simply did not respond to the survey.

Keep it Simple (and Cheap)

The home-grown flavor of our sample may explain the surprisingly poor showing of Platform's pricey LSF among distributed resource management systems. Six respondents apiece opted for home-grown job scheduling software or the open source Sun Grid Engine instead, with PBS (five) and Mosix (four) following close behind. There were few surprises in the applications category, however.
Proving that bioclusters are dubbed “Blast farms” for a reason, 14 of the 20 groups run some flavor of Blast on their clusters, with the usual suspects of Fasta, HMMer, ClustalW, and RepeatMasker also appearing regularly. Interestingly, none of the survey respondents opted for a commercial parallel Blast application such as TurboGenomics’ (now TurboWorx) TurboBlast, Paracel Blast, or Blackstone PowerBlast. This, again, may be due to the DIY leanings of the sample group: Four respondents indicated that they had developed their own parallel versions of the bioinformatics workhorse. One user noted that these commercial offerings “are only wrappers around NCBI/Wu-Blast and we are not very happy with them because of the costs or the programs they use.”

The Biocluster Bottom Line

When it came to judging how much bang Linux clusters deliver for their buck, the results were mixed. While more than half the respondents (11) indicated that the price/performance ratio of their cluster beat both that of other computational options and their own expectations, almost half (nine) said that issues such as cooling and maintenance costs pushed the total cost of ownership for the system a bit higher than anticipated. Those who did their homework before installing the cluster — by speaking to other users and investigating all their available options — were confronted with fewer shocks, however.
One user was surprised by “how much heat the new AMD Athlon machines put out,” which led to “a few one-time startup expenses that relate to cooling.” Another simply noted that cooling is a “big deal.”

For those who opted to build their own, many underestimated “the effort of building and administering a cluster by ourselves.” The head of an academic research lab noted that despite the benefits of the cluster, “I am quite dependent on the expertise of one person (the PhD student who built it, who will leave the lab shortly).” Another bemoaned the “time required to customize applications to run on clusters,” while one user wished for “more off-the-shelf cookbooks on how to set up and maintain a cluster.”

Conversely, most who opted for vendor-installed systems seemed pleased with their choice. As one respondent put it: “The cost might have been much less if we had built the cluster ourselves. But this would have resulted in additional headaches in terms of maintenance of the machines. The cluster we have now has been running non-stop and no downtime in the last 12 months!”

While maintenance costs, I/O bottlenecks, and fileserver limitations were listed among the top drawbacks of the technology, for the majority of survey respondents Linux clusters deliver a combination of low cost, scalability, and speed that far outweighs these inconveniences. One user explained, “we were able to do full human genome analysis in one month using only 16 Intel Pentium machines. Now [our] 26 new machines can do the exact same analysis in two weeks. All 42 machines together should be able to do that same analysis in little over a week. All this, for a cost much less than one mid-range computing system that would have an equal number of processors and comparable computing time.”

As another respondent summed up, the equation that describes why Linux clusters are growing so rapidly is very simple: “Need more power: buy more nodes.”

— BT

How long have you been using a Linux cluster?
0-12 months: 5 (25%)
13-24 months: 8 (40%)
25-36 months: 4 (20%)
37+ months: 3 (15%)

Original number of CPUs:
Range: 4-475
Average: 81
Most common number: 32 (4 responses)

Current number of CPUs:
Range: 30-2,300
Average: 426
Average increase: 5.2X

Only 6 (30%) respondents did not add to their cluster. Of those who added CPUs, all but two more than doubled the size of their original cluster. Of those, 7 had more than a 5X increase in the number of CPUs, and 3 had an increase of 10X or more.

Do you plan to add new CPUs to your cluster over the next year?
Yes: 13 (65%)
No response/maybe: 6 (30%)
No: 1 (5%)

The 8 respondents who specified their future plans planned to add an average of 195 CPUs, with a range of 20-1,000.

Processor type:
All respondents said that all or part of their cluster used Intel chips. Of those who specified:
Pentium III: 14 (70%)
Pentium 4/Xeon: 4 (20%)
AMD Athlon: 3 (15%)
Pentium II: 1 (5%)
Mac G4: 1 (5%)
Sparc II: 1 (5%)

Distributed resource management*:
In-house systems: 6 (30%)
Sun Grid Engine: 6 (30%)
PBS: 5 (25%)
Mosix: 4 (20%)
LSF: 1 (5%)
Globus Grid: 1 (5%)
Parasol: 1 (5%)
Condor: 1 (5%)
*Total is greater than 20 because systems are used in combination.

Who built your cluster?*
Homemade: 9 (45%)
IBM: 4 (20%)
Rackable Systems: 2 (10%)
VA Linux: 2 (10%)
Unspecified vendor: 2 (10%)
Unspecified consultant: 1 (5%)
Sun: 1 (5%)
Western Scientific: 1 (5%)
ICT: 1 (5%)
Quant-X: 1 (5%)
*Of those who built their own cluster, 3 (33%) responded that they had hired a vendor for an upgrade.

Applications:
Blast (NCBI, Wu-Blast, Psi-Blast, etc.): 14 (70%)
Fasta: 4 (20%)
Protein folding/molecular dynamics: 4 (20%)
HMMer: 3 (15%)
ClustalW: 3 (15%)
RepeatMasker: 3 (15%)
Phred: 2 (10%)
Phrap: 2 (10%)
Emboss: 2 (10%)
Sim4: 2 (10%)

There were 4 respondents using parallelized versions of Blast. These were all in-house adaptations. None of the survey respondents indicated they were using a commercial parallel Blast application.
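The in-house parallel Blast adaptations tallied above work because Blast searches are embarrassingly parallel: a query set can be split into chunks and each chunk searched on a different node, with no communication between jobs. The sketch below illustrates that general split-and-dispatch pattern in Python; it is not any respondent's actual wrapper, and the function names, chunk files, and "nr" database are illustrative assumptions. The blastall flags shown (-p, -d, -i, -o) are those of the NCBI toolkit of the period.

```python
# Hypothetical sketch of a split-and-dispatch parallel Blast wrapper.
# All names here are illustrative, not taken from the survey responses.

def split_fasta(text, n_chunks):
    """Split a multi-FASTA string into up to n_chunks balanced chunks."""
    records = [">" + r for r in text.split(">") if r.strip()]
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)  # round-robin keeps chunks even
    return ["".join(chunk) for chunk in chunks if chunk]

def blast_commands(chunk_files, program="blastp", db="nr"):
    """Build one independent blastall invocation per chunk file.

    Each command is a self-contained job that a scheduler such as
    Sun Grid Engine or PBS could run on any node; the per-chunk
    outputs are simply concatenated afterward.
    """
    return [f"blastall -p {program} -d {db} -i {f} -o {f}.out"
            for f in chunk_files]
```

Because the chunks share nothing, throughput scales almost linearly with node count until the fileserver or network saturates, which is consistent with the I/O bottleneck several respondents cited as a drawback.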
Price/Performance:
Better than expected: 11 (55%)
As expected: 8 (40%)
Worse than expected: 1 (5%)

Total cost of ownership:
Better than expected: 6 (30%)
As expected: 5 (25%)
Worse than expected*: 9 (45%)
*Of those who indicated that TCO was worse than expected, 3 (33%) indicated that this was due to higher than anticipated cooling costs.

Advantages:
Price/performance: 14 (70%)
Scalability: 5 (25%)
Speed/compute power: 3 (15%)
Availability of embarrassingly parallel bioinformatics applications: 2 (10%)

Drawbacks:
Systems administration overhead: 7 (35%)
I/O bottleneck: 3 (15%)
None: 3 (15%)
Lack of parallelized bioinformatics software: 3 (15%)
Hardware problems: 2 (10%)
Usability problems: 1 (5%)
Lack of support for Linux: 1 (5%)

Wish list:
Better/more parallelized bioinformatics applications: 5 (25%)
Shared memory: 4 (20%)
Cheaper/better fileserver: 3 (15%)
Improved distributed data mechanisms: 2 (10%)
Better workflow management systems: 1 (5%)
Faster/cheaper interconnects: 1 (5%)
Instruction manual: 1 (5%)

Copyright © 2001,2002 GenomeWeb LLC. All Rights Reserved.