[Bioclusters] Best ways to tackle migration to dedicated cluster/Farm

Ross Crowhurst bioclusters@bioinformatics.org
Wed, 24 Mar 2004 09:35:51 +1200


I run a small PC farm to handle annotation of DNA sequence (primarily ESTs, about 550,000 at present). The current model for this farm is to use "whatever is available" to us within our Institute in respect of compute capacity. Consequently the farm has a few (19) 24x7 nodes and 50-odd nodes that join the farm after normal laboratory hours. The nodes within this farm are Intel PCs of varying types (512 MB RAM, CPUs from PIII/800 MHz to P4/2.8 GHz, mostly 40-80 GB local disks). We have 600-odd PCs in our Institute, but I can no longer use this pool to expand capacity as our new CIO is dead against this distributed model, which seems a pity since 80 new Dell 2.8 GHz PCs arrived on our site this month. The disadvantages of the current model are the necessity to insert a second hard disk to support Linux (the primary OS is Windows XP) and the Unix system administration overhead of managing OS updates to "transient" nodes (all bioinformatic database updates, script changes, etc. are rolled out automatically from the master node when a node comes online, or as required while it is online in the farm).

Our CIO favours a fully dedicated system, which would be great for us except that his goals may not be identical to ours - he has cost drivers, we have performance drivers. To this end I have recently compared our existing farm output to 1U test machines from two major vendors (single-CPU and dual-CPU). The single-CPU machine was a 2.66 GHz P4 with 1 GB RAM. The dual 2.8 GHz Xeon machine was trialed initially with 512 MB per processor, then 1 GB, and then 2 GB RAM per processor. Comparisons to existing farm nodes showed the single-CPU test system performed similarly to a 2.8 GHz P4 (512 MB RAM), with little benefit from the additional RAM. The dual Xeon likewise showed only a small difference from this when the RAM was less than 2 GB per processor. When increased to 2 GB per processor, 2-6 fold increases in output were seen in blastn vs "nt", blastx vs "nrdb90", and interproscan (dependent on task). The trial used our live production pipeline, so each node does not receive the same jobs; however, this is compensated for by the fact that the runs were in the range of 8,000-16,000 jobs per node. Currently we are not splitting the large databases for blast (hence the performance gain seen for the 2 GB per processor model). We are getting other test models in, but it really seems sensible to tap into the wealth of knowledge that is already within the BioCluster community.

I have been scanning this newsgroup in an attempt to gain a better idea of what others are implementing as solutions (1 CPU vs 2 CPU, memory per processor, etc.) and would welcome any input that you wish to give. In particular, what is the minimum memory configuration per processor that is being used for blastn vs "nt", in the cases where the database is being split and where it is not?

Also, our existing farm uses "node pull". That is, as nodes come online, a process on each node requests from a MySQL configuration database the types of jobs that the node is capable of undertaking, then requests a chunk of jobs from a MySQL database functioning as a job queue. The nodes process their chunk of jobs and post parsed results directly back to the appropriate MySQL database. All blasts are performed by piping from the control script to blast and then piping results back in for parsing; no physical sequence/report files are read from or written to local disk (except for interproscan). I used to use NFS and have the nodes send result files back to an NFS server, where they were parsed into the database, but that was incredibly slow compared to the system I now operate. The "node pull" system seems ideal for our current environment, but if we move to a farm/cluster that is available 24x7 there may be a better way to do it (use SGE or another standard cluster queuing system, etc.). If I move to splitting databases then it seems I am back to using NFS, generation of physical reports, and parsing these on one or more servers (parsing itself could be a new job type, with merged blast reports redistributed to the cluster to parse?). Is there a consensus on the best or most appropriate way to tackle this in a dedicated cluster environment? I would welcome input on this as well.
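For anyone curious about the shape of the "node pull" loop described above, here is a minimal Python sketch. The table names (job_queue, results), the chunk size, and the use of blastall's tabular (-m 8) output are my illustrative assumptions, not necessarily the author's actual schema or report format; a production worker would also need to atomically mark rows as claimed so two nodes cannot pull the same chunk.

```python
# Sketch of a "node pull" worker: claim a chunk of jobs from a MySQL queue,
# pipe sequences to blast via stdin/stdout (no files on local disk), and
# post parsed results back. Schema and -m 8 usage are assumptions.
import subprocess

# Column order of blastall -m 8 tabular output.
TAB_FIELDS = ("query", "subject", "pct_id", "aln_len", "mismatches",
              "gap_opens", "q_start", "q_end", "s_start", "s_end",
              "evalue", "bit_score")

def parse_hit(line):
    """Parse one line of blastall -m 8 tabular output into a dict."""
    hit = dict(zip(TAB_FIELDS, line.rstrip("\n").split("\t")))
    for key in ("pct_id", "evalue", "bit_score"):
        hit[key] = float(hit[key])
    return hit

def run_blast(fasta_text, program="blastn", database="nt"):
    """Pipe FASTA text to blastall on stdin and read hits from stdout,
    so no sequence or report files touch the local disk."""
    proc = subprocess.run(
        ["blastall", "-p", program, "-d", database, "-m", "8"],
        input=fasta_text, capture_output=True, text=True, check=True)
    return [parse_hit(l) for l in proc.stdout.splitlines() if l.strip()]

def worker_loop(db, chunk_size=500):
    """Pull chunks from the job queue until it is empty; 'db' is a
    PEP 249-style (e.g. MySQLdb) connection. Hypothetical schema."""
    while True:
        cur = db.cursor()
        # A real system would UPDATE ... WHERE state='pending' first to
        # claim these rows atomically before other nodes see them.
        cur.execute("SELECT id, fasta FROM job_queue "
                    "WHERE state = 'pending' LIMIT %s", (chunk_size,))
        jobs = cur.fetchall()
        if not jobs:
            break
        for job_id, fasta in jobs:
            for hit in run_blast(fasta):
                cur.execute("INSERT INTO results (job_id, subject, evalue) "
                            "VALUES (%s, %s, %s)",
                            (job_id, hit["subject"], hit["evalue"]))
        db.commit()
```

The appeal of this pattern, as the post notes, is that results stream straight from the blast pipe into the database; the NFS round trip (write report, read report, parse) disappears entirely.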

Apologies if this is "old hat" to many of you.

