[Bioclusters] Linux cluster storage question (SAN/NAS/GPFS)

James Lowey bioclusters@bioinformatics.org
Thu, 19 Aug 2004 13:43:05 -0700

Anand S Bisen wrote:

I wanted to know which is the better alternative for a cluster of 48
dual-processor nodes that runs 24x7 on life science problems with
extensive small-file I/O, where performance matters. The I/O I am
talking about is small file reads and writes (say 10-20 KB each), with
tens of thousands of these operations hitting the file system
simultaneously. How well does a distributed file system like GPFS on a
SAN work for this, compared with NAS storage? We are in the process of
designing a cluster for a life science problem that will work on tens
of thousands of files simultaneously from across the Linux cluster, and
we are hung up on the storage options: the pros and cons of GPFS on a
SAN versus a NAS device. If somebody could point me in the right
direction it would be great, because the few sites I have read say NAS
devices are the preferred option, but I couldn't find the reasons to
support either one.

From a pure performance perspective, using a file system such as GFS or
Lustre on nodes directly attached to the SAN would be #1. However, this
could be cost prohibitive and would require a fair amount of
administrative overhead. If you are doing a lot of I/O, I would not
recommend SATA drives: performance on heavy transactional loads
degrades very quickly once the cache on your SAN front end is
exhausted. SCSI would be the only way to go here.
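
One way to compare candidate storage before buying is to approximate
the workload from the question (many concurrent 10-20 KB reads and
writes) with a small test. The following is only a minimal sketch in
Python, not a calibrated benchmark: the function name run_benchmark,
the file naming scheme, and the default sizes and worker count are all
illustrative assumptions, and it runs from a single node, whereas the
real load would come from all 48 nodes at once. The read phase will
also largely be served from the client page cache unless the cache is
dropped or the files are read from a different node.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def _write_one(path, payload):
    # fsync so the timing reflects the storage, not just the page cache.
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return len(payload)


def _read_one(path):
    with open(path, "rb") as f:
        return len(f.read())


def run_benchmark(root, n_files=1000, file_size=16 * 1024, workers=32):
    """Write, then read, n_files small files under root.

    Returns operations per second for each phase. All defaults are
    assumptions chosen to resemble the 10-20 KB workload in the thread.
    """
    payload = os.urandom(file_size)
    paths = [os.path.join(root, f"f{i:06d}.dat") for i in range(n_files)]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.perf_counter()
        written = sum(pool.map(_write_one, paths, [payload] * n_files))
        results["write_ops_per_sec"] = n_files / (time.perf_counter() - start)

        start = time.perf_counter()
        read = sum(pool.map(_read_one, paths))
        results["read_ops_per_sec"] = n_files / (time.perf_counter() - start)
    results["bytes_written"] = written
    results["bytes_read"] = read
    return results


if __name__ == "__main__":
    # Pass dir=<mount point> to TemporaryDirectory to test a specific
    # file system instead of the default temp location.
    with tempfile.TemporaryDirectory() as root:
        print(run_benchmark(root))
```

Running the same script against each candidate mount point, ideally
from several nodes at the same time, gives a rough like-for-like
comparison of small-file operation rates.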

I would definitely not recommend going with any NFS solution as this
type of I/O will bring your filer to max capacity in a hurry. (Unless
you buy very high-end load-balanced systems)

Another issue to take into account is that file services share
bandwidth with the programs actually running. Some codes are fairly
network intensive, and MPI is very sensitive to latency. So once again,
from a pure performance standpoint, direct attached disk is the way to
go.

If it is cost prohibitive to build this kind of infrastructure, I would
recommend using IBM's GPFS on a separate network from the computational
network. I have been using GPFS for around a year and have been pleased
with the performance and scalability. The thing to remember is that
IBM charges for support on a yearly basis, so this can end up costing
quite a bit of money over the long haul. (However, the other solutions
would no doubt require similar support.)

So to sum up:
Cost is no object:  Direct SAN-attached disk with a parallel file
system such as Lustre

Hardware cost doesn't matter, admin costs are limited:  Buy a big NetApp
or BlueArc filer

Constrained to a smaller budget for both admin and hardware: Buy a
decent SAN, and front-end it with GPFS, Lustre, or GFS.  (I recommend
GPFS from experience; I can't speak for Lustre or GFS.)

James Lowey
Lead, High Performance Computing Systems
TGen,  The Translational Genomics Research Institute
400 N. Fifth Street,  Suite 1600
Phoenix, AZ 85004