[Bioclusters] SAS on clusters

Moxon, Bruce bioclusters@bioinformatics.org
Fri, 19 Mar 2004 09:56:03 -0800

I ran SAS on an IBM SP2 a few years ago (basically a cluster with a
proprietary high-speed interconnect).

I ran it against DB2 (AIX) in a "data parallel" mode for data mining apps.
There were three phases:
	1 data characterization
	2 model generation
	3 model application (i.e. applying a scoring algorithm to db rows)

Phases 1 and 3 can be done in a "data parallel" fashion -- with processors
pulling partitions of the overall database and processing them independently.
This married well with the shared-nothing DB2 database architecture.
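
As a rough illustration of the data parallel pattern for phase 3, here is a
minimal Python sketch. Everything in it is an assumption for illustration
only: the row layout, the modulo partitioning on "id", the toy linear
scoring rule, and the thread pool standing in for cluster nodes. It is not
the SAS/DB2 setup described above.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts, key=lambda row: row["id"]):
    """Split rows into n_parts disjoint partitions by key modulo n_parts."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[key(row) % n_parts].append(row)
    return parts


def score_partition(rows):
    """Hypothetical scoring rule: a fixed linear model applied row by row."""
    return [(row["id"], 0.5 * row["x"] + 1.0) for row in rows]


def score_all(rows, n_workers=4):
    """Score each partition independently, then gather the results."""
    parts = partition(rows, n_workers)
    # Threads stand in for cluster nodes; no partition needs data from another.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scored = pool.map(score_partition, parts)
    return sorted(pair for part in scored for pair in part)


# Made-up input rows, one per database row to be scored.
rows = [{"id": i, "x": float(i)} for i in range(10)]
scores = score_all(rows)
```

Because no partition depends on another, adding workers only changes how the
rows are split, not the scores themselves.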

Phase 2 typically required putting together one comprehensive dataset and
running on a smallish SMP (4- or 8-way).  You can pull the datasets and cat
them together, try log-combining approaches to do this more rapidly,
and/or find a very fast file system that allows "concurrent write" from
multiple clients into a single file (we happen to have one at Panasas; there
are some other efforts in this area as well).
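
One possible reading of "log-combining" is a pairwise merge, where the
per-node datasets are combined in rounds so that N pieces need roughly
log2(N) rounds instead of N-1 sequential appends. The Python sketch below
assumes that reading. The function name combine_pairwise is hypothetical,
and the merges within a round run sequentially here, whereas on a cluster
each pair would be merged on a different node at the same time.

```python
def combine_pairwise(datasets):
    """Merge a list of row lists in rounds, roughly halving the count each round.

    Returns the combined rows and the number of rounds used.
    """
    current = [list(d) for d in datasets]
    rounds = 0
    while len(current) > 1:
        # Merge neighbours (0+1, 2+3, ...); each pair is independent of the others.
        merged = [current[i] + current[i + 1]
                  for i in range(0, len(current) - 1, 2)]
        if len(current) % 2:
            # An odd piece out is carried into the next round unchanged.
            merged.append(current[-1])
        current = merged
        rounds += 1
    return (current[0] if current else []), rounds


# Eight single-row pieces, as if pulled from eight nodes.
pieces = [[i] for i in range(8)]
combined, rounds = combine_pairwise(pieces)
```

With eight pieces this takes three rounds, and the original order of the
pieces is preserved in the combined result.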

Besides the licensing issues (you need to license SAS on every node), the
biggest challenges were around data partitioning and subsetting strategies.
If you're running against a parallel database engine, do as much processing
as you can in SQL before pulling the data out with SAS/CONNECT.
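
To make the "do it in SQL first" point concrete, here is a small Python
sketch that filters and aggregates inside the database and pulls out only
the summary rows. It uses the standard library's sqlite3 module purely as a
stand-in for a parallel database engine; the "visits" table, its columns,
and the data are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the real engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (3, 0.5), (3, 1.5)],
)

# Filter and aggregate in SQL so only one row per customer leaves the database,
# rather than pulling every raw row and subsetting on the client side.
summary = conn.execute(
    "SELECT customer_id, COUNT(*), SUM(amount) "
    "FROM visits "
    "WHERE amount >= 1.0 "
    "GROUP BY customer_id "
    "ORDER BY customer_id"
).fetchall()
conn.close()
```

Here five raw rows shrink to three summary rows; on a real partitioned
database the same idea cuts the volume each client has to pull.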

You'll also want to exploit the scale-out / data parallel architecture
wherever you can.  Model generation is the hard part: accelerating that
phase may mean heavier hardware or innovative approaches.  There was some
research on parallel/distributed model generation emerging when I was
looking at this a few years ago.


Bruce Moxon
Chief Solutions Architect, Panasas Inc.
Delivering the premier storage system for scalable Linux clusters