There are special cases where GFS will work very well for bio databases (small databases that can be easily cached on each GFS server node). The problem occurs when you write to such a database, or when the database is too large to live in cache. Secondly, Sistina itself recommends a hybrid approach for clusters -- call this a federated approach.

This will seem a somewhat roundabout discussion, but it all centers on where your bottlenecks are. Putting a faster disk into a system where the disk controller is the limiting factor does not make the I/O system faster. The same principle holds true here: the issue is the smallest pipe's bandwidth, and that is usually found where the smaller pipes aggregate, between the servers and the switch.

Chris's concern with I/O stems from the understanding that non-local I/O (any sort of remote I/O) will be bounded by the slowest pipe in the process chain. So if you have 100 clients on 100BaseT (~10 MB/s per connection) and one server with a gigabit interface (~100 MB/s per gigabit connection), and you are seeing a constant load on the file server from each client (and a high duty cycle on the usage -- think of this as the percent of time the resource is used), the bottleneck is very likely to be the gigabit interface. A single gigabit interface can handle ~10 of the 100BaseT interfaces running flat out, at 100% utilization of the 100BaseT links. If you double that number of 100BaseT clients, you are oversubscribing the available bandwidth, which forces additional latency into the system. Your wall-clock times for remote file operations will increase.

Another issue you have to worry about is the bandwidth of your switch's backplane. Not all switches are created equal, and you have to be careful not to run out of backplane bandwidth. GFS rests on the network fabric; for best performance, Sistina recommends a fibre channel fabric (high bandwidth, moderate latency).
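The oversubscription arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the approximate figures from the text (~10 MB/s per 100BaseT client, ~100 MB/s per gigabit uplink); the function name and duty-cycle parameter are illustrative, not from any real tool:

```python
# Rough per-link throughput figures from the discussion above (MB/s).
CLIENT_BW_MBPS = 10.0    # ~usable bandwidth of one 100BaseT client
SERVER_BW_MBPS = 100.0   # ~usable bandwidth of one gigabit uplink

def oversubscription(n_clients: int, duty_cycle: float = 1.0) -> float:
    """Ratio of offered client load to server-link capacity.

    A result > 1.0 means the gigabit uplink is oversubscribed:
    requests queue, latency rises, and wall-clock times for remote
    file operations increase.
    """
    offered_load = n_clients * CLIENT_BW_MBPS * duty_cycle
    return offered_load / SERVER_BW_MBPS

print(oversubscription(10))        # 1.0  -> link exactly saturated
print(oversubscription(20))        # 2.0  -> 2x oversubscribed
print(oversubscription(100, 0.1))  # 1.0  -> 100 clients at 10% duty cycle
```

The last case shows why duty cycle matters: 100 clients that each touch the file server only 10% of the time offer the same aggregate load as 10 clients running flat out.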
For large clusters, you look at building SANs and related I/O systems (which Chris can talk about at length if you ask him). The point is that software will not overcome fundamental hardware implementation issues. It may blunt some of the pain, but hardware not well suited to the task will not be fixed by adding software. Mosix and SSI variants (including DSM, etc.) are interesting, but will not overcome hardware limitations. The same goes for GFS and PVFS. GFS can help ameliorate some pain and ease some design considerations (for certain things, it is really quite good).

Joe

On Mon, 2002-05-13 at 03:00, Ivo Grosse wrote:
> Hi Chris and other bioclusterers,
>
> I just stumbled upon http://www.sistina.com/products_gfs.htm, which
> states that one advantage of GFS (combined with Mosix?) is that it
>
> "Eliminates NFS Bottlenecks,"
>
> and as example these guys list
>
> "Life Sciences"
>
> and
>
> "Shared BLAST databases."
>
> Independently of the fact that these guys charge $1000/node (Am I
> correct?), I wonder if their claims are correct? Can Mosix -- by using
> GFS -- really achieve a high I/O throughput?
>
> A related question is: does Mosix work together with PVFS?
>
> Best regards, Ivo
>
> >>>
> chris dagdigian bioclusters@bioinformatics.org
> Tue, 05 Mar 2002 11:16:51 -0500
>
> ...
>
> (1) My limited experience with MOSIX has me believing that in the MOSIX
> world a process that is doing heavy I/O operations will never get
> migrated across to a less loaded machine. This alone is enough for me to
> not consider MOSIX/SSI for bioclusters because all of the ones I have
> built so far are very, very often used for IO-bound embarrassingly
> parallel jobs (blast, genscan, etc. etc). Am I totally wrong? Anyone out
> there using SSI/Mosix systems for hardcore biology stuff?
> <<<
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> http://bioinformatics.org/mailman/listinfo/bioclusters