[Bioclusters] breaking up NCBI databases

Joseph Landman bioclusters@bioinformatics.org
01 May 2003 19:40:23 -0400


On Thu, 2003-05-01 at 18:29, Jeremy Mann wrote:
> I am curious if anyone knows of any commercial or open source solution
> for breaking up the NCBI dbs into various sizes. Here, our present
> solution is NFS mounts of /ncbi to each cluster node. Today, we gave a
> go at submitting numerous BLAST jobs with PBS. Boy, talk about a
> complete performance drain (namely nfsd). I haven't seen so many switch
> lights non-stop in a LONG time ;)

Hi Jeremy:

  I strongly recommend moving the database indices to local storage.
There are very few cases where it makes sense to run with the db info on
shared storage.  If you are using 100 Mb ethernet, you max out at about
12 MB/s on the network into the compute node.  If you have gigabit
everywhere, life is still painful, as the file server's network link is
now the choke point.

  Local storage (an IDE disk) is usually capable of 20-40 MB/s (2 - 4 x
the 100 Mb maximum speed).  If you break your databases up small enough,
each machine can hold its segment entirely in buffer cache, but you
still have to do the initial read.  Reading 1 GB at 12 MB/s takes 85
seconds; reading that same GB at 33 MB/s (my typical IDE drive) takes 31
seconds.  Moreover, there is no contention for that 33 MB/s.  For the 12
MB/s case, if N requesters are asking for files (same or different,
doesn't matter) out of a 100 Mb NIC, each will on average get 12/N MB/s.
Local database indices suffer no such penalty.
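  If you want to check those numbers on your own hardware, a crude dd
test is enough.  The paths below are just examples; use a file larger
than RAM, or a cold cache, so you measure the disk and the wire rather
than the buffer cache:

    # sequential read throughput: local disk vs. the NFS mount
    # (example paths -- substitute a large db file of your own)
    time dd if=/scratch/ncbi/nt.00.nsq of=/dev/null bs=1M count=1024
    time dd if=/ncbi/nt.00.nsq of=/dev/null bs=1M count=1024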

> I have been experimenting with mpiBLAST (using 4 test nodes). So far
> it's worked extremely well. I like the fact that its formatdb formats
> equal segments for however many nodes you specify. Now this database is
> only useful with mpiBLAST. I want to try to use one version of the
> database for all implementations (we also use wwwblast and command line
> tools). And since the BLAST dbs are precompiled, we have to use
> fastacmd to revert them to unformatted FASTA, THEN run the mpiblast
> formatdb. Now do this for 20 nodes ;( If we choose this method we would
> have to update once a week instead of the present nightly updates.

I might suggest a quick look at
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ .  NCBI distributes the
databases there as unformatted FASTA, which lets you skip the fastacmd
step entirely.

I am working on a simple script to handle the pulls, db-segmentation,
and formatting.  I will post a pointer to it when I am done.
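In the meantime, here is a rough sketch of what it will do.  The
database name, paths, segment count, and the mpiformatdb flags are all
assumptions; check the flags against your mpiBLAST version:

    #!/bin/sh
    # Sketch: pull an unformatted FASTA db from NCBI, then cut it into
    # segments.  DB, DBDIR, and NSEG are assumptions -- adjust to taste.
    DB=nr
    DBDIR=/scratch/ncbi
    NSEG=22

    cd $DBDIR || exit 1
    wget -N ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/${DB}.gz
    gunzip -c ${DB}.gz > ${DB}

    # mpiBLAST's formatter takes the fragment count directly;
    # -N per the mpiBLAST docs, -p T since nr is protein
    mpiformatdb -N $NSEG -i ${DB} -p T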

> Now there has been talk lately on this list regarding installing an extra
> harddrive in each node and use that as the database drive. After today I
> am completely sold on doing this seeing how drives are very, very cheap.

Good!

> I guess the ending question is, we would like to use one database (with
> equal segments for 22 nodes) for parallel, www blast and command line
> BLAST programs. Does such a thing exist or am I just wishing?

You will need to abstract formatdb by wrapping it and adding a
distribution step to a local cache directory on each node.  This is not
hard.  The wrapper can take the same arguments as the regular formatdb;
all you have to add is the distribution mechanism, as in the sketch
below.
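A minimal sketch, assuming rsh/rcp and a flat node list; NODES and CACHE
are made up, and you would swap in ssh/scp or your scheduler's file
staging as you see fit:

    #!/bin/sh
    # Wrapped formatdb: run the real formatdb, then push the resulting
    # index files into a local cache directory on every node.
    CACHE=/scratch/ncbi                 # assumed local cache path
    NODES=`cat /etc/cluster-nodes`      # hypothetical node list

    formatdb "$@" || exit 1             # same arguments as the real thing

    # formatdb leaves .phr/.pin/.psq (protein) or .nhr/.nin/.nsq
    # (nucleotide) indices in the current directory; distribute them
    for node in $NODES; do
        rsh $node mkdir -p $CACHE
        for f in *.?hr *.?in *.?sq; do
            rcp $f ${node}:${CACHE}/$f
        done
    done

Point your BLAST jobs at $CACHE on each node, and the NFS server drops
out of the hot path entirely.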

> Oh, one more catch, we also use SAM, PFam and GCG which use our existing
> NCBI dbs.

Shouldn't matter that much, though you will need to keep the original
dbs around for those.

-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman@scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615