[Bioclusters] Grid BLAST

Mon Sep 26 05:22:43 EDT 2005

What are you using to merge the outputs (and to manage the statistics)?

===============================================
David Coornaert    (dcoorna at dbm.ulb.ac.be)

Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles

Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041  Gosselies
BELGIQUE

Tél:  +3226509975
Fax:  +3226509998
===============================================

Tim Cutts wrote:

>
> On 24 Sep 2005, at 7:40 pm, Warren Gish wrote:
>
>>> Hi, I'm the administrator of the bioinformatics laboratory at Université
>>> du Québec à Montréal.  I have a room filled with dual P4 3GHz
>>> workstations.  The boxen are dual booted with Windows and GNU/Linux but
>>> they spend most of their time on GNU/Linux.  Each box has 2 GB of RAM
>>> so I expected decent performance with local BLAST jobs but the sad
>>> truth is that my jobs run about 4 times slower with blast2 than
>>> with blastcl3 with the same parameters.  The hard drive is IDE so I
>>> suspect a bottleneck here.
>>>
>> Make sure the IDE drivers are configured to use DMA I/O, but if repeat
>> searches of a database are just as slow as the first time it is searched,
>> then experience indicates the problem is that the amount of free memory
>> available is insufficient to cache the database files.  Database file
>> caching is a tremendous benefit for blastn searches.  If your jobs use
>> too much heap memory, though, no memory may be available for file caching.
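A rough way to sanity-check the point above is to compare the size of the database files with what is left of RAM once the search's own heap is accounted for. The sketch below is in Python purely for illustration; the function names, the 512 MiB heap figure and the 256 MiB reserve for the rest of the system are all assumptions, not measured values. Only the 2 GB of RAM comes from the original question.

```python
import os

MIB = 2**20
GIB = 2**30


def database_bytes(paths):
    """Total on-disk size of the given database files."""
    return sum(os.path.getsize(path) for path in paths)


def fits_in_page_cache(db_bytes, ram_bytes, heap_bytes, reserve_bytes=256 * MIB):
    """True if the database could stay cached alongside the job's heap.

    reserve_bytes is an assumed allowance for the kernel and other processes.
    """
    return db_bytes <= ram_bytes - heap_bytes - reserve_bytes


ram = 2 * GIB        # from the original post: 2 GB per workstation
heap = 512 * MIB     # assumed heap use of one search, for illustration only

print(fits_in_page_cache(1 * 10**9, ram, heap))  # True
print(fits_in_page_cache(3 * 10**9, ram, heap))  # False
```

On a real node, `database_bytes()` would be fed the actual database file paths, and the heap figure would have to come from watching a running job rather than from a guess.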
>
>
> I often see caching problems if people have written their pipeline  
> code incorrectly too; people naturally tend to write things like:
>
> foreach $seq (@sequences) {
>     # each query walks through every database before the next query starts
>     foreach $db (@databases) {
>         system("blastn ...");
>     }
> }
>
> which is, of course, exactly the wrong way round, and guarantees  
> trashing the disk cache every single time.
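To make the loop-order point concrete, here is a minimal sketch (Python rather than the Perl of the quoted fragment, and with made-up sequence and database names) that puts the database in the outer loop and counts how often consecutive jobs change database. The actual search command is deliberately left out; `cache_friendly_order` and `database_switches` are hypothetical helper names.

```python
def cache_friendly_order(sequences, databases):
    """Yield (db, seq) jobs with the database as the OUTER loop,
    so each database is read once and then reused from cache."""
    for db in databases:
        for seq in sequences:
            yield db, seq


def database_switches(jobs):
    """Count how many times consecutive jobs move to a different database."""
    switches = 0
    previous = None
    for db, _seq in jobs:
        if db != previous:
            switches += 1
            previous = db
    return switches


sequences = ["seq1", "seq2", "seq3"]  # hypothetical query names
databases = ["dbA", "dbB"]            # hypothetical database names

# The order from the quoted loop: sequence outside, database inside.
naive = [(db, seq) for seq in sequences for db in databases]
reordered = list(cache_friendly_order(sequences, databases))

print(database_switches(naive), database_switches(reordered))  # 6 2
```

With the sequence loop outside, every single job lands on a different database from the previous one; with the database loop outside, the number of switches equals the number of databases.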
>
> It's worthwhile to break your databases into chunks which are small  
> enough for the entire thing to be cached on your compute nodes; until  
> recently, we always broke nucleotide databases into 800 MB chunks.   
> Of course, care then needs to be taken to get the statistics right  
> when running lots of sequences against the individual chunks.  If it  
> fits your requirements, the automatic slicing that both blast  
> flavours can do might work for you, but we do it manually.
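On getting the statistics right across chunks: one common approach, sketched below under stated assumptions, is to rescale each hit's E-value from the chunk it was found in to the size of the whole database, then merge and re-sort. This assumes the E-value is proportional to database length (E = K*m*n*exp(-lambda*S)) and ignores effective-length corrections, so it is an approximation and not necessarily what anyone in this thread does. The function names, chunk names and hit values are invented for illustration.

```python
def rescale_evalue(chunk_evalue, chunk_letters, total_letters):
    """Scale an E-value computed against one chunk up to the full database.

    Assumes E grows linearly with database length; edge-effect
    (effective length) corrections are ignored.
    """
    return chunk_evalue * (total_letters / chunk_letters)


def merge_hits(hits_by_chunk, chunk_letters, max_evalue=10.0):
    """Merge per-chunk hit lists into one list sorted by corrected E-value.

    hits_by_chunk: {chunk_name: [(subject_id, evalue), ...]}
    chunk_letters: {chunk_name: number of residues in that chunk}
    """
    total_letters = sum(chunk_letters.values())
    merged = []
    for chunk, hits in hits_by_chunk.items():
        for subject, evalue in hits:
            corrected = rescale_evalue(evalue, chunk_letters[chunk], total_letters)
            if corrected <= max_evalue:  # re-apply the cutoff after correction
                merged.append((subject, corrected))
    merged.sort(key=lambda hit: hit[1])
    return merged


# Hypothetical example: two unequal chunks and three hits.
chunk_letters = {"chunk1": 800_000_000, "chunk2": 200_000_000}
hits_by_chunk = {
    "chunk1": [("subjA", 1e-5)],
    "chunk2": [("subjB", 1e-5), ("subjC", 4.0)],
}
merged = merge_hits(hits_by_chunk, chunk_letters)
for subject, evalue in merged:
    print(subject, evalue)
```

Note that `subjC` passes the cutoff within its small chunk but drops out once its E-value is scaled to the full database. The alternative is to tell each search the full database size up front so that no post-hoc correction is needed; if memory serves, that is `-z` in NCBI blastall and `Z=` in WU-BLAST, but check the documentation for your version.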
>
>> Use of more threads requires more working (heap) memory for the search,
>> making less memory available to cache database files.  If the database
>> files aren't cached, more threads means more terribly slow disk head
>> seeking as the different threads request different pieces of the database.
>> If heap memory expands beyond the physical memory available, the system
>> will thrash.  With WU-BLAST, multiple threads are used by default, but if
>> memory is found to be limiting, the program automatically reduces the
>> number of threads employed, to avoid thrashing.
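The trade-off between threads and cache can be put as simple arithmetic: whatever RAM is not holding the cached database is what the threads' heap has to fit into. The sketch below is only an illustration of that budget and is NOT WU-BLAST's actual logic, which is not described in this thread. `max_threads` and the per-thread heap figures are assumptions.

```python
MIB = 2**20
GIB = 2**30


def max_threads(ram_bytes, db_bytes, heap_per_thread, cpus):
    """Largest thread count whose combined heap still leaves the
    database cached; never fewer than one thread, never more than cpus.

    Illustrative budget only, not how any BLAST program decides.
    """
    free_for_heap = ram_bytes - db_bytes
    affordable = free_for_heap // heap_per_thread
    return max(1, min(cpus, affordable))


ram = 2 * GIB   # from the original post
db = 1 * GIB    # assumed database size
cpus = 2        # dual-processor workstations

print(max_threads(ram, db, 400 * MIB, cpus))  # 2
print(max_threads(ram, db, 600 * MIB, cpus))  # 1
```

When the database alone is larger than RAM, the function still returns 1: a single thread at least avoids several threads competing for the disk head.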
>
>
> That's sensible - I didn't know it did that.
>
> Tim
>