What are you using to merge the outputs? (and to manage the stats...)

===============================================
David Coornaert (dcoorna at dbm.ulb.ac.be)
Belgian Embnet Node (http://www.be.embnet.org)
Université Libre de Bruxelles
Laboratoire de Bioinformatique
12, Rue des Professeurs Jeener & Brachet
6041 Gosselies
BELGIQUE
Tél: +3226509975
Fax: +3226509998
===============================================

Tim Cutts wrote:
>
> On 24 Sep 2005, at 7:40 pm, Warren Gish wrote:
>
>>> Hi, I'm the administrator of the bioinformatics laboratory at
>>> Université du Québec à Montréal. I have a room filled with dual
>>> P4 3 GHz workstations. The boxen are dual-booted with Windows and
>>> GNU/Linux, but they spend most of their time on GNU/Linux. Each box
>>> has 2 GB of RAM, so I expected decent performance with local BLAST
>>> jobs, but the sad truth is that my jobs run about 4 times slower
>>> with blast2 than with blastcl3 with the same parameters. The hard
>>> drive is IDE, so I suspect a bottleneck here.
>>>
>> Make sure the IDE drivers are configured to use DMA I/O, but if
>> repeat searches of a database are just as slow as the first search,
>> then experience indicates the problem is that the amount of free
>> memory available is insufficient to cache the database files.
>> Database file caching is a tremendous benefit for blastn searches.
>> If your jobs use too much heap memory, though, no memory may be
>> available for file caching.
>
> I often see caching problems when people have written their pipeline
> code incorrectly, too; people naturally tend to write things like:
>
>     foreach $seq (@sequences) {
>         foreach $db (@databases) {
>             system("blastn ...");
>         }
>     }
>
> which is, of course, exactly the wrong way round, and guarantees
> trashing the disk cache every single time.
>
> It's worthwhile to break your databases into chunks small enough for
> the entire thing to be cached on your compute nodes; until recently,
> we always broke nucleotide databases into 800 MB chunks. Of course,
> care then needs to be taken to get the statistics right when running
> lots of sequences against the individual chunks. If it fits your
> requirements, the automatic slicing that both blast flavours can do
> might work for you, but we do it manually.
>
>> Use of more threads requires more working (heap) memory for the
>> search, making less memory available to cache database files. If the
>> database files aren't cached, more threads mean more terribly slow
>> disk-head seeking as the different threads request different pieces
>> of the database. If heap memory expands beyond the physical memory
>> available, the system will thrash. With WU-BLAST, multiple threads
>> are used by default, but if memory is found to be limiting, the
>> program automatically reduces the number of threads employed, to
>> avoid thrashing.
>
> That's sensible - I didn't know it did that.
>
> Tim
>
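
As a footnote to the loop-ordering point Tim makes above, here is a
minimal sketch of the cache-friendly arrangement: databases on the
outer loop and queries on the inner loop, so each database's files stay
in the page cache for the whole batch of queries. The paths, chunk
names and BLAST command line below are illustrative placeholders, not
anyone's actual pipeline.

    #!/usr/bin/perl
    # Sketch only: outer loop over databases, inner loop over queries,
    # so each database is read from disk once and then served from the
    # page cache for every remaining query.
    use strict;
    use warnings;
    use File::Basename qw(basename);

    # Placeholder lists: pre-formatted database chunks and query files.
    my @databases = ('/data/blastdb/nt_chunk00', '/data/blastdb/nt_chunk01');
    my @sequences = glob('/data/queries/*.fa');

    foreach my $db (@databases) {          # outer: one database at a time
        foreach my $seq (@sequences) {     # inner: queries reuse the cached db
            my $out = basename($seq) . '.vs.' . basename($db) . '.out';
            system('blastall', '-p', 'blastn',
                   '-d', $db, '-i', $seq, '-o', $out) == 0
                or warn "BLAST failed for $seq vs $db: exit $?\n";
        }
    }

On the statistics question: when chunks are searched separately, each
chunk's E-values are computed against the size of that chunk rather
than the whole database. One common fix is to tell each search the
effective size of the full database (NCBI blastall has a -z option for
this, and WU-BLAST accepts an analogous Z= parameter); otherwise the
scores need to be re-evaluated when the per-chunk outputs are merged.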