[Bioclusters] Gridlet test of BLAST using datagrid directories.

Rick Westerman bioclusters@bioinformatics.org
Mon, 02 Dec 2002 16:30:40 -0500


Chris Dwan writes, concerning Don Gilbert's gridlet that downloads 
information to each node on an as-needed basis:

>   The fact that the target needs to be re-formatted
>every time we gain or lose a compute node seems particularly iffy.

    I had this concern as well: why go through the re-format (i.e., 
formatdb) each time you wish to run a job?  I know that my current 
weekly formatting of the databases takes a long time.  However, in 
trying out Don's gridlet I was pleasantly surprised to find that the 
format took an insignificant amount of time compared to the BLAST search 
itself.  This was using datasets of 2,000 sequences and an input of 50+ 
1,000 bp sequences.  Of course, reformatting a large dataset just to 
search it against an input of only one or two sequences would be 
time-inefficient.

    Naturally, a lot of other framework is needed aside from the 
gridlet.  Chris mentioned a few pieces, as well as the existing "queuing 
system of your choice is used to schedule jobs onto nodes, manage 
transient and permanent failures, stage data, and all that other neat 
stuff."  There is no reason, in my mind, that such a queuing system 
could not also handle jobs that split up the databases dynamically.  Such 
splitting may become more necessary as the data grows larger than our 
computers' memory.  Already I have a PC cluster with very limited memory 
(but it was "free" to me) that is limited in which datasets I can submit 
to it.
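    As a rough illustration of the splitting step, here is a minimal 
Python sketch of carving a FASTA database into chunks small enough to fit 
a node's memory; the chunk-size limit and function names are my own 
assumptions, not part of Don's gridlet, and a real queuing system would 
then write each chunk out and run formatdb/blastall on it separately.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def split_fasta(records, max_bases):
    """Group records into chunks whose total sequence length stays
    at or under max_bases, so each chunk can be formatted and
    searched on a node with limited memory."""
    chunks, current, size = [], [], 0
    for header, seq in records:
        # Start a new chunk when adding this record would exceed the cap.
        if current and size + len(seq) > max_bases:
            chunks.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += len(seq)
    if current:
        chunks.append(current)
    return chunks
```

    Each chunk could then be handed to the scheduler as an independent 
format-and-search job, with the per-chunk hit lists merged afterward.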

     In summary, I think that the gridlet might be a worthwhile tool.


-- Rick

Rick Westerman
westerman@purdue.edu

Phone: (765) 494-0505                         FAX: (765) 496-7255
Department of Horticulture and Landscape Architecture
625 Agriculture Mall Drive
West Lafayette, IN 47907-2010
Physically located in room S049, WSLR building

Bioinformatics specialist at the Genomics Facility.

http://www.genomics.purdue.edu/~westerm