[Bioclusters] BioPerl 1.2.3 and memory handling

Wed Dec 1 09:24:53 EST 2004

I'd like to bring the conversation back to Al's original problem with 
BTblastall.  BTblastall is a program of the type that runs a single 
query on many compute nodes of a cluster, using a pre-split blast 
target set (see also MPI Blast, Paracel's former offerings, and a 
variety of other solutions).  The short version is that while fixing 
issues in BioPerl is always good, the flaw is really with our algorithm 
for merging results back together after a bunch of reports have been 
generated on subsets of the blast target.

The algorithm we used works great for small jobs:
--------------------------------------------------------------------
For all blast reports which are part of this result
   For all queries within that report
     For all hits within that query
       hit_hash{query_unique_id . hit_unique_id} = hit
       unique_queries{query_unique_id}++
     roF
   roF
roF

Foreach query_unique_id (keys of unique_queries)
   print out a pseudo blast report based on hit_hash
hcaeroF
----------------------------------------------------------------------

People currently see problems with moderately sized results.  We could 
undoubtedly put band-aids on the problem by "fixing" BioPerl, 
compressing our hash, rewriting in a compiled language, or whatever.  
The real problem is that this single-pass algorithm has to hold all of 
the hits in memory at one time.  It will therefore never be suitable 
for truly monstrous jobs.
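
To make the memory behaviour concrete, here is a minimal Python sketch 
of the single-pass merge.  It is not the BTblastall code (which is Perl 
on top of BioPerl).  It assumes, purely for illustration, that each 
partial report is in blast's tab-separated table format ("-m 8") with 
the query id in column 1 and the bit score in column 12.  The function 
name single_pass_merge is made up for this example.

```python
from collections import defaultdict


def single_pass_merge(reports):
    """Merge tabular BLAST reports in one pass.

    `reports` is an iterable of line sources (open files or lists of
    lines), one per target-database subset.  Every hit from every
    report is kept in `hits_by_query` until the end, which is the
    memory problem described above.
    """
    hits_by_query = defaultdict(list)  # query id -> [(bit score, line)]
    for report in reports:
        for line in report:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 12:
                continue  # skip blank lines, comments, malformed rows
            query_id = fields[0]
            bit_score = float(fields[11])
            hits_by_query[query_id].append((bit_score, line.rstrip("\n")))

    # Only now, with all hits resident, can any query be written out.
    merged = {}
    for query_id, hits in hits_by_query.items():
        hits.sort(key=lambda hit: hit[0], reverse=True)
        merged[query_id] = [line for _, line in hits]
    return merged
```

The dictionary grows with the total number of hits across all queries 
and all database subsets, so peak memory is set by the size of the 
whole job rather than by the largest single query.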

One answer would be to modify our merge step to make multiple passes 
(an independent merge for each query).  Another, for sufficiently 
large jobs, is to split on query sequences instead of the target dbs 
and eliminate the merge entirely.  This second alternative is the 
direction I recommend for really large BLAST runs, since the merge step 
represents computational overhead which can be removed entirely, and 
chasing the memory issue skirts the real problem.
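
Both alternatives can be sketched in a few lines of Python.  As before, 
this is only an illustration under assumptions, not what BTblastall 
does: merge_one_query assumes "-m 8" table output with the bit score in 
column 12, split_fasta assumes plain FASTA input, and both function 
names are invented for this example.

```python
def merge_one_query(query_id, reports):
    """One merge pass: collect the hits for a single query only.

    `reports` must be re-iterable (lists of lines here; with files on
    disk you would reopen each report for every pass).
    """
    hits = []
    for report in reports:
        for line in report:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 12 and fields[0] == query_id:
                hits.append((float(fields[11]), line.rstrip("\n")))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [line for _, line in hits]


def split_fasta(lines, records_per_chunk):
    """Yield lists of FASTA lines, at most `records_per_chunk`
    sequences per list, so each chunk can be blasted against the
    whole target database on its own node."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == records_per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk
```

The per-query merge holds only one query's hits at a time, at the price 
of rereading every report once per query.  With query splitting, each 
node's report is already complete for its own queries, so the reports 
only need to be concatenated.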

-Chris Dwan
  The BioTeam

On Nov 30, 2004, at 4:10 PM, Mike Cariaso wrote:

> Al,
>
> While I'm certainly learning a bit from the
> bioperlers, we seem to have strayed a bit from your
> original question.
>
> If you don't need to see the alignments, you might
> wish to investigate if your software can be made to
> use blast's table output ("blastall -m 8" I believe).
> Perhaps the bioperl parser will recognize the format,
> and will be able to complete since it will have no
> alignments to eat up memory. If it's not automatically
> recognized, writing a parser for this might be pretty
> simple.
>
> If you need the alignments but don't need all the
> statistics, you might wish to use the BPLite parser,
> which manages to handle some reports that the SearchIO
> parser cannot.
>
> If you need both, you can probably still use BPLite,
> but you'll need to do a bit more work.
>
> Sadly, I don't believe that the XML (-m 7) format is
> handled by bioperl yet. That would probably solve all
> of these issues.
>
>
> That'll teach you to ask a question! ;)
> Mike Cariaso
>
>
>
>
> --- Al Tucker <act at comm.rockefeller.edu> wrote:
>
>> Hi everybody.
>>
>> We're new to the Inquiry Xserve scientific cluster
>> and trying to iron
>> out a few things.
>>
>> One thing we seem to be coming up against is an
>> out of memory
>> error when getting large sequence analysis results
>> (5,000 seq - at
>> least- and above) back from BTblastall. The problem
>> seems to be with
>> BioPerl.
>>
>> Might anyone here know if BioPerl knows enough
>> not to try and
>> access more than 4 GB of RAM in a single process (an
>> OS X limit)? I'm
>> told Blastall and BTblastall are and will chunk
>> problems accordingly,
>> but we're not certain if BioPerl is when called to
>> merge large Blast
>> results back together. It's the default version
>> 1.2.3 that's supplied
>> btw, and OS X 10.3.5 with all current updates just
>> short of the
>> latest 10.3.6 update.
>>
>> - Al Tucker
>
>