[Bioclusters] Average Pairwise Scores for reciprocal Blast hits

Jason Stajich jason.stajich at duke.edu
Thu Jun 2 08:58:15 EDT 2005


To avoid mapping everything in memory try using DB_File to tie the  
hash to a file - you still use the hash as normal but all the fetch/ 
store calls are done through BerkeleyDB (or equivalent) on the  
flatfile. But you'll have to adjust your strategy of Hashes-of-Hashes  
since you can only do key-value storage and not HoH.  When I do this  
sort of thing I use a B-TREE with DB_File so that each key can have  
multiple values -- it requires a loop on the available values for a  
key but this is fairly small number for each key since the matrix is  
generally sparse (most proteins do not have more than a hundred hits  
depending on your sequences).

Other options include storing the data in a database and changing  
your strategy to use DBI fetches.

-jason

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12/


On Jun 1, 2005, at 8:25 AM, Intikhab Alam wrote:

> Hi,
>
> I performed All-vs-All blast for 214777 proteins and wanted to
> cluster these proteins. For clustering, I read that its better
> if we use average pairwise evalues for bi-directional hits.
>
> I wrote a perl script to read blast results, to store hit evalues
> for each query in a hash reference, to find bi-directionabl blast
> hits and calculate average pairwise evalues for them.
> I used hash references since these can be bit fast.
>
>
> I tested the following script on a machine with 2G of ram and on
> another with 94G of ram. It ran out of memory in both cases.
>
> Is it the case that it is not possible to hold this much data into
> a hash?
>
> Any suggestions to resolve the problem?
>
>
> Many Thanks,
>
> Intikhab Alam,
>
>
>
> I use the script as follows:
>
> To Fill the hash:
>
> while(<FILE>){
>     my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,  
> $send, $evalue, $bit)= split /\s+|\t+/, $_;
>     $query_results->{$q}{$h}{evalue}=$evalue;
> }
> close FILE;
>
> To get average pairwise evalues:
>
> my ($c, $hits)=(0, 0);
> open F, ">pairwiseValues";
> foreach my $query (%{$query_results}){
>         if (exists ($query_results->{$query})){
>         $c++;
>         print "Query:$c:$query\n";
>         foreach my $hit (%{$query_results->{$query}}){
>                 if (exists ($query_results->{$query}{$hit}{evalue})){
>
>                 my $k=$query_results->{$query}{$hit}{evalue};
>                         if ((exists ($query_results->{$query}{$hit} 
> {evalue}) && ($query_results->{$hit}{$query}{evalue}))){
>                         my $i= $query_results->{$query}{$hit}{evalue};
>                         my $j= $query_results->{$hit}{$query}{evalue};
>
>                         $k=(($i+$j)/2);
>
>                         $query_results->{$query}{$hit}{evalue}=$k;
>                         $query_results->{$hit}{$query}{evalue}=$k;
>                         }
>                 $hits++;
>                 print F "$query\t$hit\t$k\n";
>                 }
>         }
>         }
> }
> close F;
> print "$hits hits read for $c queries \n";
>
> ----------------------------------------------
>
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>



More information about the Bioclusters mailing list