[Bioclusters] Average Pairwise Scores for reciprocal Blast hits
Jason Stajich
jason.stajich at duke.edu
Thu Jun 2 08:58:15 EDT 2005
To avoid mapping everything in memory try using DB_File to tie the
hash to a file - you still use the hash as normal but all the fetch/
store calls are done through BerkeleyDB (or equivalent) on the
flatfile. But you'll have to adjust your strategy of Hashes-of-Hashes
since you can only do key-value storage and not HoH. When I do this
sort of thing I use a B-TREE with DB_File so that each key can have
multiple values -- it requires a loop on the available values for a
key but this is fairly small number for each key since the matrix is
generally sparse (most proteins do not have more than a hundred hits
depending on your sequences).
Other options include storing the data in a database and changing
your strategy to use DBI fetches.
-jason
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12/
On Jun 1, 2005, at 8:25 AM, Intikhab Alam wrote:
> Hi,
>
> I performed All-vs-All blast for 214777 proteins and wanted to
> cluster these proteins. For clustering, I read that its better
> if we use average pairwise evalues for bi-directional hits.
>
> I wrote a perl script to read blast results, to store hit evalues
> for each query in a hash reference, to find bi-directionabl blast
> hits and calculate average pairwise evalues for them.
> I used hash references since these can be bit fast.
>
>
> I tested the following script on a machine with 2G of ram and on
> another with 94G of ram. It ran out of memory in both cases.
>
> Is it the case that it is not possible to hold this much data into
> a hash?
>
> Any suggestions to resolve the problem?
>
>
> Many Thanks,
>
> Intikhab Alam,
>
>
>
> I use the script as follows:
>
> To Fill the hash:
>
> while(<FILE>){
> my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst,
> $send, $evalue, $bit)= split /\s+|\t+/, $_;
> $query_results->{$q}{$h}{evalue}=$evalue;
> }
> close FILE;
>
> To get average pairwise evalues:
>
> my ($c, $hits)=(0, 0);
> open F, ">pairwiseValues";
> foreach my $query (%{$query_results}){
> if (exists ($query_results->{$query})){
> $c++;
> print "Query:$c:$query\n";
> foreach my $hit (%{$query_results->{$query}}){
> if (exists ($query_results->{$query}{$hit}{evalue})){
>
> my $k=$query_results->{$query}{$hit}{evalue};
> if ((exists ($query_results->{$query}{$hit}
> {evalue}) && ($query_results->{$hit}{$query}{evalue}))){
> my $i= $query_results->{$query}{$hit}{evalue};
> my $j= $query_results->{$hit}{$query}{evalue};
>
> $k=(($i+$j)/2);
>
> $query_results->{$query}{$hit}{evalue}=$k;
> $query_results->{$hit}{$query}{evalue}=$k;
> }
> $hits++;
> print F "$query\t$hit\t$k\n";
> }
> }
> }
> }
> close F;
> print "$hits hits read for $c queries \n";
>
> ----------------------------------------------
>
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>
More information about the Bioclusters
mailing list