To avoid mapping everything in memory try using DB_File to tie the hash to a file - you still use the hash as normal but all the fetch/ store calls are done through BerkeleyDB (or equivalent) on the flatfile. But you'll have to adjust your strategy of Hashes-of-Hashes since you can only do key-value storage and not HoH. When I do this sort of thing I use a B-TREE with DB_File so that each key can have multiple values -- it requires a loop on the available values for a key but this is fairly small number for each key since the matrix is generally sparse (most proteins do not have more than a hundred hits depending on your sequences). Other options include storing the data in a database and changing your strategy to use DBI fetches. -jason -- Jason Stajich Duke University http://www.duke.edu/~jes12/ On Jun 1, 2005, at 8:25 AM, Intikhab Alam wrote: > Hi, > > I performed All-vs-All blast for 214777 proteins and wanted to > cluster these proteins. For clustering, I read that its better > if we use average pairwise evalues for bi-directional hits. > > I wrote a perl script to read blast results, to store hit evalues > for each query in a hash reference, to find bi-directionabl blast > hits and calculate average pairwise evalues for them. > I used hash references since these can be bit fast. > > > I tested the following script on a machine with 2G of ram and on > another with 94G of ram. It ran out of memory in both cases. > > Is it the case that it is not possible to hold this much data into > a hash? > > Any suggestions to resolve the problem? > > > Many Thanks, > > Intikhab Alam, > > > > I use the script as follows: > > To Fill the hash: > > while(<FILE>){ > my ($q, $h, $iden, $alilen, $mism, $gapo, $qst, $qend, $sst, > $send, $evalue, $bit)= split /\s+|\t+/, $_; > $query_results->{$q}{$h}{evalue}=$evalue; > } > close FILE; > > To get average pairwise evalues: > > my ($c, $hits)=(0, 0); > open F, ">pairwiseValues"; > foreach my $query (%{$query_results}){ > if (exists ($query_results->{$query})){ > $c++; > print "Query:$c:$query\n"; > foreach my $hit (%{$query_results->{$query}}){ > if (exists ($query_results->{$query}{$hit}{evalue})){ > > my $k=$query_results->{$query}{$hit}{evalue}; > if ((exists ($query_results->{$query}{$hit} > {evalue}) && ($query_results->{$hit}{$query}{evalue}))){ > my $i= $query_results->{$query}{$hit}{evalue}; > my $j= $query_results->{$hit}{$query}{evalue}; > > $k=(($i+$j)/2); > > $query_results->{$query}{$hit}{evalue}=$k; > $query_results->{$hit}{$query}{evalue}=$k; > } > $hits++; > print F "$query\t$hit\t$k\n"; > } > } > } > } > close F; > print "$hits hits read for $c queries \n"; > > ---------------------------------------------- > > > _______________________________________________ > Bioclusters maillist - Bioclusters at bioinformatics.org > https://bioinformatics.org/mailman/listinfo/bioclusters >