Sorry for the fig... Here is empirical evidence for the non N**N nature of the algorithm... (I am board) X = number of HSPs Y = number of inner loop itterations. As you can see it is not nearly N**N, and is bounded by NlogN except for small X (X<50). The outer loop is visited once for each different cluster of HSPs (C), so for small X that is likely to be a bigger fraction of the total, hence more outer loop visits per inner. You get a near perfect fit with f(x) = 2x so I guess I can call the complexity N, if it wasn't for all those pesky splices... There are 2*2N+C splices, and each splice is O(N)... (again shrinking list size = faster to splice, but not sure exactly) Scaleable? Sob.... On 27 Aug 2003, Joseph Landman wrote: > On Wed, 2003-08-27 at 17:30, Dan Bolser wrote: > > Joseph Landman said: > > > Hi Dan: > > > > > > I am assuming that your arrays contain not HSP's, but some sort of > > > object representing the HSP ala BioPerl. Is this correct? > > > > > > > Nope, just a simple hash, key => value, > > I.E. > > > > { > > hsp_hit-from => 5, > > hsp_hit-to => 10, > > score => 20, > > } > > > > Ok. > > Comments interspersed: > > > > >> __SKIP__ > > >> > > >> preamble > > >> > > >> @hsps = array of HSP hashes, for a particular protein > > >> each HSP can be from several SCOP sequences. > > >> > > >> __RESUME__ > > >> > > >> my @result; # Final HSP's > > >> > > >> TOP:while (@hsps){ # NB: Ordered by Hsp_query-from > > >> # (for optimzation). > > >> > > >> my $p = 0; # Current HSP pointer. > > >> > > >> MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider. > > These two loops give you something comparable to an O(N^2) algorithm. > That is, I can rewrite the while and the for as > > foreach my $hsp (@hsps) # Outer O(N): 1 iteration per entry > foreach my $index (0 .. $#hsps) # Inner O(N): 1 iteration per > entry > > > > > > >> > > >> # Family overlap only! > > >> > > >> next MID if > > >> $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS}; > > Short circuits are good. > > > >> > > >> # Optimization. > > >> > > >> if ( $THRESH > > > >> $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){ > > >> > > >> shift @hsps; > > >> next TOP; > > >> } > > A shift or a similar operation (splice, push, etc) will likely trigger > the garbage collector in Perl, as well as the memory allocator > > > >> > > >> # Pick best of pair (removing the other from the list). > > >> > > >> if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){ > > >> splice (@hsps, $p, 1); > > >> $j--; > > >> $p = $j; > > >> } > > >> else { > > >> splice (@hsps, $j, 1); > > >> $j--; > > >> } > > >> } > > >> push @result, splice(@hsps, $p, 1); > > >> } > > >> print "OK\n\n"; > > I might suggest that instead of moving data around (splice, push, shift) > and causing allocation/GC events, that you might maintain several lists, > indicating which class the HSP's belong to. > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: itter_v_hsp_number3.gif Type: image/gif Size: 6629 bytes Desc: Url : http://bioinformatics.org/pipermail/biodevelopers/attachments/20030828/1d46b995/itter_v_hsp_number3.gif