[Biodevelopers] splice and clear...

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Thu Aug 28 14:00:18 EDT 2003


Sorry for the fig...

Here is empirical evidence for the non N**N
nature of the algorithm... (I am board)

X = number of HSPs
Y = number of inner loop itterations.

As you can see it is not nearly N**N, and is
bounded by NlogN except for small X (X<50).

The outer loop is visited once for each different
cluster of HSPs (C), so for small X that is likely
to be a bigger fraction of the total, hence more outer
loop visits per inner. 

You get a near perfect fit with f(x) = 2x 

so I guess I can call the complexity N, if it wasn't for
all those pesky splices...

There are 2*2N+C splices, and each splice is O(N)...

(again shrinking list size = faster to splice, but not
sure exactly)

Scaleable?

Sob....



On 27 Aug 2003, Joseph Landman wrote:

> On Wed, 2003-08-27 at 17:30, Dan Bolser wrote:
> > Joseph Landman said:
> > > Hi Dan:
> > >
> > >   I am assuming that your arrays contain not HSP's, but some sort of
> > > object representing the HSP ala BioPerl.  Is this correct?
> > >
> > 
> > Nope, just a simple hash, key => value,
> > I.E.
> > 
> > {
> >   hsp_hit-from => 5,
> >   hsp_hit-to   => 10,
> >   score        => 20,
> > }
> > 
> 
> Ok.
> 
> Comments interspersed:
> 
> 
> > >> __SKIP__
> > >>
> > >> preamble
> > >>
> > >> @hsps = array of HSP hashes, for a particular protein
> > >> each HSP can be from several SCOP sequences.
> > >>
> > >> __RESUME__
> > >>
> > >>   my @result;						# Final HSP's
> > >>
> > >>   TOP:while (@hsps){			  # NB: Ordered by Hsp_query-from
> > >>                                           # (for optimzation).
> > >>
> > >>     my $p = 0;	                          # Current HSP pointer.
> > >>
> > >>     MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.
> 
> These two loops give you something comparable to an O(N^2) algorithm. 
> That is, I can rewrite the while and the for as
> 
> 	foreach my $hsp (@hsps) # Outer O(N): 1 iteration per entry
>   	  foreach my $index (0 .. $#hsps) # Inner O(N): 1 iteration  per
> entry
> 
> 
> 
> 
> > >>
> > >>      # Family overlap only!
> > >>
> > >>       next MID if
> > >>         $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};
> 
> Short circuits are good.
> 
> > >>
> > >>       # Optimization.
> > >>
> > >>       if ( $THRESH >
> > >>              $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> > >>
> > >>         shift @hsps;
> > >>         next TOP;
> > >>       }
> 
> A shift or a similar operation (splice, push, etc) will likely trigger
> the garbage collector in Perl, as well as the memory allocator
> 
> > >>
> > >>       # Pick best of pair (removing the other from the list).
> > >>
> > >>       if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){
> > >>         splice (@hsps, $p, 1);
> > >>         $j--;
> > >>         $p = $j;
> > >>       }
> > >>       else {
> > >>         splice (@hsps, $j, 1);
> > >>         $j--;
> > >>       }
> > >>     }
> > >>     push @result, splice(@hsps, $p, 1);
> > >>   }
> > >>   print "OK\n\n";
> 
> I might suggest that instead of moving data around (splice, push, shift)
> and causing allocation/GC events, that you might maintain several lists,
> indicating which class the HSP's belong to.
> 
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itter_v_hsp_number3.gif
Type: image/gif
Size: 6629 bytes
Desc: 
Url : http://bioinformatics.org/pipermail/biodevelopers/attachments/20030828/1d46b995/itter_v_hsp_number3.gif


More information about the Biodevelopers mailing list