[Biodevelopers] Re: splice and clear...

Thu Aug 28 14:34:11 EDT 2003

On Thu, 2003-08-28 at 14:00, Dan Bolser wrote:
> Sorry for the fig...
> 
> Here is empirical evidence for the non N**N
> nature of the algorithm... (I am board)

Not N**N (e.g. N^N or N to the Nth power), but N*N (or N^2, N squared).

> X = number of HSPs
> Y = number of inner loop itterations.

So the inner loop goes c*N (c = some constant, empirically determined to
be 2).  The outer loop goes N times.

This is still O(N^2) (Order N squared), if I understand your
plot/analysis correctly.

> As you can see it is not nearly N**N, and is
> bounded by NlogN except for small X (X<50).

I don't see a plot of 

	y = constant_1 * x log x + constant_2

for "optimal" values of the constants.  All I see in the plot is 

	. test o (red dot)
	. x*2 (green line)
	. x    (blue line)

As Y is the number of iterations around the inner loop (outer loop we
agree will have N iterations), it appears that you have found evidence
to support that the inner loop iterations scale as N*2 (2N if you
prefer).

> 
> The outer loop is visited once for each different
> cluster of HSPs (C), so for small X that is likely
> to be a bigger fraction of the total, hence more outer
> loop visits per inner. 
> 
> You get a near perfect fit with f(x) = 2x 

Well, I haven't see the fitting analysis, but using the old approximate
eyeball technique, I agree that the inner loop as you have described it,
appears to be O(N).  Makes perfect sense.

> so I guess I can call the complexity N, if it wasn't for
> all those pesky splices...

Uh... not really.  You still have that outer loop, and that outer loop
is O(N).  Unless you can make the inner loop O(log N) which would be
recrafting the algorithm significantly, you are stuck with an outer loop
over all HSPs, and an inner loop over some fraction of the same set of
HSPs.

This reminds me of factorized matrix codes.  Similar concept, each
additional row processed reduces the elements to process by one. 
Doesn't change the O(N^2) nature, just the constant out front of it. 

Example:  

Iteration 1:

	# N=5
	# define M=N
	@hsps = qw( 1 2 3 4 5 );

inner loop over 1..4  (order N)
find one that is greater than $THRESH, shift it off the front, outer
loop next iteration. 

Iteration 2:
	# N is still 5,
	# let M be N-1
	@hsps = qw( 2 3 4 5 );	

inner loop over 2..4 (order N-1, but for large enough N, you drop the 1)

inner loop over 2..4  (order N-1, see rest of comment above)
find one that is greater than $THRESH, shift it off the front, outer
loop next iteration. 

Iteration 3:
	# N is still 5,
	# let M be N-2
	@hsps = qw( 3 4 5 );	

inner loop over 3..4 (order N-2, but for large enough N, you drop the 1)

inner loop over 3..4 (order N-2, see rest of comment above)
find one that is greater than $THRESH, shift it off the front, outer
loop next iteration. 

...

Iteration 5:
	# N is still 5,
	# let M be N-4
	@hsps = qw( 5 );

No inner loop.

So you have this inner loop which has this pesky O(N) thing happening. 
You are working your way down the O(N) outside loop.  It is still O(N*N)
or order N squared or O(N^2).  For those of us old Fortran types, it is
O(N**2).

> There are 2*2N+C splices, and each splice is O(N)...

Yup.  Thats my point.  

  Loop over splice (O(N))
   calculate the splice (O(N))

Algorithm is O(N**2).

> (again shrinking list size = faster to splice, but not
> sure exactly)
> 
> Scaleable?
> 
> Sob....

Unfortunately not, though there are tricks you can play if you can
decompose your list of HSP's, and do some boundary bits.  Make M
different HSP lists, each 1/Mth of the total list.  Assign a processor
to do these M lists.  You will still have to deal with boundaries, and
you will have to communicate and fuse your results back, but you could
(almost) turn an O(N*N) problem into an O(N*N/M) problem.  I am glossing
over an amazing number of details here, but I am proposing that if this
is really the expensive portion of your computing, you might want to
invest the time into parallelizing it.

> 
> 
> 
> On 27 Aug 2003, Joseph Landman wrote:
> 
> > On Wed, 2003-08-27 at 17:30, Dan Bolser wrote:
> > > Joseph Landman said:
> > > > Hi Dan:
> > > >
> > > >   I am assuming that your arrays contain not HSP's, but some sort of
> > > > object representing the HSP ala BioPerl.  Is this correct?
> > > >
> > > 
> > > Nope, just a simple hash, key => value,
> > > I.E.
> > > 
> > > {
> > >   hsp_hit-from => 5,
> > >   hsp_hit-to   => 10,
> > >   score        => 20,
> > > }
> > > 
> > 
> > Ok.
> > 
> > Comments interspersed:
> > 
> > 
> > > >> __SKIP__
> > > >>
> > > >> preamble
> > > >>
> > > >> @hsps = array of HSP hashes, for a particular protein
> > > >> each HSP can be from several SCOP sequences.
> > > >>
> > > >> __RESUME__
> > > >>
> > > >>   my @result;						# Final HSP's
> > > >>
> > > >>   TOP:while (@hsps){			  # NB: Ordered by Hsp_query-from
> > > >>                                           # (for optimzation).
> > > >>
> > > >>     my $p = 0;	                          # Current HSP pointer.
> > > >>
> > > >>     MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.
> > 
> > These two loops give you something comparable to an O(N^2) algorithm. 
> > That is, I can rewrite the while and the for as
> > 
> > 	foreach my $hsp (@hsps) # Outer O(N): 1 iteration per entry
> >   	  foreach my $index (0 .. $#hsps) # Inner O(N): 1 iteration  per
> > entry
> > 
> > 
> > 
> > 
> > > >>
> > > >>      # Family overlap only!
> > > >>
> > > >>       next MID if
> > > >>         $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};
> > 
> > Short circuits are good.
> > 
> > > >>
> > > >>       # Optimization.
> > > >>
> > > >>       if ( $THRESH >
> > > >>              $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> > > >>
> > > >>         shift @hsps;
> > > >>         next TOP;
> > > >>       }
> > 
> > A shift or a similar operation (splice, push, etc) will likely trigger
> > the garbage collector in Perl, as well as the memory allocator
> > 
> > > >>
> > > >>       # Pick best of pair (removing the other from the list).
> > > >>
> > > >>       if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){
> > > >>         splice (@hsps, $p, 1);
> > > >>         $j--;
> > > >>         $p = $j;
> > > >>       }
> > > >>       else {
> > > >>         splice (@hsps, $j, 1);
> > > >>         $j--;
> > > >>       }
> > > >>     }
> > > >>     push @result, splice(@hsps, $p, 1);
> > > >>   }
> > > >>   print "OK\n\n";
> > 
> > I might suggest that instead of moving data around (splice, push, shift)
> > and causing allocation/GC events, that you might maintain several lists,
> > indicating which class the HSP's belong to.
> > 
> > 
> > 
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615