[Biodevelopers] Re: splice and clear...

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Thu Aug 28 17:47:04 EDT 2003


Joseph Landman said:
> On Thu, 2003-08-28 at 14:00, Dan Bolser wrote:
>> Sorry for the fig...
>>
>> Here is empirical evidence for the non N**N
>> nature of the algorithm... (I am bored)
>
> Not N**N (e.g. N^N or N to the Nth power), but N*N (or N^2, N squared).

sorry, my mistake


>> X = number of HSPs
>> Y = number of inner loop iterations.
>
> So the inner loop goes c*N (c = some constant, empirically determined to be 2).
> The outer loop goes N times.
>
> This is still O(N^2) (Order N squared), if I understand your
> plot/analysis correctly.

I think you make one mistake in interpretation...

>
>> As you can see it is not nearly N**N, and is
>> bounded by NlogN except for small X (X<50).
>
> I don't see a plot of
>
> 	y = constant_1 * x log x + constant_2
>
> for "optimal" values of the constants.  All I see in the plot is
>
> 	. test o (red dot)
> 	. x*2 (green line)
> 	. x    (blue line)


Sorry, I can send along the plot I had, but I thought the 2N green line
was better (I can't work out how to shift the legend in gnuplot,
so it looked a bit messy with all three lines).

> As Y is the number of iterations around the inner loop (outer loop we agree will
> have N iterations), it appears that you have found evidence to support that the
> inner loop iterations scale as N*2 (2N if you
> prefer).

:) it is better.

>>
>> The outer loop is visited once for each different
>> cluster of HSPs (C), so for small X that is likely
>> to be a bigger fraction of the total, hence more outer
>> loop visits per inner.
>>
>> You get a near perfect fit with f(x) = 2x
>
> Well, I haven't seen the fitting analysis, but using the old approximate eyeball
> technique, I agree that the inner loop, as you have described it, appears to be
> O(N).  Makes perfect sense.

Hmmm, above you said..

> This is still O(N^2) (Order N squared)

The N**2 line should outstrip anything shown, but maybe it explains
the N<50 upper trend, which is quite pronounced.

>> so I guess I can call the complexity N, if it wasn't for
>> all those pesky splices...
>
> Uh... not really.  You still have that outer loop, and that outer loop is O(N).
> Unless you can make the inner loop O(log N) which would be recrafting the
> algorithm significantly, you are stuck with an outer loop over all HSPs, and an
> inner loop over some fraction of the same set of HSPs.

No, I think this is where you make the mistake: the outer loop iterates over
the same array we are shortening in the inner loop. It starts out with N
elements, but each pass through the inner loop shortens the array by the
number of inner-loop removals, so by the time we return to the outer loop
there is less left to visit. The outer loop therefore only runs C times
(once per cluster), not N times.
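To make the counting concrete, here is a minimal Python sketch of that shrinking-array pattern (toy (family, E-value) tuples instead of the real HSP hashes, and none of the positional logic from the Perl). Every same-family comparison removes one element, so removals total N-C, and the outer loop runs once per family (C times):

```python
def keep_best_per_family(hsps):
    """Simplified model of the splice-and-clear loop (NOT the original
    Perl): within each family, keep only the lowest-E-value HSP.
    Each same-family comparison removes one element, so the list
    shrinks as we go and the outer loop runs C times, not N."""
    result, inner, outer = [], 0, 0
    while hsps:
        outer += 1
        p, j = 0, 1
        while j < len(hsps):
            inner += 1
            if hsps[p][0] != hsps[j][0]:   # different family: slide past
                j += 1
                continue
            if hsps[j][1] < hsps[p][1]:    # j is better: keep it at p
                hsps[p], hsps[j] = hsps[j], hsps[p]
            hsps.pop(j)                    # drop the worse; list shrinks
        result.append(hsps.pop(p))         # like the final splice/push
    return result, inner, outer
```

Here the N-C removals are guaranteed; the extra inner iterations are the different-family skips, which is where the data layout (and the sort on Hsp_query-from) decides whether the total stays near 2N or drifts toward N*C.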


> This reminds me of factorized matrix codes.  Similar concept, each additional row
> processed reduces the elements to process by one.
> Doesn't change the O(N^2) nature, just the constant out front of it.
>
> Example:
>
> Iteration 1:
>
> 	# N=5
> 	# define M=N
> 	@hsps = qw( 1 2 3 4 5 );
>
> inner loop over 1..4  (order N)
> find one that is greater than $THRESH, shift it off the front, outer loop next
> iteration.
>
> Iteration 2:
> 	# N is still 5,

N is now 4 in my algorithm.

> So you have this inner loop which has this pesky O(N) thing happening.  You are
> working your way down the O(N) outside loop.  It is still O(N*N) or order N
> squared or O(N^2).  For those of us old Fortran types, it is O(N**2).

I have bad habits from excel ;)


>> There are 2*2N+C splices, and each splice is O(N)...
>
> Yup.  Thats my point.
>
>   Loop over splice (O(N))
>    calculate the splice (O(N))

Hmmm, I need to look back to see how the splice affects
things again...
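On the splice cost itself: each splice(@hsps, $i, 1) shifts the tail of the array down by one, so removing K elements one splice at a time is O(K*N), while a single filtering pass over the array is O(N). A minimal Python sketch of that difference (list.pop(i) shifts the tail just as splice does; the function names are mine, not from the thread):

```python
def remove_by_splice(xs, drop):
    """Remove the indices in `drop` one at a time: each pop(i)
    shifts the tail, so this is O(k*n) overall."""
    xs = list(xs)
    for i in sorted(drop, reverse=True):   # back-to-front keeps indices valid
        xs.pop(i)                          # like splice(@xs, $i, 1)
    return xs

def remove_by_filter(xs, drop):
    """One filtering pass touches each element once: O(n) overall."""
    dropped = set(drop)
    return [x for i, x in enumerate(xs) if i not in dropped]
```

Both produce the same list; only the amount of element-shuffling (and, in Perl, allocator/GC pressure) differs.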

>> Scaleable?
>>
>> Sob....
>
> Unfortunately not, though there are tricks you can play if you can decompose your
> list of HSP's, and do some boundary bits.  Make M
> different HSP lists, each 1/Mth of the total list.  Assign a processor to do these
> M lists.  You will still have to deal with boundaries, and you will have to
> communicate and fuse your results back, but you could (almost) turn an O(N*N)
> problem into an O(N*N/M) problem.  I am glossing over an amazing number of details
> here, but I am proposing that if this is really the expensive portion of your
> computing, you might want to invest the time into parallelizing it.

Sounds interesting, and I am lucky enough to have access to multi-CPU
hardware, but I can't help thinking some clever pointers are all I need.
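A rough Python sketch of the decomposition Joseph describes (helper names are mine, and the boundary handling he warns about is entirely glossed over here): split the HSP list into M sublists, reduce each independently, then fuse the partial best-per-family results.

```python
def split_into_chunks(hsps, m):
    """Split a list into m roughly equal sublists. A real HSP split
    would have to respect cluster boundaries, which this ignores."""
    k, r = divmod(len(hsps), m)
    chunks, start = [], 0
    for i in range(m):
        end = start + k + (1 if i < r else 0)
        chunks.append(hsps[start:end])
        start = end
    return chunks

def fuse(partials):
    """Fuse per-chunk (family, E-value) results: keep the global
    lowest E-value seen for each family."""
    best = {}
    for part in partials:
        for fam, e in part:
            if fam not in best or e < best[fam]:
                best[fam] = e
    return sorted(best.items())
```

Each chunk can then go to its own processor, turning the O(N*N) work into roughly O(N*N/M) plus the fuse step, modulo all the boundary details.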

Thanks very much for all your time and help, and
sorry about those phantom repeat posts - was it just me?

Thanks again,
Dan.



>> > > >> __SKIP__
>> > > >>
>> > > >> preamble
>> > > >>
>> > > >> @hsps = array of HSP hashes, for a particular protein
>> > > >> each HSP can be from several SCOP sequences.
>> > > >>
>> > > >> __RESUME__
>> > > >>
>> > > >>   my @result;						# Final HSP's
>> > > >>
>> > > >>   TOP:while (@hsps){			  # NB: Ordered by Hsp_query-from
>> > > >>                                           # (for optimization).
>> > > >>
>> > > >>     my $p = 0;	                          # Current HSP pointer.
>> > > >>
>> > > >>     MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.
>> >
>> > These two loops give you something comparable to an O(N^2) algorithm.  That
>> is, I can rewrite the while and the for as
>> >
>> > 	foreach my $hsp (@hsps) # Outer O(N): 1 iteration per entry
>> >   	  foreach my $index (0 .. $#hsps) # Inner O(N): 1 iteration  per
>> > entry
>> >
>> >
>> >
>> >
>> > > >>
>> > > >>      # Family overlap only!
>> > > >>
>> > > >>       next MID if
>> > > >>         $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};
>> >
>> > Short circuits are good.
>> >
>> > > >>
>> > > >>       # Optimization.
>> > > >>
>> > > >>       if ( $THRESH >
>> > > >>              $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
>> > > >>
>> > > >>         shift @hsps;
>> > > >>         next TOP;
>> > > >>       }
>> >
>> > A shift or a similar operation (splice, push, etc) will likely trigger the
>> garbage collector in Perl, as well as the memory allocator
>> >
>> > > >>
>> > > >>       # Pick best of pair (removing the other from the list).
>> > > >>
>> > > >>       if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){
>> > > >>         splice (@hsps, $p, 1);
>> > > >>         $j--;
>> > > >>         $p = $j;
>> > > >>       }
>> > > >>       else {
>> > > >>         splice (@hsps, $j, 1);
>> > > >>         $j--;
>> > > >>       }
>> > > >>     }
>> > > >>     push @result, splice(@hsps, $p, 1);
>> > > >>   }
>> > > >>   print "OK\n\n";
>> >
>> > I might suggest that instead of moving data around (splice, push, shift) and
>> causing allocation/GC events, that you might maintain several lists,
>> indicating which class the HSP's belong to.
>> >
>> >
>> >
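Joseph's closing suggestion above - maintain several lists keyed by class instead of splicing one big array - could look something like this minimal Python sketch (toy (family, E-value) tuples again; note it ignores the positional $THRESH overlap test, which would need each bucket kept sorted by Hsp_query-from):

```python
from collections import defaultdict

def best_per_family_buckets(hsps):
    """One pass to bucket HSPs by family, one pass to take each
    bucket's minimum E-value: no splices at all, O(N) overall,
    and no allocator/GC churn from shifting array elements."""
    buckets = defaultdict(list)
    for fam, e in hsps:
        buckets[fam].append(e)
    return sorted((fam, min(es)) for fam, es in buckets.items())
```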
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC
> email: landman at scalableinformatics.com
>   web: http://scalableinformatics.com
> phone: +1 734 612 4615






More information about the Biodevelopers mailing list