On Thu, 2003-08-28 at 14:00, Dan Bolser wrote:

> Sorry for the fig...
>
> Here is empirical evidence for the non N**N
> nature of the algorithm... (I am bored)

Not N**N (e.g. N^N, or N to the Nth power), but N*N (or N^2, N squared).

> X = number of HSPs
> Y = number of inner loop iterations.

So the inner loop goes c*N (c = some constant, empirically determined to be 2), and the outer loop goes N times. This is still O(N^2) (order N squared), if I understand your plot/analysis correctly.

> As you can see it is not nearly N**N, and is
> bounded by NlogN except for small X (X<50).

I don't see a plot of y = constant_1 * x log x + constant_2 for "optimal" values of the constants. All I see in the plot is:

  . test o  (red dot)
  . x*2     (green line)
  . x       (blue line)

As Y is the number of iterations around the inner loop (we agree the outer loop will have N iterations), it appears that you have found evidence that the inner loop iterations scale as N*2 (2N if you prefer).

> The outer loop is visited once for each different
> cluster of HSPs (C), so for small X that is likely
> to be a bigger fraction of the total, hence more outer
> loop visits per inner.
>
> You get a near perfect fit with f(x) = 2x

Well, I haven't seen the fitting analysis, but using the old approximate eyeball technique, I agree that the inner loop, as you have described it, appears to be O(N). Makes perfect sense.

> so I guess I can call the complexity N, if it wasn't for
> all those pesky splices...

Uh... not really. You still have that outer loop, and that outer loop is O(N). Unless you can make the inner loop O(log N), which would mean recrafting the algorithm significantly, you are stuck with an outer loop over all HSPs and an inner loop over some fraction of the same set of HSPs.

This reminds me of factorized matrix codes. Similar concept: each additional row processed reduces the elements to process by one. That doesn't change the O(N^2) nature, just the constant out front of it.
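To make the point concrete, here is a minimal sketch (in Python, not the Perl from the thread) of why a nested scan over a shrinking list is worst-case O(N^2) even when a measured run looks linear: an early exit, like the $THRESH short circuit, can hide the quadratic term for friendly inputs. The function name and the early-exit flag are illustrative, not from the original code.

```python
def count_inner_iterations(hsps, early_exit=False):
    """Count inner-loop visits for a shrinking-list scan:
    each outer pass scans the rest of the list, then retires
    one element from the front."""
    count = 0
    work = list(hsps)
    while work:
        for j in range(1, len(work)):
            count += 1
            if early_exit:        # stand-in for the $THRESH short circuit
                break
        work.pop(0)               # one HSP retired per outer pass
    return count

n = 100
worst = count_inner_iterations(range(n))        # no short circuit
best = count_inner_iterations(range(n), True)   # always short-circuits
# worst is (n-1) + (n-2) + ... + 1 = n*(n-1)/2, i.e. O(N^2);
# best is about n, which is what a friendly data set can make you measure.
```

The gap between `worst` and `best` is the whole argument: the empirical 2N fit describes the data, not the algorithm.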
Example:

Iteration 1:
    # N=5
    # define M=N
    @hsps = qw( 1 2 3 4 5 );
    inner loop over 1..4 (order N)
    find one that is greater than $THRESH, shift it off the front,
    outer loop next iteration.

Iteration 2:
    # N is still 5, let M be N-1
    @hsps = qw( 2 3 4 5 );
    inner loop over 2..4 (order N-1, but for large enough N you drop the 1)
    find one that is greater than $THRESH, shift it off the front,
    outer loop next iteration.

Iteration 3:
    # N is still 5, let M be N-2
    @hsps = qw( 3 4 5 );
    inner loop over 3..4 (order N-2, see the comment above)
    find one that is greater than $THRESH, shift it off the front,
    outer loop next iteration.

...

Iteration 5:
    # N is still 5, let M be N-4
    @hsps = qw( 5 );
    No inner loop.

So you have this inner loop with its pesky O(N) thing happening while you work your way down the O(N) outer loop. It is still O(N*N), order N squared, O(N^2). For those of us old Fortran types, it is O(N**2).

> There are 2*2N+C splices, and each splice is O(N)...

Yup. That's my point.

    Loop over splices        (O(N))
        calculate the splice (O(N))

Algorithm is O(N**2).

> (again shrinking list size = faster to splice, but not
> sure exactly)
>
> Scaleable?
>
> Sob....

Unfortunately not, though there are tricks you can play if you can decompose your list of HSPs and do some boundary bits. Make M different HSP lists, each 1/Mth of the total list, and assign a processor to each of the M lists. You will still have to deal with boundaries, and you will have to communicate and fuse your results back, but you could (almost) turn an O(N*N) problem into an O(N*N/M) problem. I am glossing over an amazing number of details here, but I am proposing that if this is really the expensive portion of your computing, you might want to invest the time in parallelizing it.
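The arithmetic behind the O(N*N/M) claim can be sketched in a few lines. This is a hedged illustration only: boundary handling and fusing the per-list results back together, which the real code would need, are ignored, and the function names are made up for this example.

```python
def pairwise_work(n):
    """Worst-case inner-loop count for one list of n items:
    (n-1) + (n-2) + ... + 1 = n*(n-1)/2."""
    return n * (n - 1) // 2

def decomposed_work(n, m):
    """Total worst-case work after splitting n items into m equal
    sublists (assumes m divides n for simplicity), one per processor."""
    return m * pairwise_work(n // m)

n = 1000
single = pairwise_work(n)        # one big list: ~n^2/2
split4 = decomposed_work(n, 4)   # four sublists: ~n^2/8
# single / split4 is roughly 4, i.e. O(N*N) becomes O(N*N/M).
```

Each sublist costs about (N/M)^2/2, and there are M of them, so the total falls by a factor of M, minus whatever the boundary fusion costs.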
>
>
>
> On 27 Aug 2003, Joseph Landman wrote:
> > On Wed, 2003-08-27 at 17:30, Dan Bolser wrote:
> > > Joseph Landman said:
> > > > Hi Dan:
> > > >
> > > > I am assuming that your arrays contain not HSP's, but some sort of
> > > > object representing the HSP ala BioPerl. Is this correct?
> > >
> > > Nope, just a simple hash, key => value,
> > > I.E.
> > >
> > > {
> > >   hsp_hit-from => 5,
> > >   hsp_hit-to   => 10,
> > >   score        => 20,
> > > }
> >
> > Ok.
> >
> > Comments interspersed:
> >
> > > >> __SKIP__
> > > >>
> > > >> preamble
> > > >>
> > > >> @hsps = array of HSP hashes, for a particular protein
> > > >> each HSP can be from several SCOP sequences.
> > > >>
> > > >> __RESUME__
> > > >>
> > > >> my @result;              # Final HSP's
> > > >>
> > > >> TOP:while (@hsps){       # NB: Ordered by Hsp_query-from
> > > >>                          # (for optimization).
> > > >>
> > > >>   my $p = 0;             # Current HSP pointer.
> > > >>
> > > >>   MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.
> >
> > These two loops give you something comparable to an O(N^2) algorithm.
> > That is, I can rewrite the while and the for as
> >
> >   foreach my $hsp (@hsps)          # Outer O(N): 1 iteration per entry
> >   foreach my $index (0 .. $#hsps)  # Inner O(N): 1 iteration per entry
> >
> > > >>
> > > >>     # Family overlap only!
> > > >>
> > > >>     next MID if
> > > >>       $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};
> >
> > Short circuits are good.
> >
> > > >>
> > > >>     # Optimization.
> > > >>
> > > >>     if ( $THRESH >
> > > >>          $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> > > >>
> > > >>       shift @hsps;
> > > >>       next TOP;
> > > >>     }
> >
> > A shift or a similar operation (splice, push, etc.) will likely trigger
> > the garbage collector in Perl, as well as the memory allocator.
> >
> > > >>
> > > >>     # Pick best of pair (removing the other from the list).
> > > >>
> > > >>     if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){
> > > >>       splice (@hsps, $p, 1);
> > > >>       $j--;
> > > >>       $p = $j;
> > > >>     }
> > > >>     else {
> > > >>       splice (@hsps, $j, 1);
> > > >>       $j--;
> > > >>     }
> > > >>   }
> > > >>   push @result, splice(@hsps, $p, 1);
> > > >> }
> > > >> print "OK\n\n";
> >
> > I might suggest that instead of moving data around (splice, push, shift)
> > and causing allocation/GC events, you might maintain several lists,
> > indicating which class the HSP's belong to.

--
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615
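One hypothetical reading of the "maintain several lists" suggestion, sketched in Python rather than Perl: bucket the HSPs by class (SCCS) once up front, so the pairwise comparison never crosses classes and no splice/shift is needed just to skip mismatched classes. The dict keys and the per-class reduction below are illustrative stand-ins for the Perl hashes in the thread, not the author's exact logic.

```python
from collections import defaultdict

# Toy HSP records; field names mirror the hash keys discussed above.
hsps = [
    {"sccs": "a.1", "e_value": 1e-5},
    {"sccs": "a.1", "e_value": 1e-9},
    {"sccs": "b.2", "e_value": 1e-3},
]

by_class = defaultdict(list)
for hsp in hsps:
    by_class[hsp["sccs"]].append(hsp)   # no list surgery, just bucketing

# Example per-class reduction: keep the lowest (best) e-value per class.
best = {cls: min(group, key=lambda h: h["e_value"])
        for cls, group in by_class.items()}
```

The quadratic pairwise work still happens, but only within each bucket, and the allocator/GC churn from repeated splices on one big list goes away.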