[Biodevelopers] Re: splice advice...

Thu Aug 28 08:54:56 EDT 2003

On Thu, 2003-08-28 at 04:07, Dan Bolser wrote:

> Yup, but for each inner N, N gets smaller by one. is this still O(N^2) ?

Unfortunately, yes.  I think you have an extra factor of N distributed. 
The expression after factoring ...

N * ( 1 + (1-1/N) + (1-2/N) + ... + 0) 
  -->                                     <--
  N iterations

thus you can write this as approximately

N * ( N - sum of terms that look like m/N for m=1..N)

As N gets large, the sum gets small so to leading order, you have

O(N*N) or O(N^2).

[...]

> Hmmm, I am not sure how this relates to the above now I think about
> it.

Short circuits can improve the constant out front of the algorithms
order.  They can improve performance by avoiding unnecessary
computation.  You simply need the short circuit comparison time to be
short compared to the calculation it is wrapping.  The calculation needs
to be expensive in comparison.  Sometimes, you can precompute the short
circuit values, and then do a lookup in a table if you are computing and
recomputing them.  This is important if you are doing multiple
dereferences over and over.  These can be expensive in perl (a list of
hashes could be costly to dereference, though a list of lists could be
faster).

> 
> > 
> > > >>
> > > >>       # Optimization.
> > > >>
> > > >>       if ( $THRESH >
> > > >>              $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> > > >>
> > > >>         shift @hsps;
> > > >>         next TOP;
> > > >>       }
> > 
> > A shift or a similar operation (splice, push, etc) will likely trigger
> > the garbage collector in Perl, as well as the memory allocator
> 
> I see
> 
> 
> > I might suggest that instead of moving data around (splice, push, shift)
> > and causing allocation/GC events, that you might maintain several lists,
> > indicating which class the HSP's belong to.
> 
> Using a hash?

You could, though you would need to carefully think about whether or not
the hash deref (accessing the variable value) is more expensive than a
list index.

> I still think that I will have to loop through more than once...

You will probably have to.

> Maby if I just sort by Hsp_query-from and Hsp_e-value (using a sort vector
> of the form
> 
> @sort_vector_e_value 
>   = sort{ $hsps[$a]->{E_VALUE} <=> $hsps[$b]->{E_VALUE} } (0..$#hsps);

If you can do this, and localize it somehow, you might be able to (on
multiple CPUS) start from two different ends of the data structure in
different threads.

> I think sort is O(N*log(N)), so two sorts is still NlogN, then maby I
> can do the whole thing in one pass...
> 
> Thanks very much for your help,

No problem.  Have you used Devel::DProf by the way? There is nothing
like a good profile to identify code hotspots.  I usually recommend
these before working on improvements...

> Cheers,
> Dan.
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615