[Biodevelopers] Re: splice advice...

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Thu Aug 28 04:07:19 EDT 2003


Cheers...

On 27 Aug 2003, Joseph Landman wrote:
 
> Comments interspersed:
> 
> 
> > >>
> > >>   my @result;						# Final HSP's
> > >>
> > >>   TOP:while (@hsps){			  # NB: Ordered by Hsp_query-from
> > >>                                           # (for optimzation).
> > >>
> > >>     my $p = 0;	                          # Current HSP pointer.
> > >>
> > >>     MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.
> 
> These two loops give you something comparable to an O(N^2) algorithm. 
> That is, I can rewrite the while and the for as

Yup, but for each inner N, N gets smaller by one. is this still O(N^2) ?

i.e.  

N*N + N*(N-1) + N*(N-2) + ... + N*(N-(N-2)) + N*(N-(N-1)) + 0;

> > >>
> > >>      # Family overlap only!
> > >>
> > >>       next MID if
> > >>         $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};
> 
> Short circuits are good.

Yup, but this one is fundamental for the way the algorithm works,
I.e. The outer loop is called once per unique SCOP family in the 
HSP's for this protein...

Hmmm, I am not sure how this relates to the above now I think about
it.

> 
> > >>
> > >>       # Optimization.
> > >>
> > >>       if ( $THRESH >
> > >>              $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> > >>
> > >>         shift @hsps;
> > >>         next TOP;
> > >>       }
> 
> A shift or a similar operation (splice, push, etc) will likely trigger
> the garbage collector in Perl, as well as the memory allocator

I see


> I might suggest that instead of moving data around (splice, push, shift)
> and causing allocation/GC events, that you might maintain several lists,
> indicating which class the HSP's belong to.

Using a hash?

I still think that I will have to loop through more than once...

Maby if I just sort by Hsp_query-from and Hsp_e-value (using a sort vector
of the form

@sort_vector_e_value 
  = sort{ $hsps[$a]->{E_VALUE} <=> $hsps[$b]->{E_VALUE} } (0..$#hsps);

I think sort is O(N*log(N)), so two sorts is still NlogN, then maby I
can do the whole thing in one pass...

Thanks very much for your help,
Cheers,
Dan.




More information about the Biodevelopers mailing list