[Biodevelopers] Re: splice advice...

Joseph Landman landman at scalableinformatics.com
Wed Aug 27 18:53:36 EDT 2003


On Wed, 2003-08-27 at 17:30, Dan Bolser wrote:
> Joseph Landman said:
> > Hi Dan:
> >
> >   I am assuming that your arrays contain not HSP's, but some sort of
> > object representing the HSP ala BioPerl.  Is this correct?
> >
> 
> Nope, just a simple hash, key => value,
> I.E.
> 
> {
>   hsp_hit-from => 5,
>   hsp_hit-to   => 10,
>   score        => 20,
> }
> 

Ok.

Comments interspersed:


> >> __SKIP__
> >>
> >> preamble
> >>
> >> @hsps = array of HSP hashes, for a particular protein
> >> each HSP can be from several SCOP sequences.
> >>
> >> __RESUME__
> >>
> >>   my @result;						# Final HSP's
> >>
> >>   TOP:while (@hsps){			  # NB: Ordered by Hsp_query-from
> >>                                           # (for optimzation).
> >>
> >>     my $p = 0;	                          # Current HSP pointer.
> >>
> >>     MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.

These two loops give you something comparable to an O(N^2) algorithm. 
That is, I can rewrite the while and the for as

	foreach my $hsp (@hsps) # Outer O(N): 1 iteration per entry
  	  foreach my $index (0 .. $#hsps) # Inner O(N): 1 iteration  per
entry




> >>
> >>      # Family overlap only!
> >>
> >>       next MID if
> >>         $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};

Short circuits are good.

> >>
> >>       # Optimization.
> >>
> >>       if ( $THRESH >
> >>              $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> >>
> >>         shift @hsps;
> >>         next TOP;
> >>       }

A shift or a similar operation (splice, push, etc) will likely trigger
the garbage collector in Perl, as well as the memory allocator

> >>
> >>       # Pick best of pair (removing the other from the list).
> >>
> >>       if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){
> >>         splice (@hsps, $p, 1);
> >>         $j--;
> >>         $p = $j;
> >>       }
> >>       else {
> >>         splice (@hsps, $j, 1);
> >>         $j--;
> >>       }
> >>     }
> >>     push @result, splice(@hsps, $p, 1);
> >>   }
> >>   print "OK\n\n";

I might suggest that instead of moving data around (splice, push, shift)
and causing allocation/GC events, that you might maintain several lists,
indicating which class the HSP's belong to.


-- 
Joseph Landman, Ph.D
Scalable Informatics LLC
email: landman at scalableinformatics.com
  web: http://scalableinformatics.com
phone: +1 734 612 4615





More information about the Biodevelopers mailing list