On Thu, 2003-08-28 at 04:07, Dan Bolser wrote: > Yup, but for each inner N, N gets smaller by one. is this still O(N^2) ? Unfortunately, yes. I think you have an extra factor of N distributed. The expression after factoring ... N * ( 1 + (1-1/N) + (1-2/N) + ... + 0) --> <-- N iterations thus you can write this as approximately N * ( N - sum of terms that look like m/N for m=1..N) As N gets large, the sum gets small so to leading order, you have O(N*N) or O(N^2). [...] > Hmmm, I am not sure how this relates to the above now I think about > it. Short circuits can improve the constant out front of the algorithms order. They can improve performance by avoiding unnecessary computation. You simply need the short circuit comparison time to be short compared to the calculation it is wrapping. The calculation needs to be expensive in comparison. Sometimes, you can precompute the short circuit values, and then do a lookup in a table if you are computing and recomputing them. This is important if you are doing multiple dereferences over and over. These can be expensive in perl (a list of hashes could be costly to dereference, though a list of lists could be faster). > > > > > > >> > > > >> # Optimization. > > > >> > > > >> if ( $THRESH > > > > >> $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){ > > > >> > > > >> shift @hsps; > > > >> next TOP; > > > >> } > > > > A shift or a similar operation (splice, push, etc) will likely trigger > > the garbage collector in Perl, as well as the memory allocator > > I see > > > > I might suggest that instead of moving data around (splice, push, shift) > > and causing allocation/GC events, that you might maintain several lists, > > indicating which class the HSP's belong to. > > Using a hash? You could, though you would need to carefully think about whether or not the hash deref (accessing the variable value) is more expensive than a list index. > I still think that I will have to loop through more than once... You will probably have to. > Maby if I just sort by Hsp_query-from and Hsp_e-value (using a sort vector > of the form > > @sort_vector_e_value > = sort{ $hsps[$a]->{E_VALUE} <=> $hsps[$b]->{E_VALUE} } (0..$#hsps); If you can do this, and localize it somehow, you might be able to (on multiple CPUS) start from two different ends of the data structure in different threads. > I think sort is O(N*log(N)), so two sorts is still NlogN, then maby I > can do the whole thing in one pass... > > Thanks very much for your help, No problem. Have you used Devel::DProf by the way? There is nothing like a good profile to identify code hotspots. I usually recommend these before working on improvements... > Cheers, > Dan. -- Joseph Landman, Ph.D Scalable Informatics LLC email: landman at scalableinformatics.com web: http://scalableinformatics.com phone: +1 734 612 4615