[BiO BB] Need fair alignment tool comparison/ using DSCAM for tool testing

Tue Feb 19 13:46:48 EST 2008

On Tue, 19 Feb 2008, Mike Marchywka wrote:

...    
> Do you have actual timing tests for various complete tasks or is 17
> seconds about it?  

Aligning 172 sequences totaling 1.2MB with each other (average ~7000 bases
each, longest 22,830 bases):

real    3m17.458s
user    3m2.385s
sys     0m9.294s

MUMmer output generated 693 alignment files, out of the possible 29584
combinations of sequences, with alignment lengths ranging from about 90
bases to 22830 (the longest with itself).

The process used the 1.2MB multi-sequence file as both reference and
query for 'nucmer,' then ran 'show-coords' to generate a delta file.  The
majority of the run-time was spent parsing the delta file and generating
the alignments from sequence pairs with significant alignments, using
'show-aligns.'

The final step was to generate a Postscript scatterplot.  I later rewrote
that part to generate 16 separate plots with 43x43 sequences each to make
them readable on standard paper size.

---cut---
#!/usr/bin/env perl -w
$pref="all";
$ref=$ARGV[0];
$qry=$ref;
system("nucmer --prefix=${pref} $ref $qry");
system("show-coords -rcl ${pref}.delta > ${pref}.coords");
open(DELTA,"<${pref}.delta");

while (<DELTA>) {
    next if $_ !~ /^>/;
    @inlin = split(/ /,$_);
    $inlin[0] =~ y/>//d;
    print STDERR "Processing $inlin[0] -> $inlin[1]\n";
    system("show-aligns ${pref}.delta $inlin[0] $inlin[1] >
${inlin[0]}_${inlin[1]}.aligns");
}

system("delta-filter -q -r ${pref}.delta > ${pref}.filter");
system("mummerplot ${pref}.delta -R $ref -Q $qry --layout -p ${pref} -S -t
postscript");
---cut---

Example alignment file output:

-- BEGIN alignment [ +1 89 - 187 | -1 11054 - 10956 ]

89         gccatcgcagagcttcgctaagctcactgaacgacagcagcagtatgct
11054      gccatcgcagagcttcgctaagctcactgagcggcagcagcagtatgct
                                         ^  ^

138        acgttcctctccctcgccgcctttgctggagcccccgtcctcttcgatc
11005      acgttcctctccctcgccgcctttgctggagcccccgtcctcttcgatc

187        a
10956      a

--   END alignment [ +1 89 - 187 | -1 11054 - 10956 ]

In this case, the alignment is forward versus reverse strands, with two
SNPs detected.

System: Sun Blade 2000 (2x900MHz SPARC), Solaris 10;
MUMmer 3.20, compiled 32-bit.

--
Larye D. Parkins
Information Engineering Services
PMB 435, 610 N. 1st St., Ste 5
Hamilton, MT 59840
http://www.info-engineering-svc.com

Making IT work since 1965.
Member of: ACM, IEEE Computer Society, USENIX, SAGE, LOPSA

On Tue, 19 Feb 2008, Mike Marchywka wrote:

> 
> > We have been using MUMmer3 (http://mummer.sourceforge.net) for rapid
> > alignments of whole genomes, genomes and contigs, and searching for
> 
> Thanks- that looks like a good tool that I didn't know about. I
> noticed they advertize e coli results prompting me to go back and
> check my own. I'd have to go check the suffix tree literature to see
> what exactly they claim to do in 17 seconds on e coli, but under
> cygwin, I was able to index all matching strings of length 25 or more,
> in about 67 seconds ,
> 
> $ date;$progpath/string_test -fastas both_fasta -index 8 -length 25
> -fix 12 -output 3 -filterN -filterID -status -fcompare_all> anchors
> ;date Sat Nov 10 18:45:23 EST 2007 string_test.cpp177 loaded 2 fastas
> Sat Nov 10 18:46:30 EST 2007
> 
> 
> and create a coarse alignment in another 25 seconds,
> 
> $ date; $progpath/mm_align_tool -fastas both_fasta -v -pair_rules
> anchors -doall -pair_align 0 -output text> align1 ;date Sat Nov 10
> 18:50:01 EST 2007 mm_hit_classes.h389 annotation_model.h57 Loaded
> 33373 pair rules. mm_align_tool.cpp309 Doing string PAIR align with
> cutoff 3 mm_align_tool.h227 do_all with only one rule, did you mean
> -mrules? mm_align_tool.cpp318 doing 0 vs 1 mm_align_tool.cpp326 do hit
> dump rules Sat Nov 10 18:50:26 EST 2007
> 
> 
> Do you have actual timing tests for various complete tasks or is 17
> seconds about it?  So, ok 67+25=92 seconds is not real impressive
> compared to 17, and I'm not sure how much I can blame cygwin for this
> :) I guess once I'm sure I have a useful algorithm, I can subtract IO
> time which has been significant in many cases. Someone also privately
> suggested blast's bl2seq and I would point out that this is quite fast
> on pairs of 50k sequences.
> 
> 
> 
> 
> Mike Marchywka
> 586 Saint James Walk
> Marietta GA 30067-7165
> 404-788-1216 (C)<- leave message
> 989-348-4796 (P)<- emergency only
> marchywka at hotmail.com
> Note: Hotmail is blocking my mom's entire
> ISP claiming it is to reduce spam but probably
> to force users to use hotmail. Please DON'T
> assume I am ignoring you and try
> me on marchywka at yahoo.com if no reply
> here. Thanks.
> 
> 
> _________________________________________________________________
> Shed those extra pounds with MSN and The Biggest Loser!
> http://biggestloser.msn.com/
> 
>