[BiO BB] Need fair alignment tool comparison/ using DSCAM for tool testing
larye at info-engineering-svc.com
Mon Feb 18 16:47:00 EST 2008
Mike Marchywka wrote:
> As I mentioned in previous posts, I'm using the Drosophila DSCAM genes for testing some tools.
> I assembled a fasta file composed of 3 fly entries,
> $ cat all_fasta | grep ">"
>>AF260530 Drosophila melanogaster Dscam gene, complete cds.
>>DQ317106 Drosophila yakuba Dscam gene, exons 3 through 24.
>>DQ317109 Drosophila pseudoobscura Dscam gene, exons 3 through 24.
> and tried aligning them with clustalw, but minutes later I still didn't have a result. I was wondering if
> someone could suggest a set of parameters or an alternative alignment tool to do a fast
> alignment, even if a bit sloppy. I had always used the slow/accurate approach and don't
> know what options may be available for faster work; these sequences are each about 50k long.
We have been using MUMmer3 (http://mummer.sourceforge.net) for rapid
alignments of whole genomes, genomes and contigs, and searching for
repeats and inverted repeats in multiple sequences. MUMmer is very fast
and has nucleotide and translated protein modes, as well as scatterplot
graphical output, so is very good for finding regions of high identity
in large sequences and graphically highlighting areas of interest.
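
As a rough illustration of what such a scatterplot shows, here is a minimal
Python sketch that collects the points of a dot plot from shared exact words
on both strands. This is not MUMmer's algorithm (MUMmer uses a suffix-tree
based search for maximal matches); the function names and the fixed word
length k are assumptions made for the example only.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_index(seq, k):
    """Map every word of length k in seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dotplot_points(ref, qry, k):
    """Return (ref_pos, qry_pos, strand) for every exact word shared by ref and qry.

    Strand "+" means the query word matches ref directly; strand "-" means
    its reverse complement does.
    """
    index = kmer_index(ref, k)
    points = []
    for j in range(len(qry) - k + 1):
        word = qry[j:j + k]
        for i in index.get(word, ()):
            points.append((i, j, "+"))
        for i in index.get(reverse_complement(word), ()):
            points.append((i, j, "-"))
    return points
```

Plotting ref_pos against qry_pos gives diagonal runs where the sequences
agree; "-" points fall along anti-diagonals and mark inverted regions.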
> In the meantime, I was able to get a satisfactory result using exact string matches using successively
> shorter and shorter strings. This approach yields acceptable results in under a minute and, if needed, you
> could segment the questionable areas and feed them to clustal or other tool for "better" alignment.
> It seems to be fast because it only compares sequences to a reference sequence (O(n*l^2), but "l" can be smaller
> than the sequence length, as unique features can be found in O(l*log(l))). There are, of course, likely to
> be various pathological cases, but for sequences known to be similar it seems to work ok, and the indexing
> feature allows extraction of substrings with particular distributions (occurring only once in each sample, for example).
> I have aligned two E. coli strains in perhaps a few minutes and there weren't any obvious pathological
> results (I obviously didn't check the whole thing, either by eye or programmatically).
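
The anchoring idea described above (exact matches on words that occur only
once in each sequence, retried with successively shorter words) can be
sketched in Python as follows. This is only a guess at the general approach,
not the code in string_test or mm_align_tool; the word lengths (16, 8, 4),
the function names, and the use of a longest-increasing-subsequence step to
keep anchors in order are all assumptions.

```python
from bisect import bisect_left
from collections import Counter


def unique_kmers(seq, k):
    """Map each word of length k that occurs exactly once in seq to its position."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(words)
    return {w: i for i, w in enumerate(words) if counts[w] == 1}


def collinear_chain(pairs):
    """Keep the longest subset of (ref_pos, qry_pos) increasing in both coordinates."""
    pairs = sorted(pairs)
    tails, tail_idx = [], []          # smallest chain-ending qry_pos per chain length
    prev = [None] * len(pairs)        # back-pointers for reconstructing the chain
    for idx, (_, q) in enumerate(pairs):
        n = bisect_left(tails, q)
        if n == len(tails):
            tails.append(q)
            tail_idx.append(idx)
        else:
            tails[n] = q
            tail_idx[n] = idx
        prev[idx] = tail_idx[n - 1] if n > 0 else None
    chain = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        chain.append(pairs[idx])
        idx = prev[idx]
    return chain[::-1]


def anchor(ref, qry, ks=(16, 8, 4), r0=0, q0=0):
    """Return ordered (ref_pos, qry_pos, k) anchors, trying shorter words in the gaps."""
    if not ks or not ref or not qry:
        return []
    k = ks[0]
    ref_u, qry_u = unique_kmers(ref, k), unique_kmers(qry, k)
    shared = ref_u.keys() & qry_u.keys()
    chain = collinear_chain([(ref_u[w], qry_u[w]) for w in shared])
    anchors, prev_r, prev_q = [], 0, 0
    for r, q in chain:
        # Refine the unanchored gap before this anchor with the next shorter word.
        anchors += anchor(ref[prev_r:r], qry[prev_q:q], ks[1:], r0 + prev_r, q0 + prev_q)
        anchors.append((r0 + r, q0 + q, k))
        prev_r, prev_q = r + k, q + k
    # Refine whatever is left after the last anchor.
    anchors += anchor(ref[prev_r:], qry[prev_q:], ks[1:], r0 + prev_r, q0 + prev_q)
    return anchors
```

Regions left without anchors after the shortest word length are the
"questionable areas" that could be cut out and handed to clustal or another
tool for a more careful alignment.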
> Others have asked about testing methods; I'd like to show how I'm going about this with the DSCAM example.
> The alignment is only one part of a more general interest in finding similar/different features between samples.
> These sequences, it turns out, have exon locations in the NCBI entries. So it was pretty easy to check the alignments
> by examining the locations of the exons in the aligned composite. In this case, I aligned as follows,
> I'm aware of the following related alignment literature, open to ideas:
> $ string_test -about|unix2dos >/dev/clipboard
> Contact: marchywka at hotmail.com Nov 2007
> Comment: uses some indexing to get speed up,
> Comment: motivation for RC rules from this etc ,
> Commment: and should work well on text or (modified slightly ) binary code too
> Note: More code in mm_align_tool
> Note: Based loosely on references such as these but 'common sense'
> Note: seemed to work well as these are after-the-fact lookups
> Ref: http://www.google.com/search?hl=en&safe=off&q=string+alignment+site%3Aciteseer.ist.psu.edu
> Ref: http://citeseer.ist.psu.edu/csuros05rapid.html
> Comment: Csuros, M., Ma, B.: Rapid homology search with two-stage extension and
> Comment: daughter seeds. In: Proc. 11th Int. Computing and Combinatorics Conf. (COCOON).
> Comment: Volume 3595 of LNCS., Springer-Verlag (2005) 104-- 114
> Ref: http://citeseer.ist.psu.edu/468459.html
> Ref: http://citeseer.ist.psu.edu/kahveci04speeding.html
> Feb 2 2008 09:35:40 string_test.h182
> Mike Marchywka
> 586 Saint James Walk
> Marietta GA 30067-7165
> 404-788-1216 (C)<- leave message
> 989-348-4796 (P)<- emergency only
> marchywka at hotmail.com
> Note: Hotmail is blocking my mom's entire
> ISP claiming it is to reduce spam but probably
> to force users to use hotmail. Please DON'T
> assume I am ignoring you and try
> me on marchywka at yahoo.com if no reply
> here. Thanks.
Larye D. Parkins
Information Engineering Services
PMB 435, 610 N. 1st St., Ste 5
Hamilton, MT 59840
Making IT work since 1965.
Member of: ACM, IEEE Computer Society, USENIX, SAGE, LOPSA