[Biococoa-dev] starting BCAlignment
Philipp Seibel
biococoa at bioworxx.com
Thu Mar 10 16:05:55 EST 2005
Am 10.03.2005 um 21:46 schrieb Alexander Griekspoor:
> On 10-mrt-05, at 2:32, Koen van der Drift wrote:
>
>> Hi,
>>
>> As I said before, I don't know much about coding alignments, and
>> leave that all up to you guys. I remember however, that Alex once
>> mentioned that he knows someone in his department who is very
>> familiar with alignment code. It could be an idea to see what he
>> thinks about our approach. Maybe he even wants to become a BioCocoa
>> member :)
>>
>> - Koen.
>
> Unfortunately, Serge isn't an expert on alignments, he is (in my eyes
> at least) a guru in programming complicated structural analysis
> programs, in working with huge datasets, difficult mathematics and
> great expertise in optimizing code using altivec, multiprocessors etc.
> Especially the latter are things we could benefit from in the case of
> alignments...
> As I told some of you already, alignments have been the number one
> feature request for 4Peaks, and I have tried at least a dozen times to
> get it working. Two weeks ago, I finally managed to get it working and
> now it's a matter of implementing the GUI. And of course, now I tell
> people that I got it working they start asking already for contigs,
> aaarrrrgrrghghgghh.
> Anyway, in the past year I've read and learned quite a bit of
> alignments, so let me share my knowledge for those not familiar. Now,
> again I'm not an expert so I'm fully confident that Philipp will
> correct me where I'm wrong and probably has some more things to add.
> Basically there are two types of alignments, global and local.
> It all started with the first, a global alignment, which aligns ALL
> symbols of sequence1 with sequence2. This algorithm generally used is
> the so-called Needleman-Wunsch algorithm after the geniuses that came
> up with the idea. So very simple, what you do is create a 2d matrix
> like this:
> // W O R D 1
> //
> // W 0 0 0 0 0
> //
> // O 0
> //
> // R 0
> //
> // D 0
> //
> // 2 0
>
> You start by filling the outer edges of the matrix with score 0, this
> is the setup phase.
> Next , you start from the upper left with the socalled fill-phase in
> which you calculate the score for 1) a match 2) a shift in word1 3) a
> shift in word2
> From these 3 you pick the highest score leading to a direction
> (diagonal, left, up respectively)
> This the allows you to calculate the score for the next position from
> the upper left corner, and like this you fill up the complete matrix
> Then comes the so-called reverse-phase in which starting from the
> lower right corner you trace back via the directions set in the fill
> phase, which gives you the alignment.
>
this part is also known as dynamic programming ( for those who want to
google for more information )
> Soon after, Smith and Waterman came up with a modification of this
> algorithm, the so-called local alignment. A few simple changes (like a
> score must always be larger than 0) removes the need to align all
> symbols, but instead the part that is most similar.
>
this is exactly the same approach except from one thing, you look for
the subpath with the highest score.
> Now one of the biggest problems is quite obvious, as the aligned
> "words" get bigger the matrix and thus both the memory and time
> requirements increase quadratically. Rapidly becoming a disaster for
> your machine if these get to big. So a few clever heads came up with
> more modifications to work at least towards a sub-quadratic memory
> requirement, and this is where Philipp comes in ;-)
>
ok here i'm. what should i say, you're again completly right ;-). To
get subquadratic is as far as i know only possible with heuristics, but
in this case you loose accuracy. I think we should first implement the
basic algorithms.
> Now one thing more about matrices to explain John a bit more:
> You can imagine that in the DNA world a (very simple) scoring scheme
> can be:
> a match positive, e.g. +1
> a mismatch negative, e.g. -1
> A simple char comparison is all it takes to get the score.
> But in the protein world there's more info as the change from
> aminoacid X to Y can be less or more important based on if they belong
> to the same chemical class or not. Based on analysis of mutations in
> many sequences, people have created substitution matrices with this
> point in mind (examples are PAM and BLOSUM). As for each score these
> matrices have to be accessed, for performance reasons they are usually
> of type int** (or char** but that's the same).
>
I think we should use a int* instead of int** because its faster. Take
a look at my BCScoringMatrix.
> Perhaps this gives a bit of introduction in the terminology, again I
> rely on Philipp to make corrections wherever needed ;-)
> I have attached an .m file illustrating the original Needleman-wunsch
> algorithm described above..
> Cheers,
> Alex
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 5370 bytes
Desc: not available
URL: <http://www.bioinformatics.org/pipermail/biococoa-dev/attachments/20050310/d89f75db/attachment.bin>
More information about the Biococoa-dev
mailing list