On 10-mrt-05, at 2:32, Koen van der Drift wrote:
Hi,
As I said before, I don't know much about coding alignments, and leave
that all up to you guys. I remember however, that Alex once mentioned
that he knows someone in his department who is very familiar with
alignment code. It could be an idea to see what he thinks about our
approach. Maybe he even wants to become a BioCocoa member :)
- Koen.
Unfortunately, Serge isn't an expert on alignments, he is (in my eyes
at least) a guru in programming complicated structural analysis
programs, in working with huge datasets, difficult mathematics and
great expertise in optimizing code using altivec, multiprocessors etc.
Especially the latter are things we could benefit from in the case of
alignments...
As I told some of you already, alignments have been the number one
feature request for 4Peaks, and I have tried at least a dozen times to
get it working. Two weeks ago, I finally managed to get it working and
now it's a matter of implementing the GUI. And of course, now I tell
people that I got it working they start asking already for contigs,
aaarrrrgrrghghgghh.
Anyway, in the past year I've read and learned quite a bit of
alignments, so let me share my knowledge for those not familiar. Now,
again I'm not an expert so I'm fully confident that Philipp will
correct me where I'm wrong and probably has some more things to add.
Basically there are two types of alignments, global and local.
It all started with the first, a global alignment, which aligns ALL
symbols of sequence1 with sequence2. This algorithm generally used is
the so-called Needleman-Wunsch algorithm after the geniuses that came
up with the idea. So very simple, what you do is create a 2d matrix
like this:
2321,6E6C,2523// W O R D 1
2321,6E6C,2523//
2321,6E6C,2523// W 0 0 0 0 0
2321,6E6C,2523//
2321,6E6C,2523// O 0
2321,6E6C,2523//
2321,6E6C,2523// R 0
2321,6E6C,2523//
2321,6E6C,2523// D 0
2321,6E6C,2523//
2321,6E6C,2523// 2 0
You start by filling the outer edges of the matrix with score
0, this is the setup phase.
Next , you start from the upper left with the socalled fill-phase in
which you calculate the score for 1) a match 2) a shift in word1 3) a
shift in word2
>From these 3 you pick the highest score leading to a direction
(diagonal, left, up respectively)
This the allows you to calculate the score for the next position from
the upper left corner, and like this you fill up the complete matrix
Then comes the so-called reverse-phase in which starting from the
lower right corner you trace back via the directions set in the fill
phase, which gives you the alignment.
Soon after, Smith and Waterman came up with a modification of this
algorithm, the so-called local alignment. A few simple changes (like a
score must always be larger than 0) removes the need to align all
symbols, but instead the part that is most similar.
Now one of the biggest problems is quite obvious, as the aligned
"words" get bigger the matrix and thus both the memory and time
requirements increase quadratically. Rapidly becoming a disaster for
your machine if these get to big. So a few clever heads came up with
more modifications to work at least towards a sub-quadratic memory
requirement, and this is where Philipp comes in ;-)
Now one thing more about matrices to explain John a bit more:
You can imagine that in the DNA world a (very simple) scoring scheme
can be:
a match positive, e.g. +1
a mismatch negative, e.g. -1
A simple char comparison is all it takes to get the score.
But in the protein world there's more info as the change from
aminoacid X to Y can be less or more important based on if they belong
to the same chemical class or not. Based on analysis of mutations in
many sequences, people have created substitution matrices with this
point in mind (examples are PAM and BLOSUM). As for each score these
matrices have to be accessed, for performance reasons they are usually
of type int** (or char** but that's the same).
Perhaps this gives a bit of introduction in the terminology, again I
rely on Philipp to make corrections wherever needed ;-)
I have attached an .m file illustrating the original Needleman-wunsch
algorithm described above..
Cheers,
Alex