[Biococoa-dev] starting BCAlignment

Thu Mar 10 15:46:32 EST 2005

On 10-mrt-05, at 2:32, Koen van der Drift wrote:

> Hi,
>
> As I said before, I don't know much about coding alignments, and leave 
> that all up to you guys. I remember however, that Alex once mentioned 
> that he knows someone in his department who is very familiar with 
> alignment code. It could be an idea to see what he thinks about our 
> approach. Maybe he even wants to become a BioCocoa member :)
>
> - Koen.

Unfortunately, Serge isn't an expert on alignments, he is (in my eyes 
at least) a guru in programming complicated structural analysis 
programs, in working with huge datasets, difficult mathematics and 
great expertise in optimizing code using altivec, multiprocessors etc. 
Especially the latter are things we could benefit from in the case of 
alignments...
As I told some of you already, alignments have been the number one 
feature request for 4Peaks, and I have tried at least a dozen times to 
get it working. Two weeks ago, I finally managed to get it working and 
now it's a matter of implementing the GUI. And of course, now I tell 
people that I got it working they start asking already for contigs, 
aaarrrrgrrghghgghh.
Anyway, in the past year I've read and learned quite a bit of 
alignments, so let me share my knowledge for those not familiar. Now, 
again I'm not an expert so I'm fully confident that Philipp will 
correct me where I'm wrong and probably has some more things to add.
Basically there are two types of alignments, global and local.
It all started with the first, a global alignment, which aligns ALL 
symbols of sequence1 with sequence2. This algorithm generally used is 
the so-called Needleman-Wunsch algorithm after the geniuses that came 
up with the idea. So very simple, what you do is create a 2d matrix 
like this:
	//	    W   O   R   D   1
	//	
	//	W   0   0   0   0   0
	//	
	//	O   0   	
	//	
	//	R   0
	// 	
	//	D   0
	//	
	//	2   0

You start by filling the outer edges of the matrix with score 0, this 
is the setup phase.
Next , you start from the upper left with the socalled fill-phase in 
which you calculate the score for 1) a match 2) a shift in word1  3) a 
shift in word2
 From these 3 you pick the highest score leading to a direction 
(diagonal, left, up respectively)
This the allows you to calculate the score for the next position from 
the upper left corner, and like this you fill up the complete matrix
Then comes the so-called reverse-phase in which starting from the lower 
right corner you trace back via the directions set in the fill phase, 
which gives you the alignment.

Soon after, Smith and Waterman came up with a modification of this 
algorithm, the so-called local alignment. A few simple changes (like a 
score must always be larger than 0) removes the need to align all 
symbols, but instead the part that is most similar.

Now one of the biggest problems is quite obvious, as the aligned 
"words" get bigger the matrix and thus both the memory and time 
requirements increase quadratically. Rapidly becoming a disaster for 
your machine if these get to big. So a few clever heads came up with 
more modifications to work at least towards a sub-quadratic memory 
requirement, and this is where Philipp comes in ;-)

Now one thing more about matrices to explain John a bit more:
You can imagine that in the DNA world a (very simple) scoring scheme 
can be:
a match positive, e.g. +1
a mismatch negative, e.g. -1
A simple char comparison is all it takes to get the score.
But in the protein world there's more info as the change from aminoacid 
X to Y can be less or more important based on if they belong to the 
same chemical class or not. Based on analysis of mutations in many 
sequences, people have created substitution matrices with this point in 
mind (examples are PAM and BLOSUM). As for each score these matrices 
have to be accessed, for performance reasons they are usually of type 
int** (or char** but that's the same).

Perhaps this gives a bit of introduction in the terminology, again I 
rely on Philipp to make corrections wherever needed ;-)
I have attached an .m file illustrating the original Needleman-wunsch 
algorithm described above..
Cheers,
Alex

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 4669 bytes
Desc: not available
URL: <http://www.bioinformatics.org/pipermail/biococoa-dev/attachments/20050310/4ee16f8f/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: GlobalAlignment.m
Type: application/octet-stream
Size: 4269 bytes
Desc: not available
URL: <http://www.bioinformatics.org/pipermail/biococoa-dev/attachments/20050310/4ee16f8f/attachment.obj>
-------------- next part --------------

*********************************************************
                      ** Alexander Griekspoor **
*********************************************************
                The Netherlands Cancer Institute
                Department of Tumorbiology (H4)
          Plesmanlaan 121, 1066 CX, Amsterdam
                   Tel:  + 31 20 - 512 2023
                   Fax:  + 31 20 - 512 2029
                  AIM: mekentosj at mac.com
                   E-mail: a.griekspoor at nki.nl
               Web: http://www.mekentosj.com

    The requirements said: Windows 2000 or better.
    So I got a Macintosh.

*********************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 673 bytes
Desc: not available
URL: <http://www.bioinformatics.org/pipermail/biococoa-dev/attachments/20050310/4ee16f8f/attachment-0001.bin>