[Biococoa-dev] starting BCAlignment

Thu Mar 10 16:05:55 EST 2005

Am 10.03.2005 um 21:46 schrieb Alexander Griekspoor:

> On 10-mrt-05, at 2:32, Koen van der Drift wrote:
>
>> Hi,
>>
>> As I said before, I don't know much about coding alignments, and 
>> leave that all up to you guys. I remember however, that Alex once 
>> mentioned that he knows someone in his department who is very 
>> familiar with alignment code. It could be an idea to see what he 
>> thinks about our approach. Maybe he even wants to become a BioCocoa 
>> member :)
>>
>> - Koen.
>
> Unfortunately, Serge isn't an expert on alignments, he is (in my eyes 
> at least) a guru in programming complicated structural analysis 
> programs, in working with huge datasets, difficult mathematics and 
> great expertise in optimizing code using altivec, multiprocessors etc. 
> Especially the latter are things we could benefit from in the case of 
> alignments...
> As I told some of you already, alignments have been the number one 
> feature request for 4Peaks, and I have tried at least a dozen times to 
> get it working. Two weeks ago, I finally managed to get it working and 
> now it's a matter of implementing the GUI. And of course, now I tell 
> people that I got it working they start asking already for contigs, 
> aaarrrrgrrghghgghh.
> Anyway, in the past year I've read and learned quite a bit of 
> alignments, so let me share my knowledge for those not familiar. Now, 
> again I'm not an expert so I'm fully confident that Philipp will 
> correct me where I'm wrong and probably has some more things to add.
> Basically there are two types of alignments, global and local.
> It all started with the first, a global alignment, which aligns ALL 
> symbols of sequence1 with sequence2. This algorithm generally used is 
> the so-called Needleman-Wunsch algorithm after the geniuses that came 
> up with the idea. So very simple, what you do is create a 2d matrix 
> like this:
> 	//	    W   O   R   D   1
> 	//	
> 	//	W   0   0   0   0   0
> 	//	
> 	//	O   0   	
> 	//	
> 	//	R   0
> 	// 	
> 	//	D   0
> 	//	
> 	//	2   0
>
> You start by filling the outer edges of the matrix with score 0, this 
> is the setup phase.
> Next , you start from the upper left with the socalled fill-phase in 
> which you calculate the score for 1) a match 2) a shift in word1  3) a 
> shift in word2
> From these 3 you pick the highest score leading to a direction 
> (diagonal, left, up respectively)
> This the allows you to calculate the score for the next position from 
> the upper left corner, and like this you fill up the complete matrix
> Then comes the so-called reverse-phase in which starting from the 
> lower right corner you trace back via the directions set in the fill 
> phase, which gives you the alignment.
>
this part is also known as dynamic programming ( for those who want to 
google for more information )

> Soon after, Smith and Waterman came up with a modification of this 
> algorithm, the so-called local alignment. A few simple changes (like a 
> score must always be larger than 0) removes the need to align all 
> symbols, but instead the part that is most similar.
>
this is exactly the same approach except from one thing, you look for 
the subpath with the highest score.

> Now one of the biggest problems is quite obvious, as the aligned 
> "words" get bigger the matrix and thus both the memory and time 
> requirements increase quadratically. Rapidly becoming a disaster for 
> your machine if these get to big. So a few clever heads came up with 
> more modifications to work at least towards a sub-quadratic memory 
> requirement, and this is where Philipp comes in ;-)
>
ok here i'm. what should i say, you're again completly right ;-). To 
get subquadratic is as far as i know only possible with heuristics, but 
in this case you loose accuracy. I think we should first implement the 
basic algorithms.

> Now one thing more about matrices to explain John a bit more:
> You can imagine that in the DNA world a (very simple) scoring scheme 
> can be:
> a match positive, e.g. +1
> a mismatch negative, e.g. -1
> A simple char comparison is all it takes to get the score.
> But in the protein world there's more info as the change from 
> aminoacid X to Y can be less or more important based on if they belong 
> to the same chemical class or not. Based on analysis of mutations in 
> many sequences, people have created substitution matrices with this 
> point in mind (examples are PAM and BLOSUM). As for each score these 
> matrices have to be accessed, for performance reasons they are usually 
> of type int** (or char** but that's the same).
>
I think we should use a int* instead of int** because its faster. Take 
a look at my BCScoringMatrix.

> Perhaps this gives a bit of introduction in the terminology, again I 
> rely on Philipp to make corrections wherever needed ;-)
> I have attached an .m file illustrating the original Needleman-wunsch 
> algorithm described above..
> Cheers,
> Alex
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 5370 bytes
Desc: not available
URL: <http://www.bioinformatics.org/pipermail/biococoa-dev/attachments/20050310/d89f75db/attachment.bin>