On 10-mrt-05, at 2:32, Koen van der Drift wrote: Hi, As I said before, I don't know much about coding alignments, and leave that all up to you guys. I remember however, that Alex once mentioned that he knows someone in his department who is very familiar with alignment code. It could be an idea to see what he thinks about our approach. Maybe he even wants to become a BioCocoa member :) - Koen. Unfortunately, Serge isn't an expert on alignments, he is (in my eyes at least) a guru in programming complicated structural analysis programs, in working with huge datasets, difficult mathematics and great expertise in optimizing code using altivec, multiprocessors etc. Especially the latter are things we could benefit from in the case of alignments... As I told some of you already, alignments have been the number one feature request for 4Peaks, and I have tried at least a dozen times to get it working. Two weeks ago, I finally managed to get it working and now it's a matter of implementing the GUI. And of course, now I tell people that I got it working they start asking already for contigs, aaarrrrgrrghghgghh. Anyway, in the past year I've read and learned quite a bit of alignments, so let me share my knowledge for those not familiar. Now, again I'm not an expert so I'm fully confident that Philipp will correct me where I'm wrong and probably has some more things to add. Basically there are two types of alignments, global and local. It all started with the first, a global alignment, which aligns ALL symbols of sequence1 with sequence2. This algorithm generally used is the so-called Needleman-Wunsch algorithm after the geniuses that came up with the idea. So very simple, what you do is create a 2d matrix like this: 2321,6E6C,2523// W O R D 1 2321,6E6C,2523// 2321,6E6C,2523// W 0 0 0 0 0 2321,6E6C,2523// 2321,6E6C,2523// O 0 2321,6E6C,2523// 2321,6E6C,2523// R 0 2321,6E6C,2523// 2321,6E6C,2523// D 0 2321,6E6C,2523// 2321,6E6C,2523// 2 0 You start by filling the outer edges of the matrix with score 0, this is the setup phase. Next , you start from the upper left with the socalled fill-phase in which you calculate the score for 1) a match 2) a shift in word1 3) a shift in word2 >From these 3 you pick the highest score leading to a direction (diagonal, left, up respectively) This the allows you to calculate the score for the next position from the upper left corner, and like this you fill up the complete matrix Then comes the so-called reverse-phase in which starting from the lower right corner you trace back via the directions set in the fill phase, which gives you the alignment. Soon after, Smith and Waterman came up with a modification of this algorithm, the so-called local alignment. A few simple changes (like a score must always be larger than 0) removes the need to align all symbols, but instead the part that is most similar. Now one of the biggest problems is quite obvious, as the aligned "words" get bigger the matrix and thus both the memory and time requirements increase quadratically. Rapidly becoming a disaster for your machine if these get to big. So a few clever heads came up with more modifications to work at least towards a sub-quadratic memory requirement, and this is where Philipp comes in ;-) Now one thing more about matrices to explain John a bit more: You can imagine that in the DNA world a (very simple) scoring scheme can be: a match positive, e.g. +1 a mismatch negative, e.g. -1 A simple char comparison is all it takes to get the score. But in the protein world there's more info as the change from aminoacid X to Y can be less or more important based on if they belong to the same chemical class or not. Based on analysis of mutations in many sequences, people have created substitution matrices with this point in mind (examples are PAM and BLOSUM). As for each score these matrices have to be accessed, for performance reasons they are usually of type int** (or char** but that's the same). Perhaps this gives a bit of introduction in the terminology, again I rely on Philipp to make corrections wherever needed ;-) I have attached an .m file illustrating the original Needleman-wunsch algorithm described above.. Cheers, Alex