[Biococoa-dev] BCSequenceCluster

Charles Parnot charles.parnot at gmail.com
Sat Oct 8 00:20:38 EDT 2005

Thanks for the explanation about sequence clustering. This is  
actually more of a data compression trick, which would go on the  
implementation side. Again, I think we should worry about the
implementation only after we decide on the interface and the needed classes.

Thanks, Peter, for the code example; this is the kind of stuff I was
thinking about. For the sake of simplicity, shouldn't we have a
single class for such related concepts as "BCSequenceGroup",
"BCSequenceCluster", and "BCSequenceAlignment", or even, to throw in
one more, "BCSequenceContig"? For all of these, we are just talking
about a bunch of sequences positioned in a specific way with respect to
each other. I just want to make sure I am not missing something.
The conceptual differences are quite small, and I think to make the  
framework really taste like Cocoa, it has to have a simple interface.  
In fact, this is really the essence of what I was wondering about.  
How many classes do we need for all these related concepts? Is one  

Regarding the implementation, I would tend to prefer using gaps,
because we already have the BCSymbol, we can easily generate
sequences without them (good suggestion, Peter), and gaps are easy to
use for display, comparisons, and so on. Using offsets, on the other
hand, makes a task like "get the symbols at position 827 of all the
sequences" a bit harder (basically trading simplicity and speed for
data compression). But in the end, I would be happy with any
implementation, as long as it works, particularly because I am
probably not going to be doing much of it!
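
To make the gaps-versus-offsets trade-off concrete, here is a minimal
sketch. It is not BioCocoa code: NSString stands in for BCSequence,
'-' stands in for the gap symbol, the two function names are made up,
and "offsets" is assumed to mean a sorted list of gap runs given in
alignment coordinates.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helpers, not part of BioCocoa. NSString stands in for
// BCSequence and '-' for the gap symbol.

// Gapped storage: gaps are stored inline, so the symbol at an alignment
// position is a direct lookup in each row. Assumes position is within
// the length of every row.
static NSString *ColumnFromGapped(NSArray *gappedRows, NSUInteger position)
{
    NSMutableString *column = [NSMutableString string];
    for (NSString *row in gappedRows) {
        [column appendString:[row substringWithRange:NSMakeRange(position, 1)]];
    }
    return column;
}

// Offset storage: the sequence is stored without gaps, plus a list of
// gap runs (NSValue-wrapped NSRange, sorted, in alignment coordinates).
// The alignment position must first be translated into an index in the
// ungapped sequence by walking the runs.
static NSString *SymbolFromOffsets(NSString *ungapped, NSArray *gapRuns,
                                   NSUInteger position)
{
    NSUInteger gapsBefore = 0;
    for (NSValue *value in gapRuns) {
        NSRange run = [value rangeValue];
        if (position < run.location) {
            break;                      // remaining runs lie to the right
        }
        if (position < NSMaxRange(run)) {
            return @"-";                // position falls inside a gap run
        }
        gapsBefore += run.length;       // run lies entirely to the left
    }
    NSUInteger index = position - gapsBefore;
    if (index >= [ungapped length]) {
        return @"-";                    // past the end: trailing gap
    }
    return [ungapped substringWithRange:NSMakeRange(index, 1)];
}
```

With gapped rows, "position 827 of all the sequences" is one lookup per
row. With gap runs, every lookup has to walk the runs first, which is
the simplicity and speed given up in exchange for compact storage.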

So, just one class?
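
As a rough illustration of what a single class could look like, here
is a sketch under stated assumptions. The class name SketchSequenceSet
and all of its methods are hypothetical and do not come from the
BioCocoa headers; NSString stands in for BCSequence and '-' for the gap
symbol. It uses manual retain/release and must be compiled without ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch, not BioCocoa code. Manual retain/release;
// compile without ARC.
@interface SketchSequenceSet : NSObject
{
    NSArray *rows;  // one string per sequence, gaps stored inline
}
- (id)initWithSequences:(NSArray *)sequences;
- (NSUInteger)count;
- (NSString *)sequenceAtIndex:(NSUInteger)index;
- (NSString *)ungappedSequenceAtIndex:(NSUInteger)index;
- (BOOL)isAligned;
@end

@implementation SketchSequenceSet

- (id)initWithSequences:(NSArray *)sequences
{
    self = [super init];
    if (self != nil) {
        rows = [sequences copy];
    }
    return self;
}

- (void)dealloc
{
    [rows release];
    [super dealloc];
}

- (NSUInteger)count
{
    return [rows count];
}

// The row as stored, gaps included.
- (NSString *)sequenceAtIndex:(NSUInteger)index
{
    return [rows objectAtIndex:index];
}

// The 'real' sequence: the stored row with every gap symbol removed.
- (NSString *)ungappedSequenceAtIndex:(NSUInteger)index
{
    NSString *row = [rows objectAtIndex:index];
    return [row stringByReplacingOccurrencesOfString:@"-" withString:@""];
}

// Here a plain group and an alignment differ only in whether all rows
// share one length. An empty set counts as aligned.
- (BOOL)isAligned
{
    NSUInteger i, n = [rows count];
    for (i = 1; i < n; i++) {
        NSString *first = [rows objectAtIndex:0];
        NSString *row = [rows objectAtIndex:i];
        if ([row length] != [first length]) {
            return NO;
        }
    }
    return YES;
}

@end
```

In this sketch the group, the alignment, and the contig are the same
object; what changes is only whether the rows have been padded with
gaps to a common length.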


On Oct 7, 2005, at 1:53 AM, Peter Schols wrote:

> Hi Charles and Koen,
> First of all: I'm not a sequence cluster expert and I have never  
> used it for my own research.
> As far as I know, sequence clustering is a way to align homologous  
> sequences, detect regions of high sequence similarity and describe  
> the differences between sequences. This is especially useful to  
> create non-redundant datasets: for example, if you align
> hemoglobin genes for 5 mammals, you end up with almost 5 times
> the same information. So if you store just one of the five
> sequences plus the differences between this sequence (the
> referenceSequence) and the other 4 sequences, you save a lot
> of memory and disk space (maybe not for this example, but on a
> genomic scale) without losing any information. It's highly  
> comparable to JPEG compression.
> Back to BC: I don't think we need this cluster functionality in  
> BioCocoa, at least for now. For the I/O methods we only need a good  
> container class for BCSequences as you are pointing out. With this  
> approach we would store the gaps directly inside the BCSequences  
> (using the gap symbol). This is definitely the easiest way to  
> implement it.
> The only thing we would need to do when going this route - to  
> compensate for the fact that sequences don't have gaps in reality -  
> is that we should add a method to BCSequence that returns the  
> sequence without gaps (the 'real' sequence).
> To answer Charles' question: right now, the only purpose for the  
> BCSequenceGroup would be to make I/O easier. But in the future, we  
> could add extra methods to this class to enable alignment of the  
> BCSequenceGroup (using BCAlignment) or to return a list of shared  
> indels, for example. This BCSequenceGroup could also be the perfect  
> class to pass as an argument to classes that do phylogenetic analysis.
> So in the future, we could do things like:
> BCSequenceGroup *group = [BCSequenceGroup groupWithFile:@"myFastaFile.fst"];
> [group align];
> BCPhylogeneticTree *tree = [group analyzeUsingHeuristicSearchWithReplicates:1000];
> Cheers,
> Peter
> On 06 Oct 2005, at 23:45, Charles Parnot wrote:
>> At this point, given that I don't know that much about the fine  
>> details of sequence clusters and sequence groups, could you,  
>> Peter, take some time to explain exactly what the concept is, and  
>> also maybe come up with some examples of what it can be used for  
>> and how a user of the framework would want to use it. This way, we  
>> can define a header that does the job, and then worry about the  
>> implementation. In fact, I should have asked that question in the  
>> first place instead of pretending I understood what it was all about!
>> Sorry, maybe this is quite a broad question. We don't have to go too
>> deep at this point, as we merely want some I/O to work. However,  
>> in 'I/O', there is 'O' for output, so the question is: after  
>> loading a sequence from disk, what information will the user want  
>> to retrieve? Or will she just want to perform some operations on  
>> the sequence group and then move on?
>> cheers,
>> charles
>> On Oct 6, 2005, at 5:26 AM, Peter Schols wrote:
>>> Hi Koen,
>>>> In the case where the input sequences are already aligned, we're  
>>>> ready to go, I guess. Also note we already have a BCAlignment  
>>>> class which we could use. However, I have not yet adapted that  
>>>> to use the NSData structure. I didn't write the alignment code,  
>>>> and I don't want to mess up someone else's work ;-)
>>>> I added the representative sequence after looking up some info
>>>> about sequence clusters (see e.g.
>>>> http://en.wikipedia.org/wiki/Sequence_clustering). But if you
>>>> suggest it is not needed in this case, I'll be happy to remove it!
>>> I don't think we need the sequence cluster functionality for the  
>>> BCSequenceGroup class. We could reserve the name  
>>> BCSequenceCluster for such a class we could eventually create in  
>>> the future (if there is any need for this). The BCSequenceCluster  
>>> class could then inherit from BCSequenceGroup.
>>> Peter
>>> Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
>>> _______________________________________________
>>> Biococoa-dev mailing list
>>> Biococoa-dev at bioinformatics.org
>>> https://bioinformatics.org/mailman/listinfo/biococoa-dev
>> --
>> Xgrid-at-Stanford
>> Help science move fast forward:
>> http://cmgm.stanford.edu/~cparnot/xgrid-stanford
>> Charles Parnot
>> charles.parnot at gmail.com

Charles Parnot
charles.parnot at gmail.com
