[Biococoa-dev] More on BCSymbolSets

Fri Mar 4 19:08:02 EST 2005

On Mar 4, 2005, at 10:50 AM, John Timmer wrote:

>

> For example, let's say we provide all combinations of symbol sets using
> only the single bases (ATCG), those plus N, those plus N and gap, those
> plus N, gap, and undefined, etc.  You're easily up to about a dozen
> symbol sets for DNA alone.  Then you add RNA, and protein, and you're
> probably in the area of 25.

But each set will only have nucleotides or amino acids. They are not 
intended to mix.

>
> Now, you need to do a restriction digest.  That only works with DNA, so
> you need to know if you have a DNA sequence.  There's no easy way to do
> this with just a symbol set.

Why not? Just test whether the sequence's symbol set contains 
nucleotides, excluding 'U'. No need to go through each symbol in every 
symbol set. We could even move the sequenceType to the BCSymbolSet 
class, as suggested by Charles. That way we just need a convenience 
method to check.
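
To make that concrete, here is a rough sketch of a symbol set that carries 
its own sequence type. It is written in Python only to keep it short; it 
is not BioCocoa code, and every name in it (SequenceType, SymbolSet, 
contains_nucleotides, is_dna, can_digest, and the example sets) is made 
up for illustration.

```python
from enum import Enum


class SequenceType(Enum):
    DNA = "dna"
    RNA = "rna"
    PROTEIN = "protein"


class SymbolSet:
    """Hypothetical symbol set that knows which kind of sequence it describes."""

    def __init__(self, symbols, sequence_type):
        self.symbols = frozenset(symbols)
        self.sequence_type = sequence_type

    def contains_nucleotides(self):
        # True for DNA and RNA sets; no iteration over the symbols is needed.
        return self.sequence_type in (SequenceType.DNA, SequenceType.RNA)

    def is_dna(self):
        # Nucleotides, excluding 'U'.
        return self.sequence_type is SequenceType.DNA


# A few of the many possible sets (illustrative only).
DNA_STRICT = SymbolSet("ACGT", SequenceType.DNA)
DNA_WITH_N_AND_GAP = SymbolSet("ACGTN-", SequenceType.DNA)
RNA_STRICT = SymbolSet("ACGU", SequenceType.RNA)
PROTEIN_STRICT = SymbolSet("ACDEFGHIKLMNPQRSTVWY", SequenceType.PROTEIN)


def can_digest(symbol_set):
    """A restriction digest only makes sense for DNA."""
    return symbol_set.is_dna()
```

With this layout the check costs the same no matter how many symbol sets 
exist, since it never compares against the other sets or walks the symbols.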

> You'd have to either iterate through all its symbols and determine
> whether they're all DNA nucleotides, or iterate through all the DNA
> symbol set singletons and test for equality to the set that the
> sequence is using.  Translation's even worse, since it works with DNA
> and RNA.

Actually, that's even easier: you only need to check whether the 
symbol set contains nucleotides!

>
> I don't see how you can avoid iteration, but you feel you can, so maybe
> I'm missing something.  Your alternative, "containsNucleotides", is
> fine, but we already have the other system in place - it's simple, and
> it works, so I don't see the need to redo it.

What other system are you referring to?

>
>
> Anyway, as an aside, I've been thinking that the symbol set structure
> would allow for a nice encapsulation of a genetic code.  The problem is
> that codons aren't symbols (since they have both amino acid and
> nucleotide information).  Any suggestions on how to adapt things?

I'm not saying this is how we should do it, but I remember that BioJava 
uses cross-product alphabets for this. While googling for that, I found 
this short explanation:

http://www.biojava.org/docs/bj_in_anger/crossProd.htm
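
Just to illustrate the general idea (this is not BioJava's API, and not a 
proposal for BioCocoa either): a codon alphabet can be built as the 
product of a nucleotide alphabet with itself three times, and a genetic 
code is then a mapping from that product alphabet to amino acid symbols. 
A rough Python sketch follows; the names RNA, AMINO_ACIDS, CODONS, 
GENETIC_CODE and translate are made up, and the table is the standard 
genetic code with the bases in U, C, A, G order.

```python
from itertools import product

# Nucleotide alphabet, in the order the standard code table is usually written.
RNA = ("U", "C", "A", "G")

# Standard genetic code, one letter per codon ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Cross-product alphabet: every ordered triple of nucleotides is one codon symbol.
CODONS = tuple(product(RNA, repeat=3))

# The genetic code maps each symbol of the product alphabet to an amino acid.
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))


def translate(rna):
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    codons = (tuple(rna[i:i + 3]) for i in range(0, usable, 3))
    return "".join(GENETIC_CODE[codon] for codon in codons)
```

Here each codon is a single symbol of the product alphabet, so the 
nucleotide information stays inside the symbol and the amino acid 
information lives in the mapping.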

I also came across this from our friend at Biopython, who went through 
the same process of finding out what would be the best way to implement 
the various sequences:

http://biopython.org/pipermail/biopython/2000-March/000190.html

- Koen.