[Biococoa-dev] BCSequenceReader

Sat Nov 13 09:44:30 EST 2004

> 
> On Nov 10, 2004, at 7:42 PM, Koen van der Drift wrote:
> 
>> But how do I figure out which of the two I am dealing with, so I can
>> return the proper subclass of BCSequence?
> 
> 
> I asked the same question on the biojava mailinglist, just to get an
> idea how they have solved this. And guess what? They haven't ;-)

I wrote everything below (which I'm including just for other ideas if this
turns out to be a bad one) when suddenly a simple answer hit me.

All the sequence classes use [symbol undefined] of the appropriate subclass
if they hit a character they can't recognize.  Koen also put the
sequenceCountedSet code in.  Simply send the string to each of the three
sequence classes, then use the counted set to determine the one which
results in the fewest undefined symbols.  If the number turns out to be
equal,  use DNA > RNA > protein to decide which sequence to use so that we
can stay within the central dogma.

The code should be very clean and easy to follow, though it may not be as
fast as I'd like, given there's three sequence objects created and looped
through.

My previous thoughts follow - disregard them unless you think the above is a
bad idea:

With the sequence in an uppercased NSString, try:

"rangeOfCharactersInSet:"  using "ATCGN" - >DNA if range.length =
string.length

"rangeOfCharactersInSet:"  using "AUCGN" - >RNA if range.length =
string.length

That'll handle the easiest cases first and keep things responsive if we have
a simple case.  After that, I think a loop using that method could be the
quickest way to figure out the AT/UCGN percentage - the logic would look
like:

Get range
Add range.length to total

Loop {
    Get range starting from range.length + range.location + 1
    Add range.length to total
    make sure you're not at the end of the string
}
Percent = total/string length

We could just declare a cutoff - anything over 80% would be made into a
nucleotide, otherwise a protein.

Where this is going to work poorly is very short sequences, like restriction
sites - I think we should only enter this code if the sequence is over 10bp
or so.  Maybe we should just treat anything under 10 characters as a
protein?

One other thought - I know the nucleotides have a non-base character, and
you also have code for

And that guy who answered your email was VERY optimistic in assuming there's
an accession number in the comment field....

Cheers,

Jay

_______________________________________________
This mind intentionally left blank