[Biococoa-dev] More on sequence formats
peter.schols at bio.kuleuven.be
Tue Apr 11 08:58:43 EDT 2006
We were the victims of a burglary on Sunday (quite ironically), so
apologies for the delay in getting back to you.
Congratulations to Koen, I have had a look at the latest rev. and
it's looking very good already!
I'll try to answer your questions, although it has been almost 3
years since I wrote most of the I/O code.
> 1. What's the difference between Nexus and Nexusfileandblocks ?
The Nexus method only returns the 'taxa' (= names) and the sequences.
Because the Nexus format has a very rich vocabulary (organized in
blocks), there is also the Nexusfileandblocks method that not only
returns the above information but also returns a list of the blocks.
These blocks are simply returned as strings (exactly as they are
found in the file format). This should enable other developers to
look for specific blocks and handle them appropriately.
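
To illustrate the idea of returning blocks as raw strings, here is a rough sketch in Python (not the BioCocoa code, which is Objective-C). The function name split_nexus_blocks is made up for this example, and it assumes blocks are delimited by "BEGIN name;" and "END;". It ignores bracketed comments and quoted tokens, which a real reader would have to handle.

```python
import re

# Hypothetical helper, not part of BioCocoa: collect each Nexus block
# as (lowercased block name, raw block text exactly as found in the file).
BLOCK_PATTERN = re.compile(
    r"begin\s+(\w+)\s*;.*?\bend\s*;",
    re.IGNORECASE | re.DOTALL,
)

def split_nexus_blocks(text):
    """Return a list of (name, raw_text) tuples in file order."""
    return [(m.group(1).lower(), m.group(0)) for m in BLOCK_PATTERN.finditer(text)]
```

A caller interested in one block type can then filter the list by name and parse only that raw string.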
> 2. How is the nona format defined, I couldn't find anything about
That's a tough one, because there is no real definition. The nona
format (much like the TNT format) sometimes starts with proc/ although
this is not always the case.
> 3. The MSF file now uses the string "Pileup" as a selector.
> However, when searching for the format definition, I found that
> this format uses a '!!NA' or '!!AA' instead. But I may have found
> the wrong info, so if anyone knows which is correct, please let me
That's right, '!!NA' or '!!AA' are indeed the format identifiers.
See also: http://www.compbio.ox.ac.uk/faq/format_examples.shtml
So it would indeed be better to replace the Pileup check I
implemented with checks for '!!NA' or '!!AA'.
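
As a minimal sketch of such a check (in Python for brevity, with a made-up function name looks_like_msf), assuming the identifier sits at the start of the first non-blank line:

```python
def looks_like_msf(text):
    """Guess whether text is an MSF file from its first non-blank line.

    Assumption: the '!!NA' or '!!AA' identifier opens that line.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # str.startswith accepts a tuple of candidate prefixes.
            return stripped.startswith(("!!NA", "!!AA"))
    return False  # empty or whitespace-only input
```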
> 4. I am thinking about adding a plist file to the framework that
> contains all the file extensions of possible sequence files. This
> can then be used in openPanel (see the code that Alex supplied).
> The nice thing about this is, is that we can synchronize the
> entries with the methods in BCSequenceReader. Any reason I should
> not do this?
Great idea, although it should complement, not replace, the format
checks in the files themselves, at least for the text-based files.
Not all files have an extension, let alone the correct one.
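
One way the two could work together, sketched in Python: look at the content first and fall back to the extension table only when the content is not recognised. The table entries, the format names and both function names are assumptions for illustration, not the contents of any actual plist.

```python
import os

# Illustrative extension table; a real one would come from the plist.
EXTENSION_FORMATS = {".fasta": "fasta", ".fa": "fasta", ".nex": "nexus", ".msf": "msf"}

def sniff_content(text):
    """Return a format name based on how the text starts, or None."""
    head = text.lstrip()
    if head.startswith(">"):
        return "fasta"
    if head.upper().startswith("#NEXUS"):
        return "nexus"
    if head.startswith(("!!NA", "!!AA")):
        return "msf"
    return None

def guess_format(filename, text):
    """Prefer the content check; use the extension only as a fallback."""
    sniffed = sniff_content(text)
    if sniffed is not None:
        return sniffed
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_FORMATS.get(extension)
```

With this ordering a Nexus file saved with a wrong extension is still read as Nexus, while the extension still helps for content the sniffer cannot identify.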
> For those who feel like helping out, the way to implement the code is:
> - remove white lines (optional)
> - get each line
> - extract annotations into a BCAnnotationsArray
> - extract the sequence(s) into an NSString
> - once done with all the sequences, create a BCSequence from each
> - add the annotations to each BCSequence
> - add the new BCSequence(s) to the BCSequenceArray
> - return the BCSequenceArray
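
The steps above can be sketched roughly as follows, using a FASTA-like input as the example format. This is Python rather than Objective-C, and plain dicts and lists stand in for BCSequence, BCAnnotationsArray and BCSequenceArray, so none of the names below are the framework's API.

```python
def read_fasta_like(text):
    """Parse '>name' headers followed by sequence lines into a list of records."""
    # Remove blank lines, then walk the remaining lines one by one.
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    pending = []  # one (annotations, sequence_chunks) pair per sequence
    for line in lines:
        if line.startswith(">"):
            # Extract annotations (here only the name) for a new sequence.
            pending.append(({"name": line[1:].strip()}, []))
        elif pending:
            # Collect sequence data; lines before the first header are ignored.
            pending[-1][1].append(line)

    # Once all sequences are read, build each record with its annotations
    # and add it to the array that is returned.
    sequences = []
    for annotations, chunks in pending:
        sequences.append({"sequence": "".join(chunks), "annotations": annotations})
    return sequences
```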
I would love to help out when I'm done with the insurance forms and