[Biococoa-dev] More on sequence formats

Peter Schols peter.schols at bio.kuleuven.be
Tue Apr 11 08:58:43 EDT 2006

Hi guys,

We were the victims of a burglary on Sunday (quite ironically), so  
apologies for the delay in getting back.
Congratulations to Koen, I have had a look at the latest rev. and  
it's looking very good already!

I'll try to answer your questions, although it has been almost 3  
years since I wrote most of the I/O code.

> 1. What's the difference between Nexus and Nexusfileandblocks ?

The Nexus method only returns the 'taxa' (= names) and the sequences.  
Because the Nexus format has a very rich vocabulary (organized in  
blocks), there is also the Nexusfileandblocks method that not only  
returns the above information but also returns a list of the blocks.  
These blocks are simply returned as strings (exactly as they are  
found in the file format). This should enable other developers to  
look for specific blocks and handle them appropriately.
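
To illustrate the idea of handing back raw blocks, here is a minimal sketch in Python (not the BioCocoa code itself). It assumes blocks are delimited by "BEGIN <name>;" and "END;" (or "ENDBLOCK;") at the start of a line; the function names nexus_blocks and find_block are hypothetical.

```python
import re

# Assumption: each block starts with "BEGIN <name>;" and ends with
# "END;" or "ENDBLOCK;" at the start of a line, case-insensitively.
BLOCK_RE = re.compile(
    r"^[ \t]*begin\s+(\w+)\s*;.*?^[ \t]*end(?:block)?\s*;",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def nexus_blocks(text):
    """Return (lowercased block name, raw block text) pairs in file order."""
    return [(m.group(1).lower(), m.group(0)) for m in BLOCK_RE.finditer(text)]


def find_block(text, name):
    """Return the raw text of the first block called `name`, or None."""
    for block_name, raw in nexus_blocks(text):
        if block_name == name.lower():
            return raw
    return None
```

A caller interested in, say, a trees block could call find_block(text, "trees") and parse the returned string however it likes.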

> 2. How is the nona format defined, I couldn't find anything about  
> this?

That is a tough one, because there is no real definition. The nona format  
(much like the TNT format) sometimes starts with proc/ although this  
is not always the case.

> 3. The MSF file now uses the string "Pileup" as a selector.  
> However, when searching for the format definition, I found that  
> this format uses a '!!NA' or '!!AA' instead. But I may have found  
> the wrong info, so if anyone knows which is correct, please let me  
> know.

That's right, '!!NA' and '!!AA' are indeed the format identifiers.  
See also: http://www.compbio.ox.ac.uk/faq/format_examples.shtml
So it would be better to replace the Pileup check that I implemented  
with checks for '!!NA' or '!!AA'.
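
A minimal sketch of such a check, in Python rather than in the framework's own code: it looks only at the first non-blank line. The function name looks_like_msf is hypothetical, and a file that lacks the '!!' header line would not be recognized by this check.

```python
def looks_like_msf(text):
    """True if the first non-blank line starts with '!!NA' or '!!AA'."""
    for line in text.splitlines():
        if line.strip():
            # Only the first non-blank line is inspected.
            return line.lstrip().startswith(("!!NA", "!!AA"))
    return False  # empty or whitespace-only input
```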

> 4. I am thinking about adding a plist file to the framework that  
> contains all the file extensions of possible sequence files. This  
> can then be used in openPanel (see the code that Alex supplied).  
> The nice thing about this is, is that we can synchronize the  
> entries with the methods in BCSequenceReader. Any reason I should  
> not do this?

Great idea, although it should complement - and not replace - the  
format checks on the file contents, at least for the text-based  
files. Not all files use (the correct) extensions.
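
As a rough sketch of how the two could work together (Python, with made-up names): the file content is checked first, and the extension is only a fallback. The EXTENSIONS table stands in for the proposed plist and its entries are assumptions, as are the SIGNATURES list and the guess_format function.

```python
import os

# Hypothetical extension table, standing in for the proposed plist.
EXTENSIONS = {
    ".nex": "nexus",
    ".nxs": "nexus",
    ".msf": "msf",
    ".fasta": "fasta",
    ".fa": "fasta",
}

# Leading markers checked against the file content itself.
SIGNATURES = [
    ("#NEXUS", "nexus"),
    ("!!NA", "msf"),
    ("!!AA", "msf"),
    (">", "fasta"),
]


def guess_format(filename, text):
    """Prefer the content signature; fall back to the file extension."""
    head = text.lstrip().upper()
    for marker, fmt in SIGNATURES:
        if head.startswith(marker):
            return fmt
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)  # None when nothing matches
```

With this ordering, a FASTA file that was wrongly named "wrong.msf" is still read as FASTA.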

> For those who feel like helping out, the way to implement the code is:
> - remove white lines (optional)
> - get each line
> - extract annotations into a BCAnnotationsArray
> - extract the sequence(s) into an NSString
> - once done with all the sequences, create a BCSequence from each  
> sequenceString
> - add the annotations to each BCSequence
> - add the new BCSequence(s) to the BCSequenceArray
> - return the BCSequenceArray
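
As a rough illustration of those steps (a Python sketch, not BioCocoa code), here they are applied to FASTA-style input. The Sequence class is a hypothetical stand-in for BCSequence, a plain dict stands in for the annotations, and a plain list stands in for the BCSequenceArray.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Hypothetical stand-in for BCSequence."""
    residues: str
    annotations: dict = field(default_factory=dict)


def read_fasta(text):
    """Follow the listed steps for FASTA-style input; return a list of Sequence."""
    # Remove blank lines, then walk the remaining lines one by one.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []  # one (annotations, sequence chunks) pair per entry
    for line in lines:
        if line.startswith(">"):
            # Header line: extract the annotation and start a new entry.
            records.append(({"name": line[1:].strip()}, []))
        elif records:
            # Sequence line: collect it as a string chunk.
            records[-1][1].append(line)
    # Once all lines are consumed, build one Sequence per entry
    # and attach its annotations.
    return [Sequence("".join(chunks), ann) for ann, chunks in records]
```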

I would love to help out when I'm done with the insurance forms and  
other paperwork...

