[Biococoa-dev] Even more on sequence formats
Alexander Griekspoor
a.griekspoor at nki.nl
Tue Apr 11 05:45:31 EDT 2006
Hi Koen,
Good work on the sequence reader!
>
> I have a couple more questions before I can continue
>
> 1. What's the difference between Nexus and Nexusfileandblocks ?
>
> 2. How is the nona format defined, I couldn't find anything about
> this?
>
> 3. The MSF file now uses the string "Pileup" as a selector.
> However, when searching for the format definition, I found that
> this format uses a '!!NA' or '!!AA' instead. But I may have found
> the wrong info, so if anyone knows which is correct, please let me
> know.
>
On 1-3, I guess Peter should know most of the answers; he wrote the
original stuff ;-).
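Until question 3 is settled, one option is to accept either marker.
A minimal sketch, assuming the first non-blank line of the file
carries the selector; looks_like_msf is a hypothetical helper name,
not an existing BioCocoa method:

```python
def looks_like_msf(text):
    """Return True if the first non-blank line looks like an MSF header.

    Accepts both selectors mentioned in this thread: the "Pileup"
    string used so far, and the '!!NA' / '!!AA' prefix Koen found.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(("!!NA", "!!AA")):
            return True
        # Compare case-insensitively so "PileUp" and "Pileup" both match.
        return stripped.lower().startswith("pileup")
    return False  # empty input
```

Whether both markers occur in real files is exactly the open
question, so the sketch does not take sides.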
> 4. I am thinking about adding a plist file to the framework that
> contains all the file extensions of possible sequence files. This
> can then be used in openPanel (see the code that Alex supplied).
> The nice thing about this is that we can synchronize the
> entries with the methods in BCSequenceReader. Any reason I should
> not do this?
Sounds great. That also lets us update BioCocoa with new read
methods, so existing programs can just swap in the new framework and
gain support for the new formats.
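To make the plist idea concrete, here is a rough sketch in Python
(used only because it is easy to run; the framework itself would do
this in Cocoa). The plist layout, the extensions, and the reader
method names are all assumptions made up for illustration:

```python
import plistlib

# Hypothetical plist: format name -> file extensions and the reader
# method that handles them. None of these keys or names are taken
# from the actual framework.
FORMATS_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Fasta</key>
    <dict>
        <key>extensions</key>
        <array><string>fasta</string><string>fa</string></array>
        <key>reader</key>
        <string>readFastaFile:</string>
    </dict>
    <key>GenBank</key>
    <dict>
        <key>extensions</key>
        <array><string>gb</string><string>gbk</string></array>
        <key>reader</key>
        <string>readGenBankFile:</string>
    </dict>
</dict>
</plist>
"""

formats = plistlib.loads(FORMATS_PLIST)


def allowed_extensions(formats):
    """Flat, sorted list of extensions, e.g. to hand to an open panel."""
    return sorted({ext for info in formats.values()
                   for ext in info["extensions"]})


def reader_for_extension(formats, ext):
    """Name of the reader method registered for an extension, or None."""
    ext = ext.lower().lstrip(".")
    for info in formats.values():
        if ext in info["extensions"]:
            return info["reader"]
    return None
```

Keeping the extensions and the reader names in one file is what
keeps the open panel and the reader methods in sync.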
I'm also fixing some stuff for EnzymeX in the reader classes. The
moment it's done I'll send you (Koen) the changes so you can sync
them with your work:
- I've added support for reading sequence files that were saved as
RTF rather than plain text: basically a check that converts the
file to plain text before continuing with the normal format
determination.
- I've changed the raw reading method so that it is more greedy.
Peter's variant reads each line in as a separate entry in the
matrix dictionary (which is probably what you want for aligned
phylogenetic sequence files, but not in EnzymeX, where people usually
read in a single-sequence file). So I remove all return characters
first. One person complained that EnzX only read the first line
when he tried to open a text file nicely formatted in 80-character
columns.
Now the question is what the difference
- I am fixing the binary file format reading (Strider/GCK) to make
it universally compatible; currently it fails on Intel-based Macs
due to endian issues.
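The three reader fixes listed above can be sketched roughly as
follows. This is illustrative Python, not the actual reader code. All
function names are hypothetical, and the big-endian byte order is an
assumption based on these formats coming from pre-Intel Macs:

```python
import struct


def is_rtf(data: bytes) -> bool:
    """RTF documents open with the control word '{\\rtf'."""
    return data.lstrip().startswith(b"{\\rtf")


def join_raw_sequence(text: str) -> str:
    """Greedy raw read: drop all return characters so a sequence
    wrapped over many lines becomes one sequence string."""
    return text.replace("\r", "").replace("\n", "")


def read_uint32_be(data: bytes, offset: int = 0) -> int:
    """Read a 32-bit unsigned integer with an explicit byte order.

    ">" forces big-endian (assumed here), so the result is the same
    on PowerPC and Intel instead of following the host CPU.
    """
    return struct.unpack_from(">I", data, offset)[0]
```

The RTF check only detects the format; the conversion to plain text
itself is left out of the sketch.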
Finally, to come back to the question on the formats, perhaps we can
learn from a classic sequence reader package called ReadSeq by
D. G. Gilbert.
It reads the following formats, which are outlined in the Formats
text file inside the src folder:
 1. IG/Stanford          10. Olsen (in-only)
 2. GenBank/GB           11. Phylip3.2
 3. NBRF                 12. Phylip
 4. EMBL                 13. Plain/Raw
 5. GCG                  14. PIR/CODATA
 6. DNAStrider           15. MSF
 7. Fitch                16. ASN.1
 8. Pearson/Fasta        17. PAUP
 9. Zuker (in-only)      18. Pretty (out-only)
We support some of them but not all, so we could add a few
formats, and the source code nicely shows how to discriminate them.
The latest version switched from C to Java and added even a few more
formats, so there's plenty to add ;-) The source also contains many
sample files for testing purposes.
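As an idea of what such discrimination can look like, here is a
minimal sketch that guesses the format from the first non-blank
line. It is not ReadSeq's logic, just the well-known leading markers
of a few formats; guess_format is a hypothetical name:

```python
def guess_format(text):
    """Guess a sequence format from the first non-blank line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    first = lines[0]
    if first.upper().startswith("#NEXUS"):
        return "Nexus"
    if first.startswith("LOCUS"):
        return "GenBank"
    if first.startswith("ID   "):
        return "EMBL"
    # NBRF/PIR headers look like ">P1;name": '>' plus a two-letter
    # type code and a semicolon. Check this before plain Fasta.
    if first.startswith(">") and len(first) > 3 and first[3] == ";":
        return "NBRF/PIR"
    if first.startswith(">"):
        return "Fasta"
    # Phylip files open with two integers: taxa count and site count.
    parts = first.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "Phylip"
    return "Plain/Raw"
```

The order of the checks matters: the more specific markers have to
be tested before the generic ones, and raw text is the fallback.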
I'm not sure where it can be found nowadays, so I put it temporarily
on our server for you guys to download:
http://www.mekentosj.com/temporary/readseq.zip
Have a look at it and tell me what you think.
Cheers,
Alex
> For those who feel like helping out, the way to implement the code is:
>
> - remove white lines (optional)
> - get each line
> - extract annotations into a BCAnnotationsArray
> - extract the sequence(s) into an NSString
> - once done with all the sequences, create a BCSequence from each
> sequenceString
> - add the annotations to each BCSequence
> - add the new BCSequence(s) to the BCSequenceArray
> - return the BCSequenceArray
>
>
> cheers,
>
> - Koen.
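Koen's steps above can be sketched for a Fasta-like file as follows.
This is illustrative Python, not framework code: the Sequence class
is a hypothetical stand-in for BCSequence, a plain dict stands in for
a BCAnnotationsArray, and a plain list for the BCSequenceArray:

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Hypothetical stand-in for BCSequence."""
    residues: str
    annotations: dict = field(default_factory=dict)


def read_fasta_like(text):
    # Remove white lines, then walk over each remaining line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    annotations = []       # one dict of annotations per sequence
    sequence_strings = []  # one growing string per sequence
    for line in lines:
        if line.startswith(">"):
            # Header line: extract the annotation, start a new sequence.
            annotations.append({"title": line[1:].strip()})
            sequence_strings.append("")
        elif sequence_strings:
            # Sequence line: append to the current sequence string.
            sequence_strings[-1] += line

    # Once all lines are read, build the sequence objects, attach
    # their annotations, and collect them in the array to return.
    sequences = []
    for residues, notes in zip(sequence_strings, annotations):
        seq = Sequence(residues)
        seq.annotations.update(notes)
        sequences.append(seq)
    return sequences
```

Collecting plain strings first and creating the sequence objects
only at the end follows the order of the steps in the list.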
**************************************************************
** Alexander Griekspoor **
**************************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
MacOS X: The power of UNIX with the simplicity of the Mac
***************************************************************