[Biococoa-dev] Even more on sequence formats

Tue Apr 11 05:45:31 EDT 2006

Hi Koen,

Good work on the sequence reader!
>
> I have a couple of more questions before I can continue
>
> 1. What's the difference between Nexus and Nexusfileandblocks ?
>
> 2. How is the nona format defined, I couldn't find anything about  
> this?
>
> 3. The MSF file now uses the string "Pileup" as a selector.  
> However, when searching for the format definition, I found that  
> this format uses a '!!NA' or '!!AA' instead. But I may have found  
> the wrong info, so if anyone knows which is correct, please let me  
> know.
>
Peter should know the answers to 1-3, I guess; he wrote the
original stuff ;-).

> 4. I am thinking about adding a plist file to the framework that  
> contains all the file extensions of possible sequence files. This  
> can then be used in openPanel (see the code that Alex supplied).  
> The nice thing about this is that we can synchronize the
> entries with the methods in BCSequenceReader. Any reason I should  
> not do this?

Sounds great. It also lets us update BioCocoa with new read
methods, so existing programs only need to swap in the new framework
to gain support for them.
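
A minimal sketch of the plist idea, written in Python rather than the framework's Objective-C, purely for illustration. The keys (`extensions`, `reader`), the format entries, and the reader method names are all hypothetical and not BioCocoa's actual layout. The point is only that one file can drive both the open panel's allowed types and the choice of read method.

```python
import plistlib

# Hypothetical plist content: one entry per format, listing its file
# extensions and the (made-up) name of the reader method that handles it.
FORMATS_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Fasta</key>
    <dict>
        <key>extensions</key>
        <array><string>fasta</string><string>fa</string></array>
        <key>reader</key>
        <string>readFastaFile</string>
    </dict>
    <key>GenBank</key>
    <dict>
        <key>extensions</key>
        <array><string>gb</string></array>
        <key>reader</key>
        <string>readGenBankFile</string>
    </dict>
    <key>MSF</key>
    <dict>
        <key>extensions</key>
        <array><string>msf</string></array>
        <key>reader</key>
        <string>readMSFFile</string>
    </dict>
</dict>
</plist>
"""


def load_formats(data):
    """Parse the plist bytes into a plain dictionary."""
    return plistlib.loads(data)


def allowed_extensions(formats):
    """Flat, sorted list of extensions, e.g. to hand to an open panel."""
    return sorted({ext for entry in formats.values()
                   for ext in entry["extensions"]})


def reader_for(formats, filename):
    """Look up the reader name for a file by its (case-insensitive) extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    for entry in formats.values():
        if ext in entry["extensions"]:
            return entry["reader"]
    return None
```

Adding a format then means adding one plist entry plus its reader method, and the extension list stays in sync automatically.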

I'm also fixing some stuff for EnzymeX in the reader classes. The
moment it's done I'll send you (Koen) the changes so you can sync
them with your work:
- I've added support for reading sequence files that were saved as
RTF rather than plain text: the reader now checks for RTF and converts
the file to plain text before continuing with the normal format
determination
- I've changed the raw reading method so that it is more
greedy. Peter's variant reads in all lines as separate entries in the
matrix dictionary, which is probably what you want for aligned
phylogenetic sequence files, but not in EnzymeX, where people usually
read in a single-sequence file. So I remove all return characters
first. I had one person complaining that EnzX only read the first
line when he tried to open his text file, nicely formatted in
80-character columns.
Now the question is what the difference
- I am fixing the binary file format reading (Strider/GCK) to make
it universally compatible; currently it fails on Intel-based Macs
due to endian issues
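
The three reader fixes listed above can be sketched roughly as follows. This is an illustration in Python, not the actual BioCocoa code. The function names are hypothetical, the RTF part only shows the detection step (the conversion to plain text is left out), and the byte-order helper assumes the Strider/GCK fields are stored big-endian, which the failure on Intel suggests but which should be checked against real files.

```python
import struct


def looks_like_rtf(data: bytes) -> bool:
    """RTF documents start with the control word "{\\rtf"."""
    return data.lstrip().startswith(b"{\\rtf")


def read_raw_greedy(text: str) -> str:
    """Treat the whole file as ONE sequence.

    Removes every line break so an 80-column text file is read completely.
    This sketch also drops spaces and tabs, which goes slightly beyond
    "remove all return characters".
    """
    return "".join(text.split())


def read_be_uint32(data: bytes, offset: int = 0) -> int:
    """Read a 32-bit unsigned integer with an explicit byte order.

    ">" forces big-endian no matter which CPU the code runs on, so the
    result is the same on PowerPC and Intel. Reading the bytes in native
    order is what breaks on little-endian machines.
    """
    return struct.unpack_from(">I", data, offset)[0]
```

For example, `read_raw_greedy("ATGC\r\nGGCC\n")` gives `"ATGCGGCC"`, and `read_be_uint32(b"\x00\x00\x01\x00")` gives 256 on any machine.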

Finally, to come back to the question on the formats, perhaps we can
learn from a classic sequence reader package called ReadSeq by
D. G. Gilbert.
It reads the following formats, which are outlined in the Formats
text file inside the src folder:
          1. IG/Stanford           10. Olsen (in-only)
          2. GenBank/GB            11. Phylip3.2
          3. NBRF                  12. Phylip
          4. EMBL                  13. Plain/Raw
          5. GCG                   14. PIR/CODATA
          6. DNAStrider            15. MSF
          7. Fitch                 16. ASN.1
          8. Pearson/Fasta         17. PAUP
          9. Zuker (in-only)       18. Pretty (out-only)

Some of them we support, but some we don't, so we could even add a
few formats, and the source code nicely shows how to discriminate
between them.
The latest version switched from C to Java and added even a few more
formats, so there's plenty to add ;-) The source also contains many
sample files for testing purposes.
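
To make "discriminating" concrete, here is a small sketch of format sniffing on the first lines of a file. It is not taken from ReadSeq or BioCocoa; the function name and the returned labels are made up, and real files need more careful checks. It accepts both MSF variants raised in question 3: the newer "!!NA"/"!!AA" first line and the older "PileUp" header with its "MSF: ... .." line.

```python
def guess_format(text: str) -> str:
    """Guess a sequence file format from its leading lines (rough heuristic)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0].lstrip()

    # NBRF/PIR entries start with ">", a two-character type code, then ";"
    # (for example ">P1;"). This must be tested before plain FASTA.
    if len(first) > 3 and first[0] == ">" and first[3] == ";":
        return "pir"
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith("ID   "):
        return "embl"
    if first.upper().startswith("#NEXUS"):
        return "nexus"
    if first.startswith(("!!NA", "!!AA", "PileUp")):
        return "msf"
    # Older MSF files may start with free text; look for the header line
    # that contains "MSF:" and ends with "..".
    if any("MSF:" in ln and ln.rstrip().endswith("..") for ln in lines[:20]):
        return "msf"
    # Nothing recognised: fall back to plain/raw sequence text.
    return "raw"
```

The order of the checks matters, since several formats share a leading character; anything unrecognised falls through to raw.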
I'm not sure where it can be found nowadays, so I put it temporarily  
on our server for you guys to download:
http://www.mekentosj.com/temporary/readseq.zip
Have a look at it and tell me what you think.
Cheers,
Alex

> For those who feel like helping out, the way to implement the code is:
>
> - remove white lines (optional)
> - get each line
> - extract annotations into a BCAnnotationsArray
> - extract the sequence(s) into an NSString
> - once done with all the sequences, create a BCSequence from each  
> sequenceString
> - add the annotations to each BCSequence
> - add the new BCSequence(s) to the BCSequenceArray
> - return the BCSequenceArray
>
>
> cheers,
>
> - Koen.
> _______________________________________________
> Biococoa-dev mailing list
> Biococoa-dev at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/biococoa-dev
>
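
The steps Koen lists above can be sketched for one concrete format. The example below uses FASTA and is written in Python for illustration only. `Sequence` is a hypothetical stand-in for BCSequence, a plain list stands in for BCSequenceArray, and a dictionary stands in for the annotations array; none of this is the framework's real API.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Stand-in for BCSequence: residues plus free-form annotations."""
    residues: str
    annotations: dict = field(default_factory=dict)


def read_fasta(text):
    """Parse FASTA text into a list of Sequence objects."""
    records = []  # one (annotations, sequence chunks) pair per entry
    for line in text.splitlines():       # get each line
        line = line.strip()
        if not line:                     # remove white lines
            continue
        if line.startswith(">"):         # extract annotations
            records.append(({"name": line[1:].strip()}, []))
        elif records:                    # extract the sequence text
            records[-1][1].append(line)

    sequences = []
    for annotations, chunks in records:
        # Only build the objects once all the text has been collected.
        seq = Sequence("".join(chunks))
        seq.annotations.update(annotations)  # add the annotations
        sequences.append(seq)                # add to the result array
    return sequences
```

Calling `read_fasta(">one desc\nATG\nCCC\n\n>two\nGGG\n")` returns two sequences with residues "ATGCCC" and "GGG"; the first carries the annotation name "one desc".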

**************************************************************
                         ** Alexander Griekspoor **
**************************************************************
                  The Netherlands Cancer Institute
                  Department of Tumorbiology (H4)
             Plesmanlaan 121, 1066 CX, Amsterdam
                        Tel:  + 31 20 - 512 2023
                        Fax:  + 31 20 - 512 2029
                       AIM: mekentosj at mac.com
                       E-mail: a.griekspoor at nki.nl
                    Web: http://www.mekentosj.com

MacOS X: The power of UNIX with the simplicity of the Mac

***************************************************************

