[Biococoa-dev] Even more on sequence formats

Wed Jul 19 06:05:56 EDT 2006

Hi Don,

Many thanks for your comments and suggestions! Peter had already  
implemented most of the popular sequence formats in the original  
BioCocoa framework, but we'll have a look at your Java framework to  
see if we could add or improve relevant sequence formats in BioCocoa.
Again, many thanks for the feedback and keep up the good work!
Best wishes,
Alex

On 14-jul-2006, at 19:16, don gilbert wrote:

> Dear biococoa folks,
>
> > Finally, to come back to the question on the formats, perhaps we
> > can learn from a classic sequence reader package called ReadSeq by
> > d.g.gilbert.
> > I'm not sure where it can be found nowadays, so I put it
> > temporarily on our server for you guys to download:
>
> Readseq is available, as it has been for ~15 years, from
> ftp://iubio.bio.indiana.edu/molbio/readseq/ but see the readseq/java/
> version, which I still update, although I haven't had time recently to
> add new formats. The Java version has more/better formats and
> documentation (and parser fixes), but is also more complex than the C
> version.
>
> The help document you list is from the 1992 C version, which I don't
> update.
> Also,
> ----------------
> PUBLIC DOMAIN NOTICE:
> This software is freely available to the public for use. The
> author, Don Gilbert, of Readseq and the Java package
> 'iubio.readseq' does not place any restriction on its use or
> reproduction. Developers are encouraged to incorporate parts in
> their programs. I would appreciate being cited in any work or
> product based on this material. This software is provided without
> warranty of any kind.
> ------------------
>
> > I guess we have most of them, except for IG, NBRF, Fitch, Zuker,
> > Olsen, ASN.1.
> With the exception of NCBI's ASN.1, which also requires the NCBI toolkit
> to be linked in with readseq, the others are essentially obsolete unless
> one still uses old 1990s-era molbio software (IG = Intelligenetics;
> NBRF = a PIR variant; Fitch comes from some classic molbio software
> (1970s or 1980s?) but is a poor format; Zuker comes from Michael Zuker's
> MULFOLD RNA folding software; Olsen comes from Gary Olsen's phylogeny
> software).
>
> Please use readseq as desired,
> Don Gilbert
>
> oat.% readseq
> Readseq version 2.1.24 (24-May-2006)
>
>   Readseq version 2.1.24 (24-May-2006)
>
>   Read & reformat biosequences, Java command-line version
>   Usage: java -cp readseq.jar run [options] input-file(s)
>   For more details: java -cp readseq.jar help more
>
>   Options
>     -a[ll]              select All sequences
>     -c[aselower]        change to lower case
>     -C[ASEUPPER]        change to UPPER CASE
>     -ch[ecksum]         calculate & print checksum of sequences
>     -degap[=-]          remove gap symbols
>     -f[ormat=]#         Format number for output,  or
>     -f[ormat=]Name      Format name for output
>           see Formats   list below for names and numbers
>     -inform[at]=#       input format number,  or
>     -inform[at]=Name    input format name.  Assume input data is this format
>     -i[tem=2,3,4]       select Item number(s) from several
>     -l[ist]             List sequences only
>     -o[utput=]out.seq   redirect Output
>     -p[ipe]             Pipe (command line, < stdin, > stdout)
>     -r[everse]          reverse-complement of input sequence
>     -t[ranslate=]io     translate input symbol [i] to output symbol [o]
>                         use several -tio to translate several symbols
>     -v[erbose]          Verbose progress
>     -compare=1          Compare two sequence files, reporting differences (flags=nodoc,noid,nolen,nocrc)
>     -amino[translate]   translate dna to amino acids
>
>    Documentation and Feature Table extraction:
>     -feat[ures]=exon,CDS...   extract sequence of selected features
>     -nofeat[ures]=repeat_region,intron... remove sequence of selected features
>     -field=AC,ID...      include selected document fields in output
>     -nofield=COMMENT,... remove selected document fields from output
>
>     -extract=1000..9999  * extract all features, sequence from given base range
>     -subrange=-1000..10  * extract subrange of sequence for feature locations
>     -subrange=1..end
>     -subrange=end-10..end+99
>     -pair=1              * combine features (fff,gff) and sequence files to one output
>     -unpair=1            * split features,sequence from one input to two files
>
>    Pretty format options:
>     -wid[th]=#            sequence line width
>     -tab=#                left indent
>     -col[space]=#         column space within sequence line on output
>     -gap[count]           count gap chars in sequence numbers
>     -nameleft, -nameright[=#]   name on left/right side [=max width]
>     -nametop              name at top/bottom
>     -numleft, -numright   seq index on left/right side
>     -numtop, -numbot      index on top/bottom
>     -match[=.]            use match base for 2..n species
>     -inter[line=#]        blank line(s) between sequence blocks
>
> This program requires a Java runtime (java or jre) program, version
> 1.1.x, 1.2 or later.
> The leading '-' on an option is optional if '=' is present.  All
> non-options (no leading '-' or embedded '=') are used as input file names.
> These options and call format are compatible with the classic readseq
> (v.1992).
> * New experimental feature handling options, may not yet work as desired.
> To test readseq, use: java -cp readseq.jar test
>
>   Known biosequence formats:
>  ID  Name             Read  Write  Int'leaf  Features  Sequence  Suffix    Content-type
>   1  IG|Stanford      yes   yes    --        --        yes       .ig       biosequence/ig
>   2  GenBank|gb       yes   yes    --        yes       yes       .gb       biosequence/genbank
>   3  NBRF             yes   yes    --        --        yes       .nbrf     biosequence/nbrf
>   4  EMBL|em          yes   yes    --        yes       yes       .embl     biosequence/embl
>   5  GCG              yes   yes    --        --        yes       .gcg      biosequence/gcg
>   6  DNAStrider       yes   yes    --        --        yes       .strider  biosequence/strider
>   7  Fitch            --    --     --        --        yes       .fitch    biosequence/fitch
>   8  Pearson|Fasta|fa yes   yes    --        --        yes       .fasta    biosequence/fasta
>   9  Zuker            --    --     --        --        yes       .zuker    biosequence/zuker
>  10  Olsen            --    --     yes       --        yes       .olsen    biosequence/olsen
>  11  Phylip3.2        yes   yes    yes       --        yes       .phylip2  biosequence/phylip2
>  12  Phylip|Phylip4   yes   yes    yes       --        yes       .phylip   biosequence/phylip
>  13  Plain|Raw        yes   yes    --        --        yes       .seq      biosequence/plain
>  14  PIR|CODATA       yes   yes    --        --        yes       .pir      biosequence/codata
>  15  MSF              yes   yes    yes       --        yes       .msf      biosequence/msf
>  16  ASN.1            --    --     --        --        yes       .asn      biosequence/asn1
>  17  PAUP|NEXUS       yes   yes    yes       --        yes       .nexus    biosequence/nexus
>  18  Pretty           --    yes    yes       --        yes       .pretty   biosequence/pretty
>  19  XML              yes   yes    --        yes       yes       .xml      biosequence/xml
>  20  BLAST            yes   --     yes       --        yes       .blast    biosequence/blast
>  21  SCF              yes   --     --        --        yes       .scf      biosequence/scf
>  22  Clustal          yes   yes    yes       --        yes       .aln      biosequence/clustal
>  23  FlatFeat|FFF     yes   yes    --        yes       --        .fff      biosequence/fff
>  24  GFF              yes   yes    --        yes       --        .gff      biosequence/gff
>  25  ACEDB            yes   yes    --        --        yes       .ace      biosequence/acedb
>    (Int'leaf = interleaved format; Features = documentation/features are parsed)
>
> -- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
> -- gilbertd at indiana.edu -- http://marmot.bio.indiana.edu/
>
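
As a rough illustration of how the suffix column and the -informat/-format/-output options in the help text quoted above fit together, here is a minimal Python sketch. It only builds the argument list for the documented call "java -cp readseq.jar run [options] input-file(s)" and does not run anything. The function names (guess_input_format, build_readseq_command) are hypothetical, and it is an assumption that readseq accepts any one of the "|"-separated aliases from the table as a format name; only formats the table marks as readable are included.

```python
import os

# Suffix -> format name, taken from the "Known biosequence formats" table
# (readable formats only). Assumption: any one alias from the table's Name
# column is accepted by readseq as a format name.
READABLE_FORMATS = {
    ".ig": "IG", ".gb": "GenBank", ".nbrf": "NBRF", ".embl": "EMBL",
    ".gcg": "GCG", ".strider": "DNAStrider", ".fasta": "Fasta",
    ".phylip2": "Phylip3.2", ".phylip": "Phylip", ".seq": "Plain",
    ".pir": "PIR", ".msf": "MSF", ".nexus": "NEXUS", ".xml": "XML",
    ".blast": "BLAST", ".scf": "SCF", ".aln": "Clustal", ".fff": "FFF",
    ".gff": "GFF", ".ace": "ACEDB",
}


def guess_input_format(path):
    """Hypothetical helper: map a file suffix to a format name from the table."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix not in READABLE_FORMATS:
        raise ValueError("no readable format listed for suffix %r" % suffix)
    return READABLE_FORMATS[suffix]


def build_readseq_command(input_path, output_format, output_path,
                          jar="readseq.jar"):
    """Hypothetical helper: build (but do not run) a readseq argument list."""
    return [
        "java", "-cp", jar, "run",
        "-informat=" + guess_input_format(input_path),  # -inform[at]=Name
        "-format=" + output_format,                     # -f[ormat=]Name
        "-output=" + output_path,                       # -o[utput=]out.seq
        input_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_readseq_command("sample.gb", "Fasta", "out.seq")))
```

The list form can be handed to subprocess.run as-is if readseq.jar is available locally; the long option spellings are used because the help text marks the "ormat=" and "utput=" parts as optional.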

*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
              The Netherlands Cancer Institute
              Department of Tumorbiology (H4)
         Plesmanlaan 121, 1066 CX, Amsterdam
                    Tel:  + 31 20 - 512 2023
                    Fax:  + 31 20 - 512 2029
                    AIM: mekentosj at mac.com
                    E-mail: a.griekspoor at nki.nl
                Web: http://www.mekentosj.com

       Microsoft is not the answer,
       Microsoft is the question,
       NO is the answer

*********************************************************
