[Biococoa-dev] Even more on sequence formats

don gilbert gilbertd at indiana.edu
Fri Jul 14 13:16:34 EDT 2006

Dear biococoa folks,

 > Finally, to come back to the question on the formats, perhaps we
 > can learn from a classic sequence reader package called ReadSeq by
 > d.g.gilbert.
 > I'm not sure where it can be found nowadays, so I put it
 > temporarily on our server for you guys to download:

Readseq is available, as it has been for ~15 years, from
ftp://iubio.bio.indiana.edu/molbio/readseq/ but see readseq/java/ for the
Java version, which I still update, though I haven't had time recently to
add new formats. The Java version has more/better formats and
documentation (and parser
but is also more complex than the C version.

The help document you list is from the 1992 C version, which I don't 
This software is freely available to the public for use. The
author, Don Gilbert, of Readseq and the Java package
'iubio.readseq' does not place any restriction on its use or
reproduction. Developers are encouraged to incorporate parts in
their programs. I would appreciate being cited in any work or
product based on this material. This software is provided without
warranty of any kind.

 > I guess we have most of them, except for IG, NBRF, Fitch, Zuker,
 > Olsen, ASN.1.
With the exception of NCBI's ASN.1, which also requires the NCBI toolkit
to be linked in with readseq, the others are essentially obsolete unless
one still uses old 1990s-era molbio software (IG = Intelligenetics;
NBRF = a PIR variant;
Fitch from some classic molbio software (1970s? or 1980s) but a poor 
Zuker from Michael Zuker's MULFOLD RNA folding software; Olsen from
Gary Olsen's phylogeny software).

Please use readseq as desired,
Don Gilbert

oat.% readseq
Readseq version 2.1.24 (24-May-2006)

   Readseq version 2.1.24 (24-May-2006)

   Read & reformat biosequences, Java command-line version
   Usage: java -cp readseq.jar run [options] input-file(s)
   For more details: java -cp readseq.jar help more

     -a[ll]              select All sequences
     -c[aselower]        change to lower case
     -C[ASEUPPER]        change to UPPER CASE
     -ch[ecksum]         calculate & print checksum of sequences
     -degap[=-]          remove gap symbols
     -f[ormat=]#         Format number for output,  or
     -f[ormat=]Name      Format name for output
           see Formats   list below for names and numbers
     -inform[at]=#       input format number,  or
     -inform[at]=Name    input format name.  Assume input data is this 
     -i[tem=2,3,4]       select Item number(s) from several
     -l[ist]             List sequences only
     -o[utput=]out.seq   redirect Output
     -p[ipe]             Pipe (command line, < stdin, > stdout)
     -r[everse]          reverse-complement of input sequence
     -t[ranslate=]io     translate input symbol [i] to output symbol [o]
                         use several -tio to translate several symbols
     -v[erbose]          Verbose progress
     -compare=1          Compare two sequence files, reporting differences (flags=nodoc,noid,nolen,nocrc)
     -amino[translate]   translate dna to amino acids

    Documentation and Feature Table extraction:
     -feat[ures]=exon,CDS...   extract sequence of selected features
     -nofeat[ures]=repeat_region,intron... remove sequence of selected 
     -field=AC,ID...      include selected document fields in output
     -nofield=COMMENT,... remove selected document fields from output

     -extract=1000..9999  * extract all features, sequence from given base range
     -subrange=-1000..10  * extract subrange of sequence for feature 
     -pair=1              * combine features (fff,gff) and sequence files to one output
     -unpair=1            * split features,sequence from one input to two files

    Pretty format options:
     -wid[th]=#            sequence line width
     -tab=#                left indent
     -col[space]=#         column space within sequence line on output
     -gap[count]           count gap chars in sequence numbers
     -nameleft, -nameright[=#]   name on left/right side [=max width]
     -nametop              name at top/bottom
     -numleft, -numright   seq index on left/right side
     -numtop, -numbot      index on top/bottom
     -match[=.]            use match base for 2..n species
     -inter[line=#]        blank line(s) between sequence blocks

This program requires a Java runtime (java or jre) program, version 1.1.x, 1.2 or later
The leading '-' on option is optional if '=' is present.  All 
(no leading '-' or embedded '=') are used as input file names.
These options and call format are compatible with the classic readseq 
* New experimental feature handling options, may not yet work as 
To test readseq, use: java -cp readseq.jar test
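
As a rough illustration of the call format in the usage text above, the
following Python sketch assembles a readseq command line from the long
option spellings listed there (-all, -format=, -output=). The helper name
build_readseq_command and the default jar path are assumptions made for
this example only; the sketch builds the argument list and does not run
Java.

```python
def build_readseq_command(input_files, out_format, output=None,
                          select_all=True, jar="readseq.jar"):
    """Assemble an argument list for the Java command-line readseq.

    Hypothetical helper: it follows the usage line
    'java -cp readseq.jar run [options] input-file(s)' from the help text.
    """
    if not input_files:
        raise ValueError("at least one input file is required")
    cmd = ["java", "-cp", jar, "run"]
    if select_all:
        cmd.append("-all")                  # -a[ll]: select all sequences
    cmd.append(f"-format={out_format}")     # -f[ormat=] takes a name or number
    if output is not None:
        cmd.append(f"-output={output}")     # -o[utput=]out.seq
    cmd.extend(input_files)                 # remaining arguments are input files
    return cmd


if __name__ == "__main__":
    # Print the assembled command instead of executing it.
    print(" ".join(build_readseq_command(["in.gb"], "Fasta", "out.seq")))
```

If readseq.jar and a Java runtime are present, the returned list could be
handed to subprocess.run; that step is left out here so the sketch stays
runnable without them.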

   Known biosequence formats:
  ID  Name             Read  Write  Int'leaf  Features  Sequence  Suffix
   1  IG|Stanford      yes    yes        --        --       yes   .ig
   2  GenBank|gb       yes    yes        --       yes       yes   .gb
   3  NBRF             yes    yes        --        --       yes   .nbrf
   4  EMBL|em          yes    yes        --       yes       yes   .embl
   5  GCG              yes    yes        --        --       yes   .gcg
   6  DNAStrider       yes    yes        --        --       yes   .strider  biosequence/strider
   7  Fitch             --     --        --        --       yes   .fitch
   8  Pearson|Fasta|fa   yes    yes        --        --       yes   .fasta  biosequence/fasta
   9  Zuker             --     --        --        --       yes   .zuker
  10  Olsen             --     --       yes        --       yes   .olsen
  11  Phylip3.2        yes    yes       yes        --       yes   .phylip2  biosequence/phylip2
  12  Phylip|Phylip4   yes    yes       yes        --       yes   .phylip  biosequence/phylip
  13  Plain|Raw        yes    yes        --        --       yes   .seq
  14  PIR|CODATA       yes    yes        --        --       yes   .pir
  15  MSF              yes    yes       yes        --       yes   .msf
  16  ASN.1             --     --        --        --       yes   .asn
  17  PAUP|NEXUS       yes    yes       yes        --       yes   .nexus
  18  Pretty            --    yes       yes        --       yes   .pretty  biosequence/pretty
  19  XML              yes    yes        --       yes       yes   .xml
  20  BLAST            yes     --       yes        --       yes   .blast
  21  SCF              yes     --        --        --       yes   .scf
  22  Clustal          yes    yes       yes        --       yes   .aln
  23  FlatFeat|FFF     yes    yes        --       yes        --   .fff
  24  GFF              yes    yes        --       yes        --   .gff
  25  ACEDB            yes    yes        --        --       yes   .ace
    (Int'leaf = interleaved format; Features = documentation/features are parsed)
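
The Suffix column of the table above can be turned into a simple lookup.
The Python sketch below copies the ID, name and suffix of each row and
returns the matching format for a file name. The function name
format_for_suffix is hypothetical, and this is only a suffix lookup for
illustration; it is not a description of how readseq itself identifies
formats.

```python
import os

# (ID, name, suffix) rows copied from the format table above.
FORMATS = [
    (1, "IG|Stanford", ".ig"), (2, "GenBank|gb", ".gb"),
    (3, "NBRF", ".nbrf"), (4, "EMBL|em", ".embl"),
    (5, "GCG", ".gcg"), (6, "DNAStrider", ".strider"),
    (7, "Fitch", ".fitch"), (8, "Pearson|Fasta|fa", ".fasta"),
    (9, "Zuker", ".zuker"), (10, "Olsen", ".olsen"),
    (11, "Phylip3.2", ".phylip2"), (12, "Phylip|Phylip4", ".phylip"),
    (13, "Plain|Raw", ".seq"), (14, "PIR|CODATA", ".pir"),
    (15, "MSF", ".msf"), (16, "ASN.1", ".asn"),
    (17, "PAUP|NEXUS", ".nexus"), (18, "Pretty", ".pretty"),
    (19, "XML", ".xml"), (20, "BLAST", ".blast"),
    (21, "SCF", ".scf"), (22, "Clustal", ".aln"),
    (23, "FlatFeat|FFF", ".fff"), (24, "GFF", ".gff"),
    (25, "ACEDB", ".ace"),
]


def format_for_suffix(filename):
    """Return (ID, name) for the file's suffix, or None if it is not listed."""
    ext = os.path.splitext(filename)[1].lower()  # case-insensitive match
    for fmt_id, name, suffix in FORMATS:
        if suffix == ext:
            return fmt_id, name
    return None


if __name__ == "__main__":
    print(format_for_suffix("example.gb"))
```

The returned ID or name could then be supplied to the -inform= or
-format= options described in the help text.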

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu -- http://marmot.bio.indiana.edu/
