[Biococoa-dev] Even more on sequence formats

don gilbert gilbertd at indiana.edu
Fri Jul 14 13:16:34 EDT 2006


Dear biococoa folks,

 > Finally, to come back to the question on the formats, perhaps we
 > can learn from a classic sequence reader package called ReadSeq by
 > d.g.gilbert.
 > I'm not sure where it can be found nowadays, so I put it
 > temporarily on our server for you guys to download:

Readseq is available as it has been for ~ 15 years from
ftp://iubio.bio.indiana.edu/molbio/readseq/  but see the readseq/java/ 
version
which I still update, but haven't had time recently to add new formats. 
  The
java version has more/better formats and documentation (and parser 
fixes),
but is also more complex than the C version.

The help document you list is from the 1992 C version, which I don't 
update.
Also,
----------------
PUBLIC DOMAIN NOTICE:
This software is freely available to the public for use. The
author, Don Gilbert, of Readseq and the Java package
'iubio.readseq' does not place any restriction on its use or
reproduction. Developers are encourged to incorporate parts in
their programs. I would appreciate being cited in any work or
product based on this material. This software is provided without
warranty of any kind.
------------------

 > I guess we have most of them, except for IG, NBRF, Fitch, Zuker,
 > Olsen, ASN.1.
With exception of NCBI's ASN.1, which requires also the NCBI toolkit
linked in with readseq, the others are essentially obsolete unless one
still uses old 1990's era molbio software (IG = Intelligenetics; NBRF = 
a PIR variant;
Fitch from some classic molbio software (1970s? or 1980s) but a poor 
format;
Zuker from Michael Zuker's MULFOLD rna folding software, Olsen from 
Gary Olsen's
phylogeny software).

Please use readseq as desired,
Don Gilbert

oat.% readseq
Readseq version 2.1.24 (24-May-2006)

   Readseq version 2.1.24 (24-May-2006)

   Read & reformat biosequences, Java command-line version
   Usage: java -cp readseq.jar run [options] input-file(s)
   For more details: java -cp readseq.jar help more

   Options
     -a[ll]              select All sequences
     -c[aselower]        change to lower case
     -C[ASEUPPER]        change to UPPER CASE
     -ch[ecksum]         calculate & print checksum of sequences
     -degap[=-]          remove gap symbols
     -f[ormat=]#         Format number for output,  or
     -f[ormat=]Name      Format name for output
           see Formats   list below for names and numbers
     -inform[at]=#       input format number,  or
     -inform[at]=Name    input format name.  Assume input data is this 
format
     -i[tem=2,3,4]       select Item number(s) from several
     -l[ist]             List sequences only
     -o[utput=]out.seq   redirect Output
     -p[ipe]             Pipe (command line, < stdin, > stdout)
     -r[everse]          reverse-complement of input sequence
     -t[ranslate=]io     translate input symbol [i] to output symbol [o]
                         use several -tio to translate several symbols
     -v[erbose]          Verbose progress
     -compare=1          Compare two sequence files, reporting 
differences (flags=nodoc,noid,nolen,nocrc)
     -amino[translate]   translate dna to amino acids

    Documentation and Feature Table extraction:
     -feat[ures]=exon,CDS...   extract sequence of selected features
     -nofeat[ures]=repeat_region,intron... remove sequence of selected 
features
     -field=AC,ID...      include selected document fields in output
     -nofield=COMMENT,... remove selected document fields from output

     -extract=1000..9999  * extract all features, sequence from given 
base range
     -subrange=-1000..10  * extract subrange of sequence for feature 
locations
     -subrange=1..end
     -subrange=end-10..end+99
     -pair=1              * combine features (fff,gff) and sequence 
files to one output
     -unpair=1            * split features,sequence from one input to 
two files

    Pretty format options:
     -wid[th]=#            sequence line width
     -tab=#                left indent
     -col[space]=#         column space within sequence line on output
     -gap[count]           count gap chars in sequence numbers
     -nameleft, -nameright[=#]   name on left/right side [=max width]
     -nametop              name at top/bottom
     -numleft, -numright   seq index on left/right side
     -numtop, -numbot      index on top/bottom
     -match[=.]            use match base for 2..n species
     -inter[line=#]        blank line(s) between sequence blocks

This program requires a Java runtime (java or jre) program, version 
1.1.x, 1.2 or later
The leading '-' on option is optional if '=' is present.  All 
non-options
(no leading '-' or embedded '=') are used as input file names.
These options and call format are compatible with the classic readseq 
(v.1992)
* New experimental feature handling options, may not yet work as 
desired.
To test readeq, use: java -cp readseq.jar test

   Known biosequence formats:
  ID  Name             Read  Write  Int'leaf  Features  Sequence  Suffix 
  Content-type
   1  IG|Stanford      yes    yes        --        --       yes   .ig    
  biosequence/ig
   2  GenBank|gb       yes    yes        --       yes       yes   .gb    
  biosequence/genbank
   3  NBRF             yes    yes        --        --       yes   .nbrf  
  biosequence/nbrf
   4  EMBL|em          yes    yes        --       yes       yes   .embl  
  biosequence/embl
   5  GCG              yes    yes        --        --       yes   .gcg   
  biosequence/gcg
   6  DNAStrider       yes    yes        --        --       yes   
.strider  biosequence/strider
   7  Fitch             --     --        --        --       yes   .fitch 
  biosequence/fitch
   8  Pearson|Fasta|fa   yes    yes        --        --       yes   
.fasta  biosequence/fasta
   9  Zuker             --     --        --        --       yes   .zuker 
  biosequence/zuker
  10  Olsen             --     --       yes        --       yes   .olsen 
  biosequence/olsen
  11  Phylip3.2        yes    yes       yes        --       yes   
.phylip2  biosequence/phylip2
  12  Phylip|Phylip4   yes    yes       yes        --       yes   
.phylip  biosequence/phylip
  13  Plain|Raw        yes    yes        --        --       yes   .seq   
  biosequence/plain
  14  PIR|CODATA       yes    yes        --        --       yes   .pir   
  biosequence/codata
  15  MSF              yes    yes       yes        --       yes   .msf   
  biosequence/msf
  16  ASN.1             --     --        --        --       yes   .asn   
  biosequence/asn1
  17  PAUP|NEXUS       yes    yes       yes        --       yes   .nexus 
  biosequence/nexus
  18  Pretty            --    yes       yes        --       yes   
.pretty  biosequence/pretty
  19  XML              yes    yes        --       yes       yes   .xml   
  biosequence/xml
  20  BLAST            yes     --       yes        --       yes   .blast 
  biosequence/blast
  21  SCF              yes     --        --        --       yes   .scf   
  biosequence/scf
  22  Clustal          yes    yes       yes        --       yes   .aln   
  biosequence/clustal
  23  FlatFeat|FFF     yes    yes        --       yes        --   .fff   
  biosequence/fff
  24  GFF              yes    yes        --       yes        --   .gff   
  biosequence/gff
  25  ACEDB            yes    yes        --        --       yes   .ace   
  biosequence/acedb
    (Int'leaf = interleaved format; Features = documentation/features 
are parsed)

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu -- http://marmot.bio.indiana.edu/




More information about the Biococoa-dev mailing list