[Biococoa-dev] Even more on sequence formats
don gilbert
gilbertd at indiana.edu
Fri Jul 14 13:16:34 EDT 2006
Dear biococoa folks,
> Finally, to come back to the question on the formats, perhaps we
> can learn from a classic sequence reader package called ReadSeq by
> d.g.gilbert.
> I'm not sure where it can be found nowadays, so I put it
> temporarily on our server for you guys to download:
Readseq is available, as it has been for ~15 years, from
ftp://iubio.bio.indiana.edu/molbio/readseq/ but see the readseq/java/
version, which I still update, though I haven't had time recently to add
new formats. The Java version has more/better formats and documentation
(and parser fixes), but is also more complex than the C version.
The help document you list is from the 1992 C version, which I don't
update.
Also,
----------------
PUBLIC DOMAIN NOTICE:
This software is freely available to the public for use. The
author, Don Gilbert, of Readseq and the Java package
'iubio.readseq' does not place any restriction on its use or
reproduction. Developers are encouraged to incorporate parts in
their programs. I would appreciate being cited in any work or
product based on this material. This software is provided without
warranty of any kind.
------------------
> I guess we have most of them, except for IG, NBRF, Fitch, Zuker,
> Olsen, ASN.1.
With the exception of NCBI's ASN.1, which also requires the NCBI toolkit
to be linked in with readseq, the others are essentially obsolete unless
one still uses old 1990s-era molbio software: IG = Intelligenetics;
NBRF = a PIR variant; Fitch is from some classic molbio software
(1970s or 1980s?) but is a poor format; Zuker is from Michael Zuker's
MULFOLD RNA folding software; Olsen is from Gary Olsen's phylogeny
software.
Please use readseq as desired,
Don Gilbert
oat.% readseq
Readseq version 2.1.24 (24-May-2006)
Read & reformat biosequences, Java command-line version
Usage: java -cp readseq.jar run [options] input-file(s)
For more details: java -cp readseq.jar help more
Options
  -a[ll]              select All sequences
  -c[aselower]        change to lower case
  -C[ASEUPPER]        change to UPPER CASE
  -ch[ecksum]         calculate & print checksum of sequences
  -degap[=-]          remove gap symbols
  -f[ormat=]#         Format number for output, or
  -f[ormat=]Name      Format name for output
                      see Formats list below for names and numbers
  -inform[at]=#       input format number, or
  -inform[at]=Name    input format name. Assume input data is this format
  -i[tem=2,3,4]       select Item number(s) from several
  -l[ist]             List sequences only
  -o[utput=]out.seq   redirect Output
  -p[ipe]             Pipe (command line, < stdin, > stdout)
  -r[everse]          reverse-complement of input sequence
  -t[ranslate=]io     translate input symbol [i] to output symbol [o]
                      use several -tio to translate several symbols
  -v[erbose]          Verbose progress
  -compare=1          Compare two sequence files, reporting
                      differences (flags=nodoc,noid,nolen,nocrc)
  -amino[translate]   translate dna to amino acids
Documentation and Feature Table extraction:
  -feat[ures]=exon,CDS...     extract sequence of selected features
  -nofeat[ures]=repeat_region,intron...
                              remove sequence of selected features
  -field=AC,ID...             include selected document fields in output
  -nofield=COMMENT,...        remove selected document fields from output
  -extract=1000..9999       * extract all features, sequence from given
                              base range
  -subrange=-1000..10       * extract subrange of sequence for feature
                              locations
  -subrange=1..end
  -subrange=end-10..end+99
  -pair=1                   * combine features (fff,gff) and sequence
                              files to one output
  -unpair=1                 * split features,sequence from one input to
                              two files
Pretty format options:
  -wid[th]=#                  sequence line width
  -tab=#                      left indent
  -col[space]=#               column space within sequence line on output
  -gap[count]                 count gap chars in sequence numbers
  -nameleft, -nameright[=#]   name on left/right side [=max width]
  -nametop                    name at top/bottom
  -numleft, -numright         seq index on left/right side
  -numtop, -numbot            index on top/bottom
  -match[=.]                  use match base for 2..n species
  -inter[line=#]              blank line(s) between sequence blocks
This program requires a Java runtime (java or jre) program, version
1.1.x, 1.2 or later.
The leading '-' on option is optional if '=' is present. All non-options
(no leading '-' or embedded '=') are used as input file names.
These options and call format are compatible with the classic readseq
(v.1992).
* New experimental feature handling options, may not yet work as
desired.
To test readseq, use: java -cp readseq.jar test
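
The option rule stated in the help text (a leading '-' is optional when
'=' is present, and any word with neither is an input file name) can be
sketched in a few lines of Python. This is only an illustration of the
rule as written; the function name classify_args is hypothetical, and it
is not readseq's actual parser, which may handle edge cases differently.

```python
def classify_args(args):
    """Split command-line words into options and input files.

    Follows the rule in the readseq help text: a word is an option if it
    starts with '-' or contains '='; everything else is an input file.
    """
    options, inputs = [], []
    for word in args:
        if word.startswith("-") or "=" in word:
            options.append(word)
        else:
            inputs.append(word)
    return options, inputs


if __name__ == "__main__":
    opts, files = classify_args(["-all", "format=fasta", "a.gb", "b.embl"])
    print(opts)   # ['-all', 'format=fasta']
    print(files)  # ['a.gb', 'b.embl']
```

Under this rule, "format=fasta" counts as an option even without the
leading dash, while "a.gb" and "b.embl" are taken as input files.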
Known biosequence formats:
 ID  Name              Read  Write  Int'leaf  Features  Sequence  Suffix    Content-type
  1  IG|Stanford       yes   yes    --        --        yes       .ig       biosequence/ig
  2  GenBank|gb        yes   yes    --        yes       yes       .gb       biosequence/genbank
  3  NBRF              yes   yes    --        --        yes       .nbrf     biosequence/nbrf
  4  EMBL|em           yes   yes    --        yes       yes       .embl     biosequence/embl
  5  GCG               yes   yes    --        --        yes       .gcg      biosequence/gcg
  6  DNAStrider        yes   yes    --        --        yes       .strider  biosequence/strider
  7  Fitch             --    --     --        --        yes       .fitch    biosequence/fitch
  8  Pearson|Fasta|fa  yes   yes    --        --        yes       .fasta    biosequence/fasta
  9  Zuker             --    --     --        --        yes       .zuker    biosequence/zuker
 10  Olsen             --    --     yes       --        yes       .olsen    biosequence/olsen
 11  Phylip3.2         yes   yes    yes       --        yes       .phylip2  biosequence/phylip2
 12  Phylip|Phylip4    yes   yes    yes       --        yes       .phylip   biosequence/phylip
 13  Plain|Raw         yes   yes    --        --        yes       .seq      biosequence/plain
 14  PIR|CODATA        yes   yes    --        --        yes       .pir      biosequence/codata
 15  MSF               yes   yes    yes       --        yes       .msf      biosequence/msf
 16  ASN.1             --    --     --        --        yes       .asn      biosequence/asn1
 17  PAUP|NEXUS        yes   yes    yes       --        yes       .nexus    biosequence/nexus
 18  Pretty            --    yes    yes       --        yes       .pretty   biosequence/pretty
 19  XML               yes   yes    --        yes       yes       .xml      biosequence/xml
 20  BLAST             yes   --     yes       --        yes       .blast    biosequence/blast
 21  SCF               yes   --     --        --        yes       .scf      biosequence/scf
 22  Clustal           yes   yes    yes       --        yes       .aln      biosequence/clustal
 23  FlatFeat|FFF      yes   yes    --        yes       --        .fff      biosequence/fff
 24  GFF               yes   yes    --        yes       --        .gff      biosequence/gff
 25  ACEDB             yes   yes    --        --        yes       .ace      biosequence/acedb
(Int'leaf = interleaved format; Features = documentation/features are parsed)
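
For anyone wanting to use the format list programmatically, the ID,
Name and Suffix columns can be turned into a small lookup. The Python
sketch below copies a few rows from the list above; the names
SUFFIX_TO_FORMAT and guess_format are hypothetical, and no claim is made
about how readseq itself detects input formats.

```python
import os

# Suffix -> (format ID, first listed name), copied from the format list.
# Only a handful of rows are included to keep the sketch short.
SUFFIX_TO_FORMAT = {
    ".gb": (2, "GenBank"),
    ".embl": (4, "EMBL"),
    ".fasta": (8, "Pearson"),
    ".phylip": (12, "Phylip"),
    ".nexus": (17, "PAUP"),
    ".aln": (22, "Clustal"),
    ".gff": (24, "GFF"),
}


def guess_format(filename):
    """Return (id, name) for a known suffix, or None if unrecognized."""
    suffix = os.path.splitext(filename)[1].lower()
    return SUFFIX_TO_FORMAT.get(suffix)


if __name__ == "__main__":
    print(guess_format("seqs.FASTA"))  # (8, 'Pearson')
    print(guess_format("notes.txt"))   # None
```

A suffix is only a hint, of course; a file named .seq or .txt could hold
any of these formats.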
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu -- http://marmot.bio.indiana.edu/