[Biococoa-dev] Even more on sequence formats
Alexander Griekspoor
mekentosj at gmail.com
Wed Jul 19 06:05:56 EDT 2006
Hi Don,
Many thanks for your comments and suggestions! Peter had already
implemented most of the popular sequence formats in the original
BioCocoa framework, but we'll have a look at your Java framework to
see if we could add or improve relevant sequence formats in BioCocoa.
Again, many thanks for the feedback and keep up the good work!
Best wishes,
Alex
On 14-jul-2006, at 19:16, don gilbert wrote:
> Dear biococoa folks,
>
> > Finally, to come back to the question on the formats, perhaps we
> > can learn from a classic sequence reader package called ReadSeq by
> > d.g.gilbert.
> > I'm not sure where it can be found nowadays, so I put it
> > temporarily on our server for you guys to download:
>
> Readseq is available, as it has been for ~15 years, from
> ftp://iubio.bio.indiana.edu/molbio/readseq/ but see the readseq/java/
> version, which I still update, though I haven't had time recently to add
> new formats. The Java version has more/better formats and documentation
> (and parser fixes), but is also more complex than the C version.
>
> The help document you list is from the 1992 C version, which I
> don't update.
> Also,
> ----------------
> PUBLIC DOMAIN NOTICE:
> This software is freely available to the public for use. The
> author, Don Gilbert, of Readseq and the Java package
> 'iubio.readseq' does not place any restriction on its use or
> reproduction. Developers are encouraged to incorporate parts in
> their programs. I would appreciate being cited in any work or
> product based on this material. This software is provided without
> warranty of any kind.
> ------------------
>
> > I guess we have most of them, except for IG, NBRF, Fitch, Zuker,
> > Olsen, ASN.1.
> With the exception of NCBI's ASN.1, which also requires the NCBI toolkit
> linked in with readseq, the others are essentially obsolete unless one
> still uses old 1990s-era molbio software (IG = Intelligenetics; NBRF = a
> PIR variant; Fitch from some classic molbio software (1970s? or 1980s)
> but a poor format; Zuker from Michael Zuker's MULFOLD RNA folding
> software; Olsen from Gary Olsen's phylogeny software).
>
> Please use readseq as desired,
> Don Gilbert
>
> oat.% readseq
> Readseq version 2.1.24 (24-May-2006)
>
> Read & reformat biosequences, Java command-line version
> Usage: java -cp readseq.jar run [options] input-file(s)
> For more details: java -cp readseq.jar help more
>
> Options
>   -a[ll]              select All sequences
>   -c[aselower]        change to lower case
>   -C[ASEUPPER]        change to UPPER CASE
>   -ch[ecksum]         calculate & print checksum of sequences
>   -degap[=-]          remove gap symbols
>   -f[ormat=]#         Format number for output, or
>   -f[ormat=]Name      Format name for output
>                       see Formats list below for names and numbers
>   -inform[at]=#       input format number, or
>   -inform[at]=Name    input format name. Assume input data is this format
>   -i[tem=2,3,4]       select Item number(s) from several
>   -l[ist]             List sequences only
>   -o[utput=]out.seq   redirect Output
>   -p[ipe]             Pipe (command line, < stdin, > stdout)
>   -r[everse]          reverse-complement of input sequence
>   -t[ranslate=]io     translate input symbol [i] to output symbol [o]
>                       use several -tio to translate several symbols
>   -v[erbose]          Verbose progress
>   -compare=1          Compare two sequence files, reporting differences
>                       (flags=nodoc,noid,nolen,nocrc)
>   -amino[translate]   translate dna to amino acids
>
> Documentation and Feature Table extraction:
>   -feat[ures]=exon,CDS...                 extract sequence of selected features
>   -nofeat[ures]=repeat_region,intron...   remove sequence of selected features
>   -field=AC,ID...                         include selected document fields in output
>   -nofield=COMMENT,...                    remove selected document fields from output
>
>   -extract=1000..9999        * extract all features, sequence from given base range
>   -subrange=-1000..10        * extract subrange of sequence for feature locations
>   -subrange=1..end
>   -subrange=end-10..end+99
>   -pair=1                    * combine features (fff,gff) and sequence files to one output
>   -unpair=1                  * split features,sequence from one input to two files
>
> Pretty format options:
>   -wid[th]=#                  sequence line width
>   -tab=#                      left indent
>   -col[space]=#               column space within sequence line on output
>   -gap[count]                 count gap chars in sequence numbers
>   -nameleft, -nameright[=#]   name on left/right side [=max width]
>   -nametop                    name at top/bottom
>   -numleft, -numright         seq index on left/right side
>   -numtop, -numbot            index on top/bottom
>   -match[=.]                  use match base for 2..n species
>   -inter[line=#]              blank line(s) between sequence blocks
>
> This program requires a Java runtime (java or jre) program, version 1.1.x,
> 1.2 or later
> The leading '-' on option is optional if '=' is present. All non-options
> (no leading '-' or embedded '=') are used as input file names.
> These options and call format are compatible with the classic readseq
> (v.1992)
> * New experimental feature handling options, may not yet work as desired.
> To test readseq, use: java -cp readseq.jar test
>
> Known biosequence formats:
> ID  Name              Read  Write  Int'leaf  Features  Sequence  Suffix    Content-type
> 1   IG|Stanford       yes   yes    --        --        yes       .ig       biosequence/ig
> 2   GenBank|gb        yes   yes    --        yes       yes       .gb       biosequence/genbank
> 3   NBRF              yes   yes    --        --        yes       .nbrf     biosequence/nbrf
> 4   EMBL|em           yes   yes    --        yes       yes       .embl     biosequence/embl
> 5   GCG               yes   yes    --        --        yes       .gcg      biosequence/gcg
> 6   DNAStrider        yes   yes    --        --        yes       .strider  biosequence/strider
> 7   Fitch             --    --     --        --        yes       .fitch    biosequence/fitch
> 8   Pearson|Fasta|fa  yes   yes    --        --        yes       .fasta    biosequence/fasta
> 9   Zuker             --    --     --        --        yes       .zuker    biosequence/zuker
> 10  Olsen             --    --     yes       --        yes       .olsen    biosequence/olsen
> 11  Phylip3.2         yes   yes    yes       --        yes       .phylip2  biosequence/phylip2
> 12  Phylip|Phylip4    yes   yes    yes       --        yes       .phylip   biosequence/phylip
> 13  Plain|Raw         yes   yes    --        --        yes       .seq      biosequence/plain
> 14  PIR|CODATA        yes   yes    --        --        yes       .pir      biosequence/codata
> 15  MSF               yes   yes    yes       --        yes       .msf      biosequence/msf
> 16  ASN.1             --    --     --        --        yes       .asn      biosequence/asn1
> 17  PAUP|NEXUS        yes   yes    yes       --        yes       .nexus    biosequence/nexus
> 18  Pretty            --    yes    yes       --        yes       .pretty   biosequence/pretty
> 19  XML               yes   yes    --        yes       yes       .xml      biosequence/xml
> 20  BLAST             yes   --     yes       --        yes       .blast    biosequence/blast
> 21  SCF               yes   --     --        --        yes       .scf      biosequence/scf
> 22  Clustal           yes   yes    yes       --        yes       .aln      biosequence/clustal
> 23  FlatFeat|FFF      yes   yes    --        yes       --        .fff      biosequence/fff
> 24  GFF               yes   yes    --        yes       --        .gff      biosequence/gff
> 25  ACEDB             yes   yes    --        --        yes       .ace      biosequence/acedb
> (Int'leaf = interleaved format; Features = documentation/features are parsed)
>
> -- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
> -- gilbertd at indiana.edu -- http://marmot.bio.indiana.edu/
>
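
As a small illustration of the usage text quoted above, the following Python sketch assembles a readseq command line from a script. It is only a sketch under stated assumptions: it assumes readseq.jar is in the current directory, it uses only the -all, -format= and -output= spellings that appear in the quoted help output, and the helper name readseq_command and the file names input.gb and out.fasta are hypothetical. It builds the argument list but does not run Java.

```python
import shlex

# Format numbers copied from the "Known biosequence formats" table in the
# quoted help output (only a handful are listed here).
FORMAT_IDS = {"genbank": 2, "embl": 4, "fasta": 8, "phylip": 12, "nexus": 17}


def readseq_command(inputs, out_format, output, jar="readseq.jar", all_seqs=True):
    """Assemble (but do not run) a readseq command line.

    Hypothetical helper, not part of readseq: it relies only on the
    -all, -format= and -output= options shown in the usage text.
    """
    fmt = FORMAT_IDS[out_format.lower()]  # KeyError for formats not listed above
    cmd = ["java", "-cp", jar, "run"]
    if all_seqs:
        cmd.append("-all")  # select all sequences in the input
    cmd += [f"-format={fmt}", f"-output={output}"]
    cmd += list(inputs)  # everything without '-' or '=' is an input file
    return cmd


if __name__ == "__main__":
    # Assumed file names; prints the command so it can be inspected first.
    print(shlex.join(readseq_command(["input.gb"], "fasta", "out.fasta")))
```

The resulting list can be passed to subprocess.run once the jar is in place; keeping the arguments as a list avoids shell quoting problems with file names.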
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
Microsoft is not the answer,
Microsoft is the question,
NO is the answer
*********************************************************