Genpak

April 2000

NAME

Genpak - utilities to manipulate DNA sequences

Copyright (C) 2000 January Weiner III <jw3@gyral.com>

Genpak homepage:

http://www.rzuser.uni-heidelberg.de/~jweiner1/Genpak

WARNING!

This is a prerelease only. My goal is to see whether anyone is interested in this type of tool. Some files are still missing, and others do not work properly yet. In a few days I will post an update on freshmeat.

LICENSE

Genpak is GPL'ed. Please read the file LICENSE.TXT for details.

DESCRIPTION

Genpak is a set of small utilities written in ANSI C to manipulate DNA sequences in a Unix fashion, suitable for combining within shell and CGI scripts. Some example CGI scripts are provided in the cgi directory. I wrote these utilities for myself and found them very useful for my work; they are fast and quite reliable, and working with large numbers of sequences is much more convenient than with standard GUI tools. Feel free to mail me bug reports and suggestions.

The sequences are usually in FASTA format: the first line holds the sequence name, starting with ">", and the sequence itself follows on the next lines.
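To illustrate the format, here is a minimal Python sketch of a reader for multi-sequence FASTA input. It is not Genpak code (Genpak itself is written in C); the function name read_fasta is an assumption made for this example only.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)

example = io.StringIO(">seq1 test\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(example))
print(records)
```

Sequence lines are simply concatenated, so a record may be wrapped at any width.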

Upon installation, Genpak creates a directory where it stores all its data; by default this is /usr/lib/genpak. If one of the programs cannot find a file given as an argument, it looks for it in this directory, and exits with an error only if the file is not there either. You can put shared files into this directory; they will not be erased when Genpak is uninstalled or reinstalled. However, on reinstallation they may well be overwritten if you replaced the original Genpak files with your own.
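The lookup order described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual C implementation; the function name resolve_data_file and the file name example_table.cdn are hypothetical.

```python
import os
import tempfile

GENPAK_DIR = "/usr/lib/genpak"  # default data directory named in the text

def resolve_data_file(filename, data_dir=GENPAK_DIR):
    """Return the path to open for `filename`, trying the data directory last."""
    if os.path.exists(filename):
        return filename  # found where the user pointed
    fallback = os.path.join(data_dir, filename)
    if os.path.exists(fallback):
        return fallback  # found in the shared data directory
    raise FileNotFoundError(filename)  # only now give up with an error

# Demonstration: a temporary directory stands in for the data directory.
with tempfile.TemporaryDirectory() as data_dir:
    table = os.path.join(data_dir, "example_table.cdn")
    open(table, "w").close()
    found = resolve_data_file("example_table.cdn", data_dir=data_dir)
    print(found == table)
```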

All programs share some common options:

  • -h prints out a quick summary of options

  • -v prints version information

  • -q suppresses all error messages ("quiet")

  • -d prints out debugging information

Most programs also accept standard input (that was one of the main reasons I wrote these utilities), and by default write their results to standard output. This gives you several ways of running the programs:

    cat sequence.fasta | program > program.output

    some_other_program | program | yet_another_program

    program input.file output.file

    program

In the last case, you have to type in any data the program expects on standard input, and the program prints the processed data directly to the screen.

In most cases, you can use multiple sequences stored in one file in FASTA format. Programs that require a sequence file keep working until all sequences that can be read from the input (a file or standard input) have been processed.

LIST OF PROGRAMS

  • gp_qs
  • gp_getseq
  • gp_gc
  • gp_tm
  • gp_matrix
  • gp_randseq
  • gp_cusage
  • gp_seq2prot
  • gp_findorf
  • gp_slen
  • pars

Here are short descriptions of the programs. See their respective manual pages or the HTML documentation for more information.

  • gp_qs

    Quickly finds a sequence within a larger sequence and prints out the positions. Sometimes you just don't need BLAST, for example when you only want to know exactly where your primer binds in a given sequence. You can either type the query sequence directly as a command line argument, like

    gp_qs ACTGACTG [sequence filename]

    or give a filename on the command line as the argument instead.
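    The kind of search gp_qs performs can be sketched in a few lines of Python. This is an assumption-laden illustration, not the program's algorithm: it reports 1-based positions of exact matches on the given strand only, and the name find_all is made up for the example.

```python
def find_all(query, sequence):
    """Return 1-based start positions of every (possibly overlapping) exact match."""
    query, sequence = query.upper(), sequence.upper()
    positions = []
    start = sequence.find(query)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = sequence.find(query, start + 1)
    return positions

hits = find_all("ACTG", "TTACTGACTGAA")
print(hits)  # [3, 7]
```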

  • gp_getseq

    Quickly retrieves a sequence fragment. Usage is simple:

    gp_getseq Position1 Position2 [sequence filename]

    Position1 is the number of the first base to be retrieved, and Position2 is the last base to be retrieved. Note that if Position2 < Position1, the retrieved sequence is complementary to the fragment Position1...Position2.
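    A hedged Python sketch of this behaviour follows. It assumes 1-based, inclusive positions and that the complementary fragment is returned as the reverse complement (read from Position1 down to Position2); get_fragment is a hypothetical name, not part of Genpak.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def get_fragment(sequence, pos1, pos2):
    """Return bases pos1..pos2 (1-based, inclusive); reverse strand if pos2 < pos1."""
    if pos1 <= pos2:
        return sequence[pos1 - 1:pos2]
    fragment = sequence[pos2 - 1:pos1]
    # Assumption: the opposite strand is read from pos1 down to pos2.
    return fragment.translate(COMPLEMENT)[::-1]

seq = "ATGCCGTA"
forward = get_fragment(seq, 2, 5)
reverse = get_fragment(seq, 5, 2)
print(forward, reverse)  # TGCC GGCA
```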

  • gp_gc

    Prints out the GC content of a given sequence or sequences. Can also compute the mean and standard error (SE) for a larger number of sequences.
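    As a rough illustration, GC content and its mean and standard error over several sequences could be computed like this in Python. The sketch assumes SE means the sample standard deviation divided by the square root of the number of sequences; the function names are invented for the example.

```python
import math

def gc_content(sequence):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def mean_and_se(values):
    """Return (mean, standard error of the mean) for a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)

contents = [gc_content(s) for s in ("ATAG", "ATGC", "GGCA")]
mean, se = mean_and_se(contents)
print(contents, mean, se)
```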

  • gp_tm

    Prints out the Tm (melting temperature) of a given sequence. Note, however, that the algorithm currently used by gp_tm is "4 * GC + 2 * AT", which can be misleading and is not exact. I'm working on a better version using the nearest neighbour method.
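    The formula quoted above (often called the Wallace rule) is easy to write down. The Python sketch below only restates that formula; the function name wallace_tm is an assumption, and the result is a rough estimate in degrees Celsius for short oligonucleotides.

```python
def wallace_tm(sequence):
    """Estimate Tm as 4 * (number of G and C) + 2 * (number of A and T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at

tm = wallace_tm("ACTGACTG")
print(tm)  # 24
```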

  • gp_matrix

    gp_matrix is a program to look for promoters in a set of sequence files, using the Staden matrix (see: Hertz, G. and Stormo, G.D. 1996. Escherichia coli promoter sequences: analysis and prediction. Meth. Enzym. 273). Basically, you have a matrix file containing scores and penalties for nucleotides at different positions in the supposed -35 and -10 boxes, as well as in the +1 region of a given sequence (see the file "matryca" in the data/ directory, which is the same as the E. coli matrix published in Hertz et al.).

    The program loads sequences from the sequence file and then scans each of them, trying all possible combinations of gap lengths between the -35, -10 and +1 boxes at all possible positions in the sequence, so as to find the combination that gives the highest score for the sequence. It then prints formatted output in the following form:

    #score sequence...[-35 core]...[-10 core]...[start]...

    The '|' characters denote the boundaries of matrix'ed fragments.

    In the "data" directory you will find the original Staden E. coli matrix.
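    The exhaustive search over positions and gap lengths can be sketched in Python as below. This is a toy under stated assumptions, not gp_matrix itself: the matrices are made-up +1/-1 scores built from a consensus string (not the Staden values), only two boxes are used, and all names are hypothetical.

```python
from itertools import product

def toy_matrix(consensus):
    """Made-up scores: +1 for the consensus base at each position, -1 otherwise."""
    return [{b: (1 if b == c else -1) for b in "ACGT"} for c in consensus]

def box_score(matrix, sequence, start):
    """Sum the per-position scores of `matrix` aligned at 0-based `start`."""
    return sum(column[sequence[start + i]] for i, column in enumerate(matrix))

def best_placement(sequence, boxes, gap_ranges):
    """Try every start position and gap combination; return (score, box starts)."""
    best = None
    for first in range(len(sequence)):
        for gaps in product(*gap_ranges):
            starts, pos = [], first
            for box, gap in zip(boxes, (0,) + gaps):
                pos += gap          # skip the gap before this box
                starts.append(pos)
                pos += len(box)     # move past the box
            if pos > len(sequence):
                continue            # last box would run past the end
            score = sum(box_score(b, sequence, s) for b, s in zip(boxes, starts))
            if best is None or score > best[0]:
                best = (score, starts)
    return best

boxes = [toy_matrix("TTG"), toy_matrix("TAT")]
result = best_placement("CCTTGACCCTATCC", boxes, [range(3, 6)])
print(result)  # (6, [2, 9])
```

    A third matrix and a second gap range would be added in the same way to cover the +1 region.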

  • gp_randseq

    Unless the option -r is set, gp_randseq prints out random fragments from a sequence file. The default fragment length is 100; you can change it with the option -l length. If you set -r, however, completely random sequences are generated. You can set their GC content with the option -g value. There is also an option -m, which stands for "Markov chains", but all it does is ensure that the probability of selecting a nucleotide depends on the previous nucleotide; these probabilities are also taken from a sequence file.
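    Two of the ideas above, random sequences with a chosen GC content and a first-order Markov chain estimated from a template sequence, can be sketched in Python. The function names and the handling of edge cases are assumptions for illustration and do not describe how gp_randseq is implemented.

```python
import random

def random_sequence(length, gc=0.5, rng=random):
    """Random sequence whose expected GC fraction is `gc`."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def markov_sequence(length, template, rng=random):
    """Each base is drawn according to what follows the previous base in `template`."""
    template = [b for b in template.upper() if b in "ACGT"]
    if length <= 0 or not template:
        return ""
    counts = {a: {b: 0 for b in "ACGT"} for a in "ACGT"}
    for prev, cur in zip(template, template[1:]):
        counts[prev][cur] += 1  # tally observed transitions
    seq = [rng.choice(template)]
    while len(seq) < length:
        row = counts[seq[-1]]
        if sum(row.values()) == 0:
            seq.append(rng.choice("ACGT"))  # no data for this base: uniform fallback
        else:
            seq.append(rng.choices("ACGT", weights=[row[b] for b in "ACGT"])[0])
    return "".join(seq)

rng = random.Random(0)
gc_only = random_sequence(50, gc=1.0, rng=rng)
alternating = markov_sequence(30, "ACACACACAC", rng)
print(gc_only, alternating)
```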

  • gp_seq2prot

    Converts a nucleotide sequence to a protein sequence. The sequence must start with a start codon: this is mandatory. A missing stop codon or a premature end of the input sequence (for example, in the middle of a codon) results only in a warning message.

    You can provide your own codon tables; for the format of the codon file, look at data/standard.cdn and data/myco.cdn. You need not provide the whole table; it is enough to list the differences. To see a codon file, type gp_seq2prot -p.
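    The idea of a standard table plus a short list of differences can be sketched in Python. This is not the codon file format (see the .cdn files for that); the dictionary of overrides, the function name translate, and the choice to skip the start codon check are assumptions. The example override, TGA read as tryptophan, is the well-known difference in the Mycoplasma genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (last base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(sequence, overrides=None):
    """Translate up to the first stop codon; `overrides` lists only the differences."""
    table = dict(STANDARD)
    table.update(overrides or {})
    seq = sequence.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):  # an incomplete trailing codon is ignored
        amino = table[seq[i:i + 3]]      # raises KeyError on non-ACGT codons
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

standard = translate("ATGTGGTGAAAATAA")
myco = translate("ATGTGGTGAAAATAA", overrides={"TGA": "W"})
print(standard, myco)  # MW MWWK
```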

  • gp_findorf

    Prints out all ORFs contained in a sequence. gp_findorf always looks for the longest ORF within the given limit. See also the notes for gp_seq2prot.
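    One simple reading of "the longest ORF" is to start each ORF at the first ATG after the previous in-frame stop codon. The Python sketch below follows that reading on the forward strand only, with the standard stop codons and 1-based inclusive coordinates; these choices and the name find_orfs are assumptions, not a description of gp_findorf.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=1):
    """Return sorted (start, end) coordinates of ORFs, stop codon included."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # first ATG after the last stop gives the longest ORF
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

orfs = find_orfs("CCATGAAACCCTAGTT")
print(orfs)  # [(3, 14)]
```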

  • gp_cusage

    Prints out the codon usage of the sequence(s). The options are the same as for gp_seq2prot; actually, this *is* nearly the same program. I just like to have them separate.
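    Counting codons is straightforward; a minimal Python sketch follows. It assumes the sequence is read in frame from its first base, and the name codon_usage is chosen for the example only.

```python
from collections import Counter

def codon_usage(sequence):
    """Count in-frame codons, ignoring an incomplete trailing codon."""
    seq = sequence.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

usage = codon_usage("ATGAAAAAATAA")
print(usage)
```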

  • gp_slen

    Sequence length. Sometimes useful. Can also compute the mean and SE.

  • pars

    This program shows that I'm hopeless and don't know anything about Un*x tools. All pars does is to change the "%0D%0A" string into a newline character, because I couldn't find a way around that using sed(1).
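    For comparison, the same substitution in Python is a one-liner. This sketch only mirrors the behaviour described above (replacing the literal string "%0D%0A" with a newline); the function name is hypothetical.

```python
def decode_newlines(text):
    """Replace each URL-encoded CR/LF pair with a single newline character."""
    return text.replace("%0D%0A", "\n")

decoded = decode_newlines(">seq1%0D%0AACGT%0D%0A")
print(decoded)
```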

THANKS

Many thanks go to all the good souls from comp.lang.c, whose advice was necessary to write all these programs, and to Hinrich W. H. Göhlmann and Steve Brewer for encouraging me in my work.

NOTE FROM AUTHOR

I'm not a programmer, and Genpak is amateur work. Everything started because I found myself constantly writing small utilities which could do batch jobs for me, instead of using packages like DNA Star. A graphical user interface is OK, as long as you don't have to process something like 677 sequences -- and 677 is a number which occurs often during my work, because it is the number of genes in the Mycoplasma pneumoniae genome I am working on. There are also many Unix tools, but they are either hard to use, or hard to install, or do not even compile on my Linux boxes.

The programs, I'm sure, have lots of bugs and poor code. For example, I never got the Makefile to work properly. So if you can help me make Genpak a little better, do so -- and mail me.

SEE ALSO

gp_digest(1) gp_cusage(1) gp_gc(1) gp_getseq(1) gp_matrix(1) gp_qs(1) gp_randseq(1) gp_seq2prot(1) gp_slen(1) gp_tm(1) gp_findorf(1) gp_acc(1) gp_mkmtx(1)

DIAGNOSTICS

All Genpak programs complain in situations in which you would also complain, for example when they cannot find a sequence you gave them or the sequence is not valid.

The Genpak programs do not overwrite existing files. I have found this feature very useful :-)

BUGS

I'm sure there are plenty left, so please mail me if you find them. I tried to clean up every bug I could find.

AUTHOR

January Weiner III <january@bioinformatics.org>

FORTUNE

(randomly generated at packaging time :-)

Everyone talks about apathy, but no one does anything about it.