**-m file** - read the promoter matrix from *file*. You should have received a standard *E. coli* matrix file with the distribution (see section **MATRIX** below).
**-M** - compute a new matrix, based on the frequencies of nucleotides.
**-T value** - when computing a matrix, ignore all promoters which score less than *value*.
**-t** - assume that the 5' ends of the genes have been experimentally defined. **gp_matrix** assumes that *exactly* the 11th nucleotide from the *end* of the sequence evaluated is the experimentally defined transcription start. The +1 matrix will not be taken into account (it makes no sense, since we *know* where the 5' end is). Useful for computing new matrices.
**-G value** - adjust the matrix to a GC content equal to *value*. The matrix depends on the expected frequencies of single bases, and those depend on the GC content of the sequences used. Unless you assume that the GC content is roughly equal to 50%, you need to adjust the matrix to your GC content.
**-g** - ignore gap penalties.
**-X** - additional options: set limits for the intervals between the -35, -10 and +1 regions. For example, "-X min1=2" means that the minimal distance between the +1 and -10 regions is 2 bp. The allowed keywords are: min1, max1, min10, max10.

**-S** - don't print out the promoters found. This is useful if you want to compute a new matrix based on a large data set of genes.
**-p** - show the position relative to the absolute start of the sequence, then the promoter sequence (useful for further processing of sequences).
**-N** - show sequence names.
**-H** - print the output in HTML. No HTML headers are printed, which is useful for incorporating **gp_matrix** into CGI scripts.
**-v** - print the version information.
**-d** - print lots of debugging information.
**-h** - show usage information.
**inputfile** - file to process; if not given, standard input is used.
**outputfile** - file to write the data to; if not given, standard output is used.

Historically, promoters were defined by looking at the consensus
sequence in the region upstream from the transcription start. That
is how the `TATA` box / Pribnow box was found. Since there is a
variety of different sequences, and not only the core, six-base
consensus sequences of the -10 and -35 regions play a role
in recruiting the RNAP, various numerical methods have been used
to predict promoter sequences. The first approach was to compute
the frequencies of nucleotides in the aligned, experimentally
defined promoter regions, and write them as a four-row matrix, one
row for each nucleotide. When looking for a promoter in an unknown
sequence, the sequence is aligned along the matrix, and a score
is computed by adding the weights for each nucleotide. This
approach has been further developed (Staden 1984, Hertz 1996).

**gp_matrix** is a program to look for promoters in a set of sequence
files, using the Hertz matrix (see: Hertz, G. and Stormo, G.D.
1996. *Escherichia coli* promoter sequences: analysis and prediction.
Meth. Enzym. 273). Basically, you have a matrix file containing
scores and penalties for nucleotides at different positions in the
supposed -35 and -10 boxes, as well as in the +1 region of a given
sequence (see the **MATRIX** section below).
The program loads sequences from the sequence file, and then scans
each of them using all possible combinations of gap lengths between the boxes
and all possible positions in the sequence, so as to find the
combination which gives the highest score for the sequence. It then
prints a formatted output in the following form:

`#score sequence...[-35 core]...[-10 core]...[start]...`

The '|' characters denote the boundaries of the fragments aligned with the matrices.

Here is a somewhat more detailed description of how the computations are made.

A matrix file has four parts, and all of them contribute to the score a sequence gets during the promoter search. Here is an example which shows how the scoring works. Let's take only one matrix, with some arbitrary numbers (in the real matrix, the numbers are fractions, see below). Look at this example:

This is a matrix from Stormo, 1990, and the numbers are the numbers of occurrences of each given nucleotide at each given position. The sequence is aligned with the matrix and the score is evaluated; then the sequence is shifted one base to the right, aligned, evaluated, and so on. Each time, the score is noted down, and the alignment with the highest score wins:

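| Matrix | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | [-28] | 18 | [1] | [12] | 10 | -29 |
| C | -15 | -31 | -12 | -10 | -2 | -22 |
| G | -18 | -50 | -11 | -7 | -11 | [-36] |
| T | 17 | [17] | 10 | -10 | [-5] | 18 |

Sequence: T A T A A T C G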

This sequence scores -28+17+1+12-5-36 = -39. Let's shift it one base to the right:

Sequence: C T A T A A T C G

Now, it scores 17+18+10+12+10+18 = 85. But if we shift it further...

Sequence: C T A T A A T C G

...it will score only -59. Therefore, we suppose that this sequence has the promoter "TATAAT", which fits nicely to other defined promoters.
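The sliding-window scoring described above can be sketched in a few lines of Python. This is only an illustration, not the code used by **gp_matrix**: the function names `window_score` and `best_window` are made up for this example, and the weights are the ones from the example matrix.

```python
# Example weights from the text: one row per nucleotide, six positions.
MATRIX = {
    "A": [-28, 18, 1, 12, 10, -29],
    "C": [-15, -31, -12, -10, -2, -22],
    "G": [-18, -50, -11, -7, -11, -36],
    "T": [17, 17, 10, -10, -5, 18],
}


def window_score(window, matrix=MATRIX):
    """Add up the weight of each base at its position in the window."""
    return sum(matrix[base][pos] for pos, base in enumerate(window))


def best_window(sequence, matrix=MATRIX):
    """Slide the matrix along the sequence, one base at a time.

    Returns (score, offset, window) for the highest-scoring alignment.
    The sequence must be at least as long as the matrix is wide.
    """
    width = len(matrix["A"])
    hits = [
        (window_score(sequence[i:i + width], matrix), i, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)


print(best_window("CTATAATCG"))  # (85, 1, 'TATAAT')
```

The winning window is "TATAAT" with a score of 85, the same value as in the worked example.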

There are three matrices altogether: one each for the +1, -10 and -35
regions. The distances between these matrices are varied during the
program run, and unless the **-g** option is used, penalties
for unconventional spacings, taken from a fourth matrix, are added as well.
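The search over positions and spacings can be sketched as follows. This is a simplified illustration under several assumptions, not the program's actual code: the names `best_promoter`, `score_box` and `toy_matrix` are hypothetical, the toy matrices simply reward a consensus sequence, the gaps are measured between the end of one box and the start of the next, and the penalties are plain dictionaries keyed by gap length. Passing no penalties corresponds to the behaviour of **-g**; the gap ranges play the role that the min10/max10 and min1/max1 limits presumably play for **-X**.

```python
from itertools import product


def toy_matrix(consensus):
    """Hypothetical weights: +2 for the consensus base, -1 otherwise."""
    return {b: [2 if b == c else -1 for c in consensus] for b in "ACGT"}


def score_box(seq, start, matrix):
    """Score the fragment of seq that starts at 'start' against one matrix."""
    width = len(matrix["A"])
    return sum(matrix[seq[start + i]][i] for i in range(width))


def best_promoter(seq, m35, m10, m1, gaps10, gaps1, penalty10=None, penalty1=None):
    """Try every start position and every pair of gap lengths.

    Returns (score, start, gap between -35 and -10, gap between -10 and +1)
    for the best combination, or None if the sequence is too short.
    """
    penalty10 = penalty10 or {}
    penalty1 = penalty1 or {}
    w35, w10, w1 = (len(m["A"]) for m in (m35, m10, m1))
    best = None
    for g10, g1 in product(gaps10, gaps1):
        span = w35 + g10 + w10 + g1 + w1
        for start in range(len(seq) - span + 1):
            p10 = start + w35 + g10      # where the -10 box begins
            p1 = p10 + w10 + g1          # where the +1 region begins
            score = (score_box(seq, start, m35)
                     + score_box(seq, p10, m10)
                     + score_box(seq, p1, m1)
                     + penalty10.get(g10, 0)
                     + penalty1.get(g1, 0))
            hit = (score, start, g10, g1)
            if best is None or hit > best:
                best = hit
    return best


seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "C" * 6 + "A" + "GG"
m35, m10, m1 = toy_matrix("TTGACA"), toy_matrix("TATAAT"), toy_matrix("A")
print(best_promoter(seq, m35, m10, m1, range(15, 20), range(4, 9)))
# (26, 2, 17, 6)
```

The exhaustive loop is quadratic in the number of allowed gaps, which is fine for the short upstream regions this kind of search is applied to.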

The values in the matrix file are not simply the numbers of occurrences of each given nucleotide, as in the example above, which comes from a work by Stormo (1990). Instead, ln(fo/fe) is computed, where fo is the observed frequency of a given nucleotide at a given position, and fe is the expected frequency of that nucleotide (which depends on the GC % of the sequence). This approach has a deeper statistical meaning: fo/fe is a ratio of probabilities, and ln(r1)+ln(r2)=ln(r1*r2), so adding the logarithms multiplies the ratios (as should be done with probabilities) instead of adding them.
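As a rough sketch of this computation, the following Python turns the base counts at one position into ln(fo/fe) weights. It is an illustration only: the function names are made up, the GC content is given as a fraction between 0 and 1, the expected frequencies assume that G and C (and likewise A and T) are equally frequent, and the pseudocount is an assumption added here to avoid taking the logarithm of zero. Whether **gp_matrix** handles zero counts this way is not stated in this manual.

```python
import math


def expected_frequencies(gc_content):
    """Expected single-base frequencies for a GC fraction between 0 and 1."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return {"A": at, "C": gc, "G": gc, "T": at}


def log_odds_column(counts, gc_content=0.5, pseudocount=1.0):
    """Convert base counts at one position into ln(fo/fe) weights.

    The pseudocount is an assumption of this sketch (it avoids ln(0)).
    """
    expected = expected_frequencies(gc_content)
    total = sum(counts.get(b, 0) for b in "ACGT") + 4 * pseudocount
    return {
        b: math.log(((counts.get(b, 0) + pseudocount) / total) / expected[b])
        for b in "ACGT"
    }


# A T-rich position, scored for 50% GC and for 30% GC.
counts = {"A": 10, "C": 2, "G": 2, "T": 86}
print(log_odds_column(counts, gc_content=0.5))
print(log_odds_column(counts, gc_content=0.3))
```

With a lower GC content, the same observed count of G or C yields a higher weight, because those bases are expected less often. This is the dependence on GC content that the **-G** option adjusts for.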

For each given sequence, **gp_matrix** tries all allowed combinations
of the three matrices, and prints out the one with the highest
score. When a new matrix is being computed, it is updated with this
promoter only if the score is greater than the threshold (set with
option **-T value**, default: -99).

If the **-t** option is set, **gp_matrix** assumes that the
transcription starts have been defined experimentally, and therefore
there is no need to evaluate the +1 matrix, since its position is fixed by
the experimental transcription start. It is advisable to use the
option **-g** together with **-t**.

There are plenty of other methods, including neural networks and very sophisticated analytical approaches. Of course, the Hertz method has one major drawback: it combines the different promoter sequences characteristic of, for example, different sigma factors into one matrix. On the other hand, matrices are very robust and much easier to implement, and can do a very good job in promoter recognition.

All **Genpak** programs complain in situations where you would complain too,
for example when they cannot find a sequence you gave them or when the sequence is not
valid.

The **Genpak** programs do not write over existing files. I have found this
feature very useful :-)

I'm sure there are plenty of bugs left, so please mail me if you find any. I tried to clean up every bug I could find.

January Weiner III <january@bioinformatics.org>