Historically, promoters were defined by looking at the consensus
sequence in the region upstream from the transcription start. That
is how the TATA box / Pribnow box was found. Since promoter
sequences vary, and more than just the six-base core consensus
sequences of the -10 and -35 regions plays a role in binding RNAP,
various numerical methods have been used to predict promoter
sequences. The first approach was to compute the frequencies of
nucleotides in the aligned, experimentally defined promoter regions,
and write them as a four-row matrix, one row for each nucleotide.
When looking for a promoter in an unknown sequence, the sequence is
aligned along the matrix, and a score is computed by adding the
weights for each nucleotide. This approach has been further
developed (Staden 1984, Hertz 1996).
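As a rough illustration of the first step described above, the following Python sketch counts nucleotides per position in a set of aligned sites. The function name count_matrix and the four example hexamers are invented for illustration; they are not taken from gp_matrix or from any published data set.

```python
def count_matrix(aligned_sites):
    """Return {nucleotide: [count at position 0, 1, ...]} for equal-length sites.

    Only the letters A, C, G and T are handled; any other character
    raises a KeyError.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    # One row per nucleotide, one column per position in the alignment.
    matrix = {nt: [0] * length for nt in "ACGT"}
    for site in aligned_sites:
        for pos, nt in enumerate(site.upper()):
            matrix[nt][pos] += 1
    return matrix


# Hypothetical aligned -10 hexamers, made up for this example only.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
counts = count_matrix(sites)
for nt in "ACGT":
    print(nt, counts[nt])
```

Each printed row corresponds to one row of the four-row matrix.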
gp_matrix is a program that looks for promoters in a set of sequence files, using the Hertz matrix (see: Hertz, G. and Stormo, G.D. 1996. Escherichia coli promoter sequences: analysis and prediction. Meth. Enzym. 273). Basically, you have a matrix file containing scores and penalties for nucleotides at different positions in the supposed -35 and -10 boxes, as well as in the +1 region of a given sequence (see the MATRIX section below). The program loads sequences from the sequence file, and then scans each of them using all possible combinations of gap lengths between the boxes and all possible positions in the sequence, so as to find the combination that gives the highest score for the sequence. It then prints a formatted output in the following form:
#score sequence...[-35 core]...[-10 core]...[start]...
The '|' characters denote the boundaries of the fragments matched by the matrices.
Here is a slightly more detailed description of how the computations are made.
There are four parts of a matrix, and all of them contribute to the score a given sequence gets during the promoter search. Here is an example which shows how the scoring works. Let's take only one matrix, with some arbitrary numbers (in the real matrix, the numbers are fractions, see below).
This is a matrix from Stormo, 1990, and the numbers are the numbers of occurrences of each given nucleotide at each given position. The sequence is aligned with the matrix, then the scores are evaluated, the sequence is shifted one base to the right, aligned, evaluated etc. Each time, the score is noted down; the alignment with the highest score wins:
Matrix:
     A   [-28]    18    [ 1]   [12]    10    -29
     C    -15    -31    -12    -10     -2    -22
     G    -18    -50    -11     -7    -11   [-36]
     T     17   [17]     10    -10    [-5]    18
Sequence: T A T A A T C G
This sequence scores -28+17+1+12-5-36 = -39 (the sum of the bracketed entries). Let's shift it one base to the right:
Sequence: C T A T A A T C G
Now, it scores 17+18+10+12+10+18 = 85. But if we shift it further...
Sequence: C T A T A A T C G
...it will score only -59. Therefore, we suppose that this sequence has the promoter "TATAAT", which agrees nicely with other defined promoters.
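The sliding procedure can be sketched in a few lines of Python. The matrix values and the sequence CTATAATCG are the ones from the example above; the function names window_score and best_alignment are made up for this sketch and are not part of gp_matrix.

```python
# Example matrix from the text: one row per nucleotide, one column per position.
MATRIX = {
    "A": [-28, 18, 1, 12, 10, -29],
    "C": [-15, -31, -12, -10, -2, -22],
    "G": [-18, -50, -11, -7, -11, -36],
    "T": [17, 17, 10, -10, -5, 18],
}


def window_score(matrix, window):
    """Add up the matrix entry picked by each nucleotide of the window."""
    return sum(matrix[nt][pos] for pos, nt in enumerate(window))


def best_alignment(matrix, sequence):
    """Slide the matrix along the sequence; return (best score, offset)."""
    width = len(matrix["A"])
    scored = [
        (window_score(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


sequence = "CTATAATCG"
score, offset = best_alignment(MATRIX, sequence)
print(score, offset, sequence[offset:offset + 6])  # prints: 85 1 TATAAT
```

The winning window is the one starting at the second base, which is the TATAAT alignment scoring 85.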
There are three matrices in total, one each for the +1, -10 and -35 regions. The distances between these matrices are varied during the program run and, unless the -g option is used, penalties for unconventional spacings, taken from a fourth matrix, are added as well.
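The search over positions and spacings might be sketched as follows. This is only an illustration of the idea, not the code of gp_matrix: the function names, the toy +2/-1 matrices, the gap ranges and the penalty tables are all assumptions invented for the example. Passing penalties=None loosely corresponds to ignoring the spacing penalties.

```python
from itertools import product


def toy_matrix(motif, match=2, mismatch=-1):
    """Hypothetical matrix: reward the motif's nucleotide, punish the others."""
    return {nt: [match if nt == m else mismatch for m in motif] for nt in "ACGT"}


def window_score(matrix, window):
    return sum(matrix[nt][pos] for pos, nt in enumerate(window))


def scan(sequence, m35, m10, mstart, gaps_a, gaps_b, penalties=None):
    """Try every start position and every pair of gap lengths.

    gaps_a: allowed spacings between the -35 and -10 boxes
    gaps_b: allowed spacings between the -10 box and the +1 region
    penalties: optional pair of dicts (gap length -> penalty), one per spacer
    Returns (score, position of -35 box, gap_a, gap_b) of the best layout,
    or None if no layout fits into the sequence.
    """
    w35, w10, wst = (len(m["A"]) for m in (m35, m10, mstart))
    best = None
    for pos, gap_a, gap_b in product(range(len(sequence)), gaps_a, gaps_b):
        p10 = pos + w35 + gap_a          # start of the -10 box
        pst = p10 + w10 + gap_b          # start of the +1 region
        if pst + wst > len(sequence):
            continue                     # layout runs off the sequence end
        score = (window_score(m35, sequence[pos:pos + w35])
                 + window_score(m10, sequence[p10:p10 + w10])
                 + window_score(mstart, sequence[pst:pst + wst]))
        if penalties is not None:
            score += penalties[0].get(gap_a, 0) + penalties[1].get(gap_b, 0)
        candidate = (score, pos, gap_a, gap_b)
        if best is None or candidate > best:
            best = candidate
    return best


# Made-up test sequence: -35 box, 17-base spacer, -10 box, 6-base spacer, start.
seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "G" * 6 + "CAT" + "GG"
spacing = ({15: -3, 16: -1, 17: 0, 18: -1, 19: -3}, {5: -2, 6: 0, 7: 0, 8: -2})
result = scan(seq, toy_matrix("TTGACA"), toy_matrix("TATAAT"), toy_matrix("CAT"),
              range(15, 20), range(5, 9), penalties=spacing)
print(result)
```

The brute-force triple loop is cheap here because the gap ranges are small; the real program may organize the search differently.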
The values in the matrix file are not simply the numbers of occurrences of each given nucleotide, as in the example above, which comes from a work by Stormo (1990). Instead, ln(fo/fe) is computed, where fo is the observed frequency of a given nucleotide at a given position, and fe is the expected frequency of that nucleotide (which depends on the GC % of the sequence). This approach has a deeper statistical meaning, because fo/fe is a ratio of probabilities, and ln(r1)+ln(r2)=ln(r1*r2), so adding the logarithms multiplies the per-position ratios (as it should) instead of adding them.
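A minimal sketch of the ln(fo/fe) conversion is given below. The function name log_odds_matrix, the example counts and in particular the pseudocount are assumptions made for this illustration; the pseudocount is added only so that a count of zero does not lead to ln(0), and it is not known whether the matrix file shipped with gp_matrix was built this way.

```python
import math


def log_odds_matrix(counts, gc_fraction=0.5, pseudocount=1.0):
    """Convert a count matrix into ln(fo/fe) weights.

    fe is derived from the GC fraction: G and C each get gc/2,
    A and T each get (1 - gc)/2.  The pseudocount is an assumption
    of this sketch, spread over the nucleotides in proportion to fe.
    """
    expected = {
        "A": (1 - gc_fraction) / 2, "T": (1 - gc_fraction) / 2,
        "C": gc_fraction / 2, "G": gc_fraction / 2,
    }
    n_sites = sum(counts[nt][0] for nt in "ACGT")
    length = len(counts["A"])
    weights = {}
    for nt in "ACGT":
        row = []
        for pos in range(length):
            observed = (counts[nt][pos] + pseudocount * expected[nt]) / (n_sites + pseudocount)
            row.append(math.log(observed / expected[nt]))
        weights[nt] = row
    return weights


# Hypothetical counts from ten aligned sites, two positions wide.
counts = {"A": [8, 1], "C": [0, 1], "G": [0, 1], "T": [2, 7]}
weights = log_odds_matrix(counts)

# Adding the logarithms is the same as multiplying the ratios.
total = weights["A"][0] + weights["T"][1]
print(round(total, 4), round(math.exp(total), 4))
```

Nucleotides seen more often than expected get positive weights, and rarer ones get negative weights.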
For each given sequence, gp_matrix tries all allowed combinations of the three matrices, and prints out the one with the highest score. If this score is greater than the threshold (set with option -T value, default: -99), then the new matrix is adjusted.
If the -t option is set, gp_matrix assumes that the transcription starts have been defined experimentally, so there is no need to evaluate the +1 matrix, since its position is fixed by the experimental transcription start. It is advisable to use the option -g together with -t.
There are plenty of other methods, including neural networks and very sophisticated analytical approaches. Of course, the Hertz method has one major drawback: it lumps promoter sequences characteristic of, for example, different sigma factors together into one matrix. On the other hand, matrices are very robust and much easier to implement, and can do a very good job in promoter recognition.
All Genpak programs complain in situations in which you would also complain, for example when they cannot find a sequence you gave them or when the sequence is not valid.
The Genpak programs do not write over existing files. I have found this feature very useful :-)
I tried to clean up every bug I could find, but I'm sure there are plenty left, so please mail me if you find any.
January Weiner III <january@bioinformatics.org>