gp_matrix

GP

2000

NAME

gp_matrix - search for promoter sequences

SYNOPSIS

gp_matrix [-m file] [-M] [-T value] [-S] [-t] [-g] [-H] [-q] [-v] [-d] [-h] [inputfile] [outputfile]

OPTIONS

BIOLOGICAL OPTIONS

-m file

read the promoter matrix from the given file. You should have received a standard E. coli matrix file with the distribution (see section MATRIX below).

-M

compute a new matrix, based on the frequencies of nucleotides.

-T value

when computing a matrix, ignore all promoters that score less than value.

-t

assume that the 5' ends of the genes have been experimentally defined. gp_matrix assumes that exactly the 11th nucleotide from the end of the evaluated sequence is the experimentally defined transcription start. The +1 matrix will not be taken into account (it makes no sense, since we know where the 5' end is). Useful for computing new matrices.

-G value

adjust the matrix to a GC content equal to value. The matrix depends on the expected frequencies of single bases, and those depend on the GC content of the sequences used. Unless you assume that the GC content is roughly 50%, as in the case of E. coli, you need to adjust the matrix to your GC content.

-g

ignore gap penalties.

-X

additional options: set limits for the intervals between the -35, -10 and +1 regions. For example, "-X min1=2" means that the minimal distance between the +1 and -10 regions is 2 bp. The allowed keywords are: min1, max1, min10, max10.

OUTPUT OPTIONS

-S

don't print out the promoters found. This is useful if you want to compute a new matrix based on a large data set of genes.

-p

show the position relative to the absolute start of the sequence, then the promoter sequence (useful for further processing of sequences).

-N

show sequence names.

-H

print the output in HTML. No HTML headers are printed: useful for incorporating gp_matrix into CGI scripts.

-v

Prints the version information.

-d

Prints lots of debugging information.

-h

Shows usage information.

inputfile

file to process; if not given, will use standard input

outputfile

file to write the data to; if not given, will use standard output

DESCRIPTION

Historically, promoters were defined by looking at the consensus sequence in the region upstream of the transcription start. That is how the TATA box / Pribnow box was found. Since promoter sequences vary, and more than the six-base core consensus sequences of the -10 and -35 regions plays a role in recruiting the RNA polymerase (RNAP), various numerical methods have been used to predict promoter sequences. The first approach was to compute the frequencies of nucleotides in aligned, experimentally defined promoter regions and write them as a four-row matrix, one row for each nucleotide. When looking for a promoter in an unknown sequence, the sequence is aligned along the matrix and a score is computed by adding the weights for each nucleotide. This approach has been further developed (Staden 1984, Hertz 1996).

gp_matrix is a program that looks for promoters in a set of sequence files, using the Hertz matrix (see: Hertz, G. and Stormo, G.D. 1996. Escherichia coli promoter sequences: analysis and prediction. Meth. Enzym. 273). Basically, you have a matrix file containing scores and penalties for nucleotides at different positions in the supposed -35 and -10 boxes, as well as in the +1 region of a given sequence (see the MATRIX section below). The program loads sequences from the sequence file and scans each of them using all possible combinations of gap lengths between the boxes, at all possible positions in the sequence, to find the combination that gives the highest score for the sequence. It then prints formatted output in the following form:

#score sequence...[-35 core]...[-10 core]...[start]...

The '|' characters denote the boundaries of the fragments matched against the matrices.

MATRIX

Here is a slightly more detailed description of how the computations are made.

A matrix file has four parts, and all of them contribute to the score a sequence gets during the promoter search. Here is an example which shows how the scoring works. Let's take only one matrix, with some arbitrary numbers (in the real matrix, the numbers are fractions, see below). Look at this example:

This is a matrix from Stormo, 1990, and the numbers are the numbers of occurrences of each given nucleotide at each given position. The sequence is aligned with the matrix and the score is evaluated; then the sequence is shifted one base to the right, aligned, evaluated, and so on. Each time, the score is noted down; the alignment with the highest score wins:


	Matrix: A   [-28]  18  [ 1] [12]  10  -29
	        C    -15  -31  -12  -10   -2  -22
	        G    -18  -50  -11   -7  -11 [-36]
	        T     17  [17]  10  -10  [-5]  18

	Sequence: T    A    T    A    A    T    C    G

This sequence scores -28+17+1+12-5-36 = -39. Let's shift it one base to the right:

Sequence:  C   T    A    T    A    A    T    C    G

Now, it scores 17+18+10+12+10+18 = 85. But if we shift it further...

Sequence:      C    T    A    T    A    A    T    C    G

...it will score only -59. Therefore, we suppose that this sequence has the promoter "TATAAT", which fits nicely with other defined promoters.
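The sliding-window scoring described above can be sketched in a few lines of Python. This is only an illustration using the example weights from the text; the names MATRIX, window_score and best_window are made up here and do not come from the gp_matrix source.

```python
# Example weights from the text above, one row per nucleotide.
MATRIX = {
    "A": [-28, 18, 1, 12, 10, -29],
    "C": [-15, -31, -12, -10, -2, -22],
    "G": [-18, -50, -11, -7, -11, -36],
    "T": [17, 17, 10, -10, -5, 18],
}
WIDTH = 6  # number of columns in the matrix


def window_score(window):
    """Sum the weight of each base at its position in the window."""
    return sum(MATRIX[base][i] for i, base in enumerate(window))


def best_window(sequence):
    """Slide the matrix along the sequence; return (score, offset, site)."""
    candidates = [
        (window_score(sequence[i:i + WIDTH]), i, sequence[i:i + WIDTH])
        for i in range(len(sequence) - WIDTH + 1)
    ]
    return max(candidates)


score, offset, site = best_window("CTATAATCG")
print(score, offset, site)  # highest-scoring alignment: 85 1 TATAAT
```

The window "TATAAT" wins with the score of 85 computed by hand above.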

In total there are three matrices, one each for the +1, -10 and -35 regions. The distances between these matrices are varied during the program run and, unless the -g option is used, penalties for unconventional spacings are added from a fourth matrix.
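To make the search over spacings concrete, here is a rough Python sketch of trying every start position and every allowed pair of gaps, adding a spacing penalty to the three box scores. Everything in it is an assumption made for illustration: the toy matrices built by motif_matrix, the penalty table, the gap ranges and all function names are hypothetical, and a gap is taken to mean the number of bases between two boxes. It is not the actual gp_matrix implementation.

```python
from itertools import product


def motif_matrix(motif):
    """Toy weights (hypothetical): +1 for the motif base, -1 otherwise."""
    return {b: [1 if b == m else -1 for m in motif] for b in "ACGT"}


def site_score(seq, start, matrix):
    """Score the site that begins at 'start' against one matrix."""
    width = len(matrix["A"])
    return sum(matrix[b][i] for i, b in enumerate(seq[start:start + width]))


def best_promoter(seq, m35, m10, m1, gaps_35_10, gaps_10_1, penalties=None):
    """Try all positions and gap lengths; penalties=None mimics the -g idea."""
    w35, w10, w1 = len(m35["A"]), len(m10["A"]), len(m1["A"])
    best = None
    for start, g1, g2 in product(range(len(seq)), gaps_35_10, gaps_10_1):
        pos10 = start + w35 + g1          # start of the -10 box
        pos1 = pos10 + w10 + g2           # start of the +1 region
        if pos1 + w1 > len(seq):
            continue                      # combination runs off the sequence
        total = (site_score(seq, start, m35) + site_score(seq, pos10, m10)
                 + site_score(seq, pos1, m1))
        if penalties is not None:
            total += penalties["35-10"].get(g1, 0) + penalties["10-1"].get(g2, 0)
        if best is None or total > best[0]:
            best = (total, start, g1, g2)
    return best


seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "C" * 6 + "A" + "GG"
penalties = {"35-10": {g: -abs(g - 17) for g in range(15, 20)}, "10-1": {}}
result = best_promoter(seq, motif_matrix("TTGACA"), motif_matrix("TATAAT"),
                       motif_matrix("A"), range(15, 20), range(4, 9), penalties)
print(result)  # (total score, start of -35 box, gap -35/-10, gap -10/+1)
```

Passing penalties=None skips the spacing penalties, which corresponds to the idea behind the -g option.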

The values in the matrix file are not simply numbers of occurrences of each given nucleotide, as in the example above, which comes from a work by Stormo (1990). Instead, ln(fo/fe) is computed, where fo is the observed frequency of a given nucleotide at a given position, and fe is the expected frequency of that nucleotide (which depends on the GC % of the sequence). This approach has a deeper statistical meaning, because fo/fe is a ratio of probabilities, and ln(p1)+ln(p2)=ln(p1*p2), so the ratios are multiplied (as they should be) instead of being added.
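As a rough illustration of the ln(fo/fe) weights, the following Python sketch turns the nucleotide counts of one matrix column into log ratios. It rests on assumptions that are not taken from gp_matrix: the expected frequencies are derived from the GC fraction by setting G = C = GC/2 and A = T = (1-GC)/2, and a pseudocount is added so that a count of zero does not lead to ln(0). The function names are hypothetical.

```python
import math


def expected_frequencies(gc):
    """Expected base frequencies for a GC fraction, assuming A=T and G=C."""
    return {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "G": gc / 2, "C": gc / 2}


def log_odds_column(counts, gc=0.5, pseudocount=1.0):
    """Return ln(fo/fe) for each base in one column of the matrix."""
    fe = expected_frequencies(gc)
    total = sum(counts.values()) + 4 * pseudocount
    return {
        base: math.log(((counts[base] + pseudocount) / total) / fe[base])
        for base in "ACGT"
    }


column = {"A": 90, "C": 2, "G": 3, "T": 5}
print(log_odds_column(column, gc=0.5))  # A gets a positive weight
print(log_odds_column(column, gc=0.4))  # AT-rich background: A weighs less
```

With a lower GC content, A and T are expected more often by chance, so the same counts give them smaller weights. This is why the matrix needs adjusting for genomes whose GC content is far from 50%.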

For each given sequence, gp_matrix tries all allowed combinations of the three matrices and prints out the one with the highest score. If this score is greater than the threshold (set with option -T value, default: -99), then the new matrix is adjusted.

If the -t option is set, gp_matrix assumes that the transcription starts have been defined experimentally, so there is no need to evaluate the +1 matrix, since its position is fixed by the experimental transcription start. It is advisable to use the option -g together with -t.
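The threshold rule can be pictured with a short Python sketch in which a promoter contributes to the per-position counts of a new matrix only when its score exceeds the threshold. The data layout and the function name update_counts are assumptions made for this example; only the default of -99 comes from the text.

```python
def update_counts(counts, site, score, threshold=-99):
    """Add the site to the per-position counts if it scores above threshold."""
    if score <= threshold:
        return False  # promoter ignored
    for position, base in enumerate(site):
        counts[position][base] += 1
    return True


# One dictionary of counts per matrix column (six columns here).
counts = [dict.fromkeys("ACGT", 0) for _ in range(6)]
update_counts(counts, "TATAAT", 85)    # accepted
update_counts(counts, "GGGCCC", -120)  # below the default threshold
print(counts[0])
```

The resulting counts would then be converted into frequencies and log ratios to give the new matrix.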

ALTERNATIVES

There are plenty of other methods, including neural networks and very sophisticated analytical approaches. Of course, the Hertz method has one major drawback: it merges the different promoter sequences that are characteristic of, for example, different sigma factors into one matrix. On the other hand, matrices are very robust and much easier to implement, and can do a very good job in promoter recognition.

DIAGNOSTICS

All Genpak programs complain in situations where you would complain too, for example when they cannot find a sequence you gave them or when the sequence is not valid.

The Genpak programs do not write over existing files. I have found this feature very useful :-)

BUGS

I'm sure there are plenty left, so please mail me if you find them. I tried to clean up every bug I could find.

AUTHOR

January Weiner III <january@bioinformatics.org>