[ssml] SUMMARY: Unusual amino-acid composition ?

Wed Jul 27 10:29:37 EDT 2005

hi folks,

better late than never - here's a summary of the responses to my question of 
mid-June:

-------------------------------------------------------
we are writing up the structure determination of a dimeric human enzyme. while 
going through the model (~750 residues per monomer), i noticed that the 
protein contains rather few lysines (1.8%) and isoleucines (2.7%), and rather 
many prolines (7.5%) and phenylalanines (6.5%). (if i remember correctly, 
there are no low-complexity regions in the sequence.)

i would be grateful for any clues or literature references that might tell us 
if this is statistically to be expected or unusual and -if the latter- what 
could explain it, and whether or not it might have any significance. also, a 
pointer to a table of the average amino-acid composition of soluble human 
proteins (or enzymes) would be useful.
-------------------------------------------------------

the following replies were gratefully received:

-------------------------------------------------------
From: Robbie Joosten <r.joosten at cmbi.ru.nl>

I cannot find an amino acid distribution for human enzymes specifically but 
the data used for the PAM matrix should be a good indication for the normal 
distribution of amino acids: 
http://apps.bioneq.qc.ca/twiki/pub/Knowledgebase/PAM/PAM2.JPG Have you checked 
Swissprot for homologues of your protein? If this strange amino acid 
distribution has been discussed before, the references in the Swissprot 
entries should hold links to the relevant articles.
-------------------------------------------------------
From: Ron Viola <ron.viola at utoledo.edu>

A short paper by Mike Klapper [BBRC 78, 1018 (1977)] looked at amino acid 
composition and distribution in a sample set of proteins.
-------------------------------------------------------
From: Paul Mc Laughlin <paul.mclaughlin at ed.ac.uk>

Maybe not exactly what you want, but the following is a survey of composition 
in the PDB from a good while ago: P. McCaldon & P. Argos, Proteins 4:99-122, 
1988. I half remember something more up to date, but a citation search should 
find it.

I wrote a little web form 
http://chon.bch.ed.ac.uk/paul/SEQUENCE/percentage.html into which you can put 
your sequence and it will tell you the percentage next to each amino acid, the 
value from the above paper, and the deviation. I found it useful for scanning 
a protein sequence before you even start to clone the gene. If it has high P 
and S, then you are advised to think again.

Of course a better web form would show a distribution of percentages for each 
amino-acid and where the percentage for a particular type in your protein 
lies. I don't know if someone has done that.
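
One way to see where a protein's percentage lies within a distribution is to 
compute its percentile rank against a background set of sequences. The 
following is a minimal sketch only: the function names are made up for this 
illustration, and the sequences are toy examples rather than real proteins.

```python
from bisect import bisect_left


def percent(seq, aa):
    """Percentage of residue `aa` (one-letter code) in `seq`."""
    seq = seq.upper()
    return 100.0 * seq.count(aa) / len(seq)


def percentile_rank(query, background, aa):
    """Percentage of background sequences with a lower `aa` content than `query`."""
    values = sorted(percent(s, aa) for s in background)
    return 100.0 * bisect_left(values, percent(query, aa)) / len(values)


# Toy, made-up sequences purely for illustration (not real proteins).
background = ["MKKLLPA", "MKAAGGK", "MPPFFAL", "MKILVAG"]
query = "MPFPFALG"

for aa in "KP":
    print(aa, percentile_rank(query, background, aa))
```

With a real non-redundant sequence set as the background, a rank near 0 or 
100 for a given residue would indicate an unusual content of that residue.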
-------------------------------------------------------
From: Dan Bolser <dmb at mrc-dunn.cam.ac.uk>

I don't know of any tables or literature off hand (I am sure there are 
plenty), but you can quite easily generate the statistics from a non-redundant 
set of sequences (for example UniParc).

Use this 'background' set to generate your 'expected' frequency for each amino 
acid, then compare this to the 'observed' frequency from your protein.

The stats are simply a case of comparing the observed and expected frequencies 
to get some measure of 'unusual' (along with a significance).

Often people quote log(likelihood), coming from the log odds ratio.

It gets rapidly more complicated (technically) when you try to consider 
different 'populations' of amino acids, for example surface amino acids (which 
are known to have a different distribution from core amino acids). However, 
the basic idea is the same.
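
The observed-versus-expected comparison described above can be sketched in a 
few lines of Python. This is an illustration under stated assumptions, not a 
definitive procedure: the function names are invented here, the 'expected' 
values are the McCaldon & Argos figures quoted elsewhere in this thread, the 
length of 750 is the approximate monomer size, and the z-score is a normal 
approximation to the binomial that treats sequence positions as independent.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard residues in a one-letter sequence."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def log_odds(observed, expected):
    """log2(observed / expected) for each residue with non-zero frequencies."""
    return {aa: math.log2(observed[aa] / expected[aa])
            for aa in observed
            if observed[aa] > 0 and expected.get(aa, 0) > 0}


def z_score(obs_frac, exp_frac, length):
    """Normal approximation to the binomial (assumes independent positions)."""
    return (obs_frac - exp_frac) * length / math.sqrt(length * exp_frac * (1 - exp_frac))


# Observed fractions from the original question; expected fractions from the
# McCaldon & Argos values quoted in this thread. Length 750 is approximate.
observed = {"K": 0.018, "I": 0.027, "P": 0.075, "F": 0.065}
expected = {"K": 0.057, "I": 0.052, "P": 0.051, "F": 0.039}

for aa, score in log_odds(observed, expected).items():
    print(aa, round(score, 2), round(z_score(observed[aa], expected[aa], 750), 1))
```

A negative log-odds value means the residue is under-represented relative to 
the background, a positive value that it is over-represented. For a full 
sequence, `composition(seq)` would supply the observed fractions.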
-------------------------------------------------------
From: Marko Hyvonen <marko at cryst.bioc.cam.ac.uk>

Creighton's "Proteins" has a table, referenced as McCaldon & Argos, Proteins 
4:99-122, 1988. So a bit out of date. The relevant bits...

Ile 5.2%
Lys 5.7%
Phe 3.9%
Pro 5.1%
-------------------------------------------------------
From: Daniel Rigden <drigden at liverpool.ac.uk>

SAPS gives you stats related to overall composition, as well as a bunch of 
other things

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=1549558&query_hl=28

http://www.isrec.isb-sib.ch/software/SAPS_form.html
-------------------------------------------------------
From: Mensur Dlakic <mdlakic at montana.edu>

Below I'll paste background AA frequencies that were in use by BLAST and HMMer 
suites of programs last time I checked (probably 5 years ago). I doubt that 
they changed much in the meantime and you can find out for sure by digging 
through the codes of these two programs.
[...]
For example, present AA frequencies from uniref50 database ( 
http://www.pir.uniprot.org/ ) are below and they don't seem to be much 
different (I have the program to count these for Windows and Linux if there is 
interest). Finally, when you say that there is no compositional bias I assume 
you used SEG or something similar. It is worth trying CAST ( 
http://www.ebi.ac.uk/research/cgg/services/cast/ ), which also delineates 
biased regions but in a conceptually different way from SEG ( 
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11120681&query_hl=1 
).
[...]
-------------------------------------------------------
From: Kevin Karplus <karplus at soe.ucsc.edu>

I have some Dirichlet mixtures, trained on the composition of proteins in a 
reduced-redundancy database.  There are many proteins with compositions a long 
way from the background.

One can look at the statistics for
         log P(counts| Dirichlet mixture)
but there is a strong length dependence (roughly linear), so
         log(P(counts| Dirichlet mixture))  / length
is probably the statistic to look at.

One can compute this for a large number of proteins, then compare with the 
value for the specific protein, to see how unusual it is.
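
A length-normalised statistic of this kind can be sketched as follows. This is 
only a sketch under assumptions: the trained mixture parameters are not given 
in this thread, so `weights` and `alphas` below are placeholders (a single 
flat component), the function names are invented, and including the 
multinomial coefficient in the probability is a choice made here rather than 
something stated above.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_dirichlet_multinomial(counts, alpha):
    """log P(counts | one Dirichlet component with parameters alpha)."""
    n = sum(counts)
    a = sum(alpha)
    logp = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        logp += math.lgamma(c + al) - math.lgamma(al) - math.lgamma(c + 1)
    return logp


def log_mixture(counts, weights, alphas):
    """log P(counts | mixture), summed over components with log-sum-exp."""
    terms = [math.log(w) + log_dirichlet_multinomial(counts, al)
             for w, al in zip(weights, alphas)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def per_residue_score(seq, weights, alphas):
    """log P(counts | mixture) divided by the number of standard residues."""
    tally = Counter(seq.upper())
    counts = [tally[aa] for aa in AMINO_ACIDS]
    return log_mixture(counts, weights, alphas) / sum(counts)


# Placeholder mixture: one flat component. Real parameters would come from
# a mixture trained on a reduced-redundancy database.
weights = [1.0]
alphas = [[1.0] * 20]
print(per_residue_score("MKKLLPAFFG", weights, alphas))
```

Computing this score for every protein in a background set gives the 
distribution against which the value for one specific protein can be judged.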
-------------------------------------------------------
From: Bart Hazes <Bart.Hazes at ualberta.ca>

I cut-and-pasted "average amino-acid composition" (including quotes) into 
Google and the second hit was AMINO ACID FREQUENCY 
(http://www.tiem.utk.edu/~gross/bioed/webmodules/aminoacid.htm). It's for 
vertebrates in general rather than humans but we really aren't that different 
from the average marsupial when it comes to amino acid frequencies.

I've come across one case of proteins with reduced lysine content in secreted 
protein toxins that enter their eukaryotic target cells via retrograde 
transport to the ER. Randy Read came up with the idea that the reduced lysine 
content may help protect the toxins from becoming ubiquitinated as they 
entered the cytoplasm through the ERAD pathway. That has subsequently been 
shown to be the case.

Hazes B, Read RJ. (1997). Accumulating evidence suggests that several 
AB-toxins subvert the endoplasmic reticulum-associated protein degradation 
pathway to enter target cells. Biochemistry 36, 11051-4.
-------------------------------------------------------
From: rjoosten at cmbi.ru.nl

There may be a reason for the low lysine and high proline content in the field 
of protein stability. In protein engineering, Lys-to-Arg and any-to-Pro 
mutations are seen as stabilising because less entropy is lost upon protein 
folding. Isoleucine and phenylalanine have an equal number of rotatable bonds, 
but the stacking interactions of phenylalanine are favourable (entropy of 
water). Where does your enzyme occur in the body/cell?
-------------------------------------------------------
From: Namasivayam Gautham <gautham at unom.ac.in>

Somewhat related information:

    Correlations between nucleotide frequencies and amino acid composition in 
115 bacterial genomes.  (2004) Biochem. Biophys. Res. Commun., 315/4, 
1097-1103. DOI information: 10.1016/j.bbrc.2004.01.129 (and references 
therein)
[...]
-------------------------------------------------------
From: Savvas N. Savvides <savvas.savvides at ugent.be>

This is indeed a very intriguing amino acid composition you observe. Table 2 
in a review article on intrinsically unstructured proteins by Tompa [TIBS 
27(10), 527 (2002)] could be interesting to look at. I think that it would also 
be useful to look at these percentages in the context of the form and 
percentage of secondary structure elements in your structure. And also how 
your observed amount of secondary structure compares to average values for the 
folds of the different domains in your protein (assuming of course that you do 
not have a new fold).
-------------------------------------------------------
From: Randy J. Read <rjr27 at cam.ac.uk>

As it happens, Bart Hazes and I published a paper on one situation where low 
lysine content is relevant, i.e. in bacterial toxins that enter cells by 
retrograde trafficking through the endoplasmic reticulum. By having few or no 
lysines, they avoid being ubiquitinated and thereby degraded by the 
proteasome. Don't know how likely this is to be relevant to your protein!

The reference is B. Hazes & R.J. Read (1997), "Accumulating evidence suggests 
that several AB-toxins subvert the endoplasmic reticulum-associated protein 
degradation pathway to enter target cells", Biochemistry 36, 11051-11054.

Just had a look, and we don't seem to cite any references giving average 
lysine content, but my vague recollection is that the number is something like 
7%.
-------------------------------------------------------

a summary of some amino-acid frequency tables:

AA  PAM OWN HMR BLA U50 VER   MIN - MAX   XXX
--- --- --- --- --- --- ---   ---------   ---
ALA 9.6 8.1 7.6 7.5 8.1 7.4   7.4 - 9.6   8.9
GLY 9.0 8.0 6.8 7.0 6.7 7.4   6.7 - 9.0   8.2
LYS 8.5 5.9 5.9 5.4 5.4 7.2   5.4 - 8.5   1.8 --
LEU 8.4 8.1 9.3 8.8 9.7 7.6   7.6 - 9.7   9.7
VAL 7.8 7.1 6.5 6.1 6.4 6.8   6.1 - 7.8   7.2
THR 6.2 6.3 5.7 5.5 5.5 6.2   5.5 - 6.3   4.6
SER 5.7 6.8 7.2 6.8 7.6 8.1   5.7 - 8.1   6.9
ASP 5.3 5.8 5.3 5.0 5.3 5.9   5.0 - 5.9   4.6
GLU 5.3 5.8 6.3 6.0 6.3 5.8   5.3 - 6.3   4.8
PHE 4.5 4.0 4.1 3.5 3.9 4.0   3.5 - 4.5   6.5 +
ASN 4.2 4.6 4.5 4.1 4.4 4.4   4.1 - 4.6   3.4
PRO 4.1 4.7 4.9 4.9 5.2 5.0   4.1 - 5.2   7.5 +
ILE 3.5 5.3 5.7 4.8 5.5 3.8   3.5 - 5.7   2.7
HIS 3.4 2.2 2.2 1.9 2.3 2.9   1.9 - 3.4   3.4
ARG 3.4 4.4 5.2 4.8 5.7 4.2   3.4 - 5.7   5.5
GLN 3.2 3.7 4.0 3.9 4.0 3.7   3.2 - 4.0   4.8
TYR 3.0 3.8 3.2 2.9 3.0 3.3   2.9 - 3.8   4.0
CYS 2.5 1.9 1.7 1.7 1.5 3.3   1.5 - 3.3   1.3
MET 1.2 2.0 2.4 2.0 2.2 1.8   1.2 - 2.4   1.7
TRP 1.2 1.6 1.3 1.1 1.2 1.3   1.1 - 1.6   1.8

PAM = http://apps.bioneq.qc.ca/twiki/pub/Knowledgebase/PAM/PAM2.JPG
OWN = my own numbers based on a unique subset of the PDB (ca. 1998)
HMR = HMMER background amino-acid frequencies <mdlakic at montana.edu>
BLA = BLAST background amino-acid frequencies <mdlakic at montana.edu>
U50 = Uniref50 frequencies <mdlakic at montana.edu>
VER = http://www.tiem.utk.edu/~gross/bioed/webmodules/aminoacid.htm
MIN = minimum value of the above
MAX = maximum value of the above
XXX = value for our protein

note that very large fluctuations occur between the different tables. but by 
any standard, the lysine content of our protein is pretty low. of course, that 
might be related to the fact that this is an extracellular protein (andrade et 
al., j mol biol 276, pp 517 (1998))
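
the comparison against the MIN - MAX range can also be done mechanically. the 
following is a minimal sketch, with the numbers copied from the table above; 
the function name and the choice to rank by the ratio to the nearest end of 
the range are assumptions made for this illustration only. ranked that way, 
LYS, PHE, PRO and ILE come out first.

```python
import math

# residue: (PAM, OWN, HMR, BLA, U50, VER, XXX) in percent, from the table above
TABLE = {
    "ALA": (9.6, 8.1, 7.6, 7.5, 8.1, 7.4, 8.9),
    "GLY": (9.0, 8.0, 6.8, 7.0, 6.7, 7.4, 8.2),
    "LYS": (8.5, 5.9, 5.9, 5.4, 5.4, 7.2, 1.8),
    "LEU": (8.4, 8.1, 9.3, 8.8, 9.7, 7.6, 9.7),
    "VAL": (7.8, 7.1, 6.5, 6.1, 6.4, 6.8, 7.2),
    "THR": (6.2, 6.3, 5.7, 5.5, 5.5, 6.2, 4.6),
    "SER": (5.7, 6.8, 7.2, 6.8, 7.6, 8.1, 6.9),
    "ASP": (5.3, 5.8, 5.3, 5.0, 5.3, 5.9, 4.6),
    "GLU": (5.3, 5.8, 6.3, 6.0, 6.3, 5.8, 4.8),
    "PHE": (4.5, 4.0, 4.1, 3.5, 3.9, 4.0, 6.5),
    "ASN": (4.2, 4.6, 4.5, 4.1, 4.4, 4.4, 3.4),
    "PRO": (4.1, 4.7, 4.9, 4.9, 5.2, 5.0, 7.5),
    "ILE": (3.5, 5.3, 5.7, 4.8, 5.5, 3.8, 2.7),
    "HIS": (3.4, 2.2, 2.2, 1.9, 2.3, 2.9, 3.4),
    "ARG": (3.4, 4.4, 5.2, 4.8, 5.7, 4.2, 5.5),
    "GLN": (3.2, 3.7, 4.0, 3.9, 4.0, 3.7, 4.8),
    "TYR": (3.0, 3.8, 3.2, 2.9, 3.0, 3.3, 4.0),
    "CYS": (2.5, 1.9, 1.7, 1.7, 1.5, 3.3, 1.3),
    "MET": (1.2, 2.0, 2.4, 2.0, 2.2, 1.8, 1.7),
    "TRP": (1.2, 1.6, 1.3, 1.1, 1.2, 1.3, 1.8),
}


def out_of_range(table):
    """Return (residue, ratio) for XXX values outside MIN - MAX.

    ratio is XXX/MIN when below the range and XXX/MAX when above it;
    results are sorted with the largest deviation first.
    """
    hits = []
    for aa, (*refs, xxx) in table.items():
        lo, hi = min(refs), max(refs)
        if xxx < lo:
            hits.append((aa, xxx / lo))
        elif xxx > hi:
            hits.append((aa, xxx / hi))
    return sorted(hits, key=lambda h: abs(math.log(h[1])), reverse=True)


for aa, ratio in out_of_range(TABLE):
    print(f"{aa} {ratio:.2f}")
```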

--gerard

******************************************************************
                         Gerard J.  Kleywegt
     [Research Fellow of the Royal  Swedish Academy of Sciences]
Dept. of Cell & Molecular Biology  University of Uppsala
                 Biomedical Centre  Box 596
                 SE-751 24 Uppsala  SWEDEN

     http://xray.bmc.uu.se/gerard/  mailto:gerard at xray.bmc.uu.se
******************************************************************
    The opinions in this message are fictional.  Any similarity
    to actual opinions, living or dead, is purely coincidental.
******************************************************************