Ends Methods

The following inquiry was sent to the Jmol-Users email list on 2017-12-22. Determination of chain termini was implemented following the methods described below. (original email list post).

Finding the ends of a protein chain is easy for a human being. But writing code to find the ends of the protein chains in any PDB file in a general manner seems challenging. I have procrastinated dealing with this for years because of its seeming complexity. Here I lay out my ideas about how to do this. If anyone has other ideas, especially simpler methods, please let me know!

Why? I would like to have buttons in FirstGlance that zoom in on the terminal residues (with coordinates) of a protein chain for any PDB entry displayed. Also, in the Charge view in FirstGlance, I would like to show whether the terminal amino and carboxy residues that have coordinates are charged -- that is, whether they are the termini of the experimental protein, whether the actual terminal amino acids are missing coordinates, or whether the terminal amino acids are blocked.

I prefer to rely on sequence numbers as LITTLE as possible because they are not required to increase monotonically between the N and C termini, may skip values, and may have insertion codes with the same sequence numbers. See examples in a new article I wrote recently: http://proteopedia.org/w/Unusual_sequence_numbering

For the C terminus, I expected that I could rely upon OXT (a C-terminal oxygen atom). Then if a protein chain has no OXT, it would mean that the C terminal residue(s) are missing coordinates. However, I have stumbled across several examples with zero missing residues, yet no OXT on the C-terminal residue. Rachel Kramer Green at RCSB has confirmed that there is no requirement for an OXT atom on the C-terminal residue. It is up to the authors of the PDB entry. Therefore one cannot rely on OXT.

Thus, when OXT is present, then that residue is indeed an end residue and is charged. But when OXT is absent, you don't know.

Quite recently, I realized that Jmol's groupindex may offer the best solution. If there is a better or simpler solution, I would like to know it! Groupindex is not the sequence number. It is a group/residue number assigned by Jmol for internal use. Jmol assigns somewhat different groupindex numbering schemes to PDB format vs. mmCIF format files.

I will make two assumptions that appear to me to be true.

1. In PDB files (including mmCIF format), residues within a chain are always listed in order from N terminus to C terminus.
2. groupindex increases monotonically in the order in which residues are listed in the PDB file.

METHODS

1. Minimal or maximal groupindex in each protein chain

Let's try for the N terminal residue of chain A by finding the lowest groupindex:

AMin = {chain=A and protein}.groupindex.min; select groupindex=AMin

In determining the minimum groupindex, one cannot exclude HETERO groups because termini may be non-standard amino acids, or not even amino acids (e.g. acetyl, ACE). Examples: the N-terminal residues of 12 of the 24 chains in 4gxu are pyroglutamic acid [PCA]1, a hetero group. The N-terminal residues in 4mdh are ACE, acetyl.
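
To make the selection logic concrete outside of Jmol, here is a minimal sketch in Python (not Jmol script). It assumes each group has already been reduced to a (chain, groupindex, is_protein) tuple; that tuple layout and the function name chain_extremes are my own illustration, not anything Jmol provides. Hetero groups that Jmol flags as protein (assumed here to include terminal groups such as PCA) stay in the candidate pool.

```python
def chain_extremes(residues):
    """Return {chain: (lowest_index, highest_index)} over protein-flagged groups.

    `residues` is an iterable of (chain_id, group_index, is_protein) tuples,
    a hypothetical stand-in for what Jmol reports per group.
    """
    extremes = {}
    for chain, index, is_protein in residues:
        if not is_protein:
            continue  # skip water and non-protein ligands; hetero amino acids stay in
        lo, hi = extremes.get(chain, (index, index))
        extremes[chain] = (min(lo, index), max(hi, index))
    return extremes


# Made-up data: chain A has a hetero N-terminal group (index 0) and a water (index 3).
residues = [
    ("A", 0, True),
    ("A", 1, True),
    ("A", 2, True),
    ("A", 3, False),
    ("B", 4, True),
    ("B", 5, True),
]
print(chain_extremes(residues))  # {'A': (0, 2), 'B': (4, 5)}
```

The lowest and highest indices are only candidates for the termini; the later steps decide whether to accept them.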

2. Is the end candidate peptide-bonded to the adjacent residue?

Next we need to determine if the chain A residue with the lowest groupindex is peptide-bonded to the residue with the next higher groupindex. At the N-terminus I expect that it always will be, but perhaps I will be surprised. For example, if the lowest groupindex is 0:

select groupindex=0 and *.c and connected(groupindex=1 and *.n)

But let's consider the C-terminus, where we will start with the chain A protein residue with the highest groupindex. Ligands and water are also assigned "chain A" and, as far as I know, always have groupindices higher than the protein chain members. Water is excluded by limiting ourselves to protein. But ligands are not.

Some ligands are deemed "protein" by Jmol, presumably because they have an alpha carbon (e.g. 3ES in 2xy9), but are not covalent members of the protein chain.

When the group with the highest index is not covalently peptide-bonded to the chain, our next candidate will be the next lower groupindex, and so forth until we find the highest groupindex that is peptide-bonded to the chain.

3. Mono- and dipeptide ligands

All chains of length 3 or more amino acids are represented with ATOM records, have distinct chain identifiers (A, B, C, etc.), and have SEQRES records.

However, single amino acid ligands (e.g. Gly308 in 4cpa) and dipeptides are by PDB rules deemed HETERO, have no SEQRES records (but do have HET and HETNAM records), and are assigned the same chain identifier as the chain to which they are bound. Rachel Kramer Green of RCSB has confirmed this rule. (There are a handful of cases deposited by PDBe that do not conform to this rule -- they will be remediated.)

A single amino acid is easily excluded since it is not peptide-bonded to the residue having one groupindex lower. However, the C-terminal residue of a dipeptide ligand is peptide-bonded to the residue one groupindex lower, but is not part of the larger protein chain with the same chain identifier. Therefore we must require that any candidate for a terminal residue be peptide-bonded to the adjacent residue, and that the adjacent residue be peptide-bonded to the next adjacent residue (THREE peptide-bonded residues).

For example 3deq contains chain B of length 345 amino acids, 341 of which have coordinates (2 missing at each end). Sequence numbers are 3-343 and groupindices 436-776. Bound to chain B is a dipeptide (HETATM, also deemed "chain B") Ala411-Leu412. Jmol assigns this dipeptide groupindices 777-778 for the PDB format file, and 1366-1367 for the mmCIF file.

Taking the PDB file of 3deq, dipeptide groupindex 778 (the highest protein groupindex in chain B) is peptide-bonded to 777, but 777 is NOT peptide-bonded to 776, and thus 778 is REJECTED as the C-terminus of chain B. When we get to 776, we find it is peptide-bonded to 775, which is peptide-bonded to 774, and thus we ACCEPT 776 as the C-terminal residue.
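
The accept/reject walk for 3deq can be sketched in Python (not Jmol script) as follows. This is only an illustration under stated assumptions: the peptide bonds are given as a set of (lower, higher) groupindex pairs, the adjacent residue is taken to be groupindex plus or minus 1, and the function name find_terminus is hypothetical. The bond set below is reconstructed from the description above (no gaps within 436-776, plus the 777-778 dipeptide), not read from the file.

```python
def find_terminus(indices, bonded, c_terminal=True):
    """Return the first candidate that heads a run of three peptide-bonded groups.

    indices: protein group indices of one chain (ligands deemed protein included).
    bonded:  set of (lower_index, higher_index) pairs joined by a peptide bond.
    """
    step = -1 if c_terminal else 1  # walk inward from the chosen end

    def linked(a, b):
        return (min(a, b), max(a, b)) in bonded

    for candidate in sorted(indices, reverse=c_terminal):
        neighbour, next_neighbour = candidate + step, candidate + 2 * step
        if linked(candidate, neighbour) and linked(neighbour, next_neighbour):
            return candidate
    return None  # no run of three bonded groups found


# 3deq chain B as described above (PDB-format groupindices).
chain_b = range(436, 779)                         # 436-776 protein, 777-778 dipeptide
bonds = {(i, i + 1) for i in range(436, 776)}     # bonds within the protein chain
bonds.add((777, 778))                             # the dipeptide's own peptide bond

print(find_terminus(chain_b, bonds, c_terminal=True))   # 776
print(find_terminus(chain_b, bonds, c_terminal=False))  # 436
```

Index 778 is rejected because 777 is not bonded to 776, and 777 is rejected for the same reason, so the walk settles on 776.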

4. Is the terminal residue with coordinates charged?

Now we have determined the terminal residues with coordinates in each chain. But are their amino or carboxy termini charged?

4A. Is the terminal residue missing coordinates?

If the terminal residue is missing coordinates due to disorder, then the terminal residue with coordinates (determined by the above methods) is NOT charged.

If the C-terminal residue has an OXT atom, it is charged (and there cannot be a more-C-terminal residue that is missing coordinates).

For all N-terminal residues, and C-terminal residues lacking OXT, we need to ask whether the terminal residue in the experimental protein is missing coordinates. (Recall that authors often fail to put OXT atoms on C-terminal residues, and this is "legal".)

I am aware of two possible methods.

1. Determine whether the next more-terminal sequence number (ATOM or HETERO) is missing.
2. Examine the sequence alignment between SEQRES and ATOM records to determine whether the terminal candidate is indeed the terminus in SEQRES.

I plan to use method 1 although in rare cases where near-terminal sequence numbers are not sequential, this would fail.

(mmCIF files contain the SEQRES to ATOM sequence alignment, lacking in PDB format files, but I have not yet adapted FirstGlance to use mmCIF.)

Missing residues are listed in REMARK 465 by chain, residue name, and sequence number.

Take the case of 3deq. We have decided, using the methods above, that Lys343 (groupindex 776) is the C-terminal residue with coordinates in chain B. But we find that chain B sequence number 344 is listed as missing coordinates in REMARK 465. Therefore, we know that Lys343 is not the real C-terminus, and hence it has no carboxy-terminal charge.
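
A minimal Python sketch of the sequence-number check against REMARK 465 follows. It assumes the residue lines can be split on whitespace into an optional model number, residue name, chain, and sequence number (with an optional trailing insertion code); the function names are hypothetical. The sample lines are modeled on the 3deq description above, and the residue names in them are placeholders, not copied from the entry.

```python
import re

# Sequence number, possibly negative, with an optional insertion code letter.
SEQ_TOKEN = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def missing_residues(pdb_lines):
    """Collect (chain, sequence_number) pairs listed in REMARK 465."""
    missing = set()
    for line in pdb_lines:
        if not line.startswith("REMARK 465"):
            continue
        tokens = line[10:].split()
        if len(tokens) not in (3, 4):  # [model] resname chain seq
            continue
        chain, seq = tokens[-2], SEQ_TOKEN.match(tokens[-1])
        if seq and len(chain) == 1:    # header text lines fail this test
            missing.add((chain, int(seq.group(1))))
    return missing


def is_true_terminus(chain, seq_number, missing, c_terminal=True):
    """False if the next more-terminal sequence number is listed as missing."""
    neighbour = seq_number + (1 if c_terminal else -1)
    return (chain, neighbour) not in missing


sample = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     UNK B     1",    # placeholder residue names
    "REMARK 465     UNK B     2",
    "REMARK 465     UNK B   344",
    "REMARK 465     UNK B   345",
]
missing = missing_residues(sample)
print(is_true_terminus("B", 343, missing))  # False
```

As noted above, this check fails in the rare cases where near-terminal sequence numbers are not sequential.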

Although I don't know of an example (and don't know how to search for one), there may be cases where the terminal residue is missing and is not protein. Presumably an N-terminal acetyl group that is missing coordinates would be listed in REMARK 465.

4B. Is the terminal residue blocked?

Fairly often, the N terminus may be acetylated with an ACE hetero group (e.g. 4mdh).

We can determine whether the N-terminus is blocked by asking whether its main chain nitrogen atom is covalently bonded (Jmol "connected" function) to two non-hydrogen atoms (three in the case of proline). If yes, then it is not charged. If no, and if the real N-terminal residue is not missing, then our N-terminus candidate is charged.

The same strategy can be used to determine whether the C terminus is blocked. Is the C-terminal main chain carboxy carbon, lacking OXT, covalently bonded to three non-hydrogen atoms?

In the case of chain C in 5fpk, the C-terminal Ala4 is amidated. So its main chain carboxy carbon is covalently bonded to 3 non-hydrogen atoms: the alpha carbon, the main chain oxygen, and a nitrogen with the group name NH2. Thus, the carboxy terminus of chain C is NOT charged.
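
The neighbor-counting test can be sketched in Python (not Jmol script) as below. It assumes the bonds are available as a dictionary from an atom label to the labels of its bonded atoms, plus an element lookup; those data structures, the atom labels, and the function names are my own illustration. The example mirrors the 5fpk description above.

```python
def heavy_neighbours(atom, bonds, elements):
    """Count the non-hydrogen atoms covalently bonded to `atom`."""
    return sum(1 for other in bonds.get(atom, ()) if elements[other] != "H")


def n_terminus_blocked(n_atom, bonds, elements, is_proline=False):
    """Blocked if the main chain N has 2 heavy neighbours (3 for proline)."""
    threshold = 3 if is_proline else 2
    return heavy_neighbours(n_atom, bonds, elements) >= threshold


def c_terminus_blocked(c_atom, bonds, elements, has_oxt):
    """Blocked if the carboxy C, lacking OXT, has 3 heavy neighbours."""
    return (not has_oxt) and heavy_neighbours(c_atom, bonds, elements) >= 3


# Amidated C-terminal Ala4, as described for chain C of 5fpk.
elements = {"ALA4.CA": "C", "ALA4.C": "C", "ALA4.O": "O", "NH2.N": "N"}
amidated = {"ALA4.C": ["ALA4.CA", "ALA4.O", "NH2.N"]}
free = {"ALA4.C": ["ALA4.CA", "ALA4.O"]}  # same residue without the amide

print(c_terminus_blocked("ALA4.C", amidated, elements, has_oxt=False))  # True
print(c_terminus_blocked("ALA4.C", free, elements, has_oxt=False))      # False
```

A result of False means only that the terminus is not blocked; whether it is charged still depends on the missing-residue check in 4A.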

5. Missing main chain atoms

In rare cases, some atoms of an amino acid have coordinates but not all main chain atoms do. In such cases, the above methods will fail. I will try to make FirstGlance alert the user in all such cases.

1h3o: the C-terminal residues of chains A and C, Thr918, have only a single atom, N. Jmol manages to deem this atom protein, despite it lacking an alpha carbon.
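
One simple way to trigger such an alert, sketched in Python (not Jmol script): compare the atom names present in a terminal candidate against the main chain atoms the methods above depend on. The choice of N, CA, C, and O as the required set is my assumption, and the function name is hypothetical.

```python
BACKBONE = ("N", "CA", "C", "O")  # main chain atoms assumed to be required


def missing_backbone_atoms(atom_names):
    """Return the main chain atom names absent from a residue."""
    present = set(atom_names)
    return [name for name in BACKBONE if name not in present]


# Thr918 in 1h3o has a single atom, N, as described above.
print(missing_backbone_atoms(["N"]))  # ['CA', 'C', 'O']
```

An empty list means the residue has a complete main chain and the bonding tests can be applied.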

Whew!

Comments? Suggestions?

Thanks, Eric