1D protein structure prediction : 00 README

Main.00README History

Hide minor edits - Show changes to output

April 11, 2006, at 01:08 AM by 133.39.116.42 -

Changed lines 1-2 from:

! 00README

to:

Restore

April 11, 2006, at 01:07 AM by 133.39.116.42 -

Changed lines 3-4 from:

to:

Restore

April 11, 2006, at 01:06 AM by 133.39.116.42 -

Added lines 1-264:

! 00README

[=

* 0. Introduction
This directory contains a computer program for predicting one-dimensional
protein structures (secondary structures [SS], contact numbers [CN], and
residue-wise contact orders [RWCO]) by the method of critical random networks
described in:

Ref. 1 (Description of the software)
"CRNPRED: Highly accurate prediction of one-dimensional protein structures
by large-scale critical random networks."
Kinjo AR, Nishikawa K.
submitted (2006)

and

Ref. 2 (Method of critical random networks)
"Predicting secondary structures, contact numbers, and residue-wise contact
orders of native protein structure from amino acid sequence using critical
random networks."
Kinjo AR, Nishikawa K.
BIOPHYSICS, 1:67-74 (2005) (DOI: 10.2142/biophysics.1.67).

This software is in public domain. You can use, modify and/or destroy it freely,
but we do not take any responsibility for the consequences of your use.

* 1. INSTALLING CRNPRED.
To install the CRNPRED program, you need the following:
(0) UNIX-like operating system (Linux, MacOS X, *BSD, etc.)
(1) bash (or zsh)
(2) make
(3) gcc
(4) PSI-BLAST and related databases (amino acid sequences and BLOSUM
scoring matrices).

First, set the environment variable CRNPRED_DIR to this directory (that is,
the directory containing this file "00README").
If you are using sh, ksh, bash, or zsh, write

export CRNPRED_DIR=/path/to/this/directory

in your ~/.profile and do
% . ~/.profile

If you are using csh or tcsh, write

setenv CRNPRED_DIR /path/to/this/directory

in your ~/.cshrc and do
% source ~/.cshrc

To compile the program, do

% (cd ${CRNPRED_DIR}/src; make install)

Then the program named "xpredm" is installed under the directory
${CRNPRED_DIR}/bin.

After xpredm has been installed, test it by running

% ${CRNPRED_DIR}/bin/xpredm sample/d3nul__.prof > hoge.out

Compare hoge.out with sample/d3nul__.out. There are a few sample
inputs and outputs in the directory named "sample".

* 2. RUNNING CRNPRED.
Make sure you have set the environment variable CRNPRED_DIR appropriately.
A utility shell script "run_crn.sh" is available for your convenience.
If you have FASTA format amino acid sequence file (say, "test.seq"), do

${CRNPRED_DIR}/bin/run_crn.sh -d uniref100 test.seq

where "uniref100" is the sequence database used by PSI-BLAST.
Then, after some time, you have a file named "test.seq.d.out" which contains
the result of the prediction. If it does not work, check the content of
"run_crn.sh" and modify the environment variables such as BLASTDB, BLASTMAT,
and CRNPRED_DIR, or you may have to change the first line "#!/bin/sh" to
something like "#!/usr/bin/env bash" or "#!/usr/bin/env/ zsh".
Run
${CRNPRED_DIR}/bin/run_crn.sh -h
to see other options.

Alternatively, you can directly run the program. You first need to run
PSI-BLAST to make a position-specific scoring matrix:

blastpgp -d nr -h 0.0005 -j 3 -i test.seq -Q test.prof > /dev/null

Then do

${CRNPRED_DIR}/bin/xpredm test.prof > test.out

The result is saved in "test.out".

* 3. INTERPRETING THE RESULTS.
Below is an example of prediction.

* Lines starting with "AA" show the amino acid sequence you fed.
* Lines starting with "SS" show the predicted secondary structures
where "H", "E", and "C" mean "alpha-helix", "beta-strand", and "coils",
respectively.
* Lines starting with "CN" show the predicted contact numbers in 2-state
description where "B" and "E" mean "buried" and "exposed", respectively.
The threshold values are the average contact number for each residue
type (see Appendix below for the list of the average contact numbers).
* Lines after "># AA : SS P_H P_E P_C : CN : RWCO" are the details of the
prediction:
o The column corresponding to "AA" indicates the residue numbers
and the amino acid residues.
o The column corresponding to "SS" indicates the predicted secondary
structure followed by the ad hoc probability for each secondary
structure class (i.e., "P_H" for the probability for the residue to
be in the alpha-helix class, etc.).
o The column corresponding to "CN" indicates the predicted contact
numbers in 2-state description ("B" or "E") followed by the real
predicted contact numbers.
o The column corresponding to "RWCO" indicates the predicted
residue-wise contact orders (real numbers).

---------sample output starts here--------
>prediction for: test.prof

# * * * * * *
AA: SWQSYVDDHLMCDVEGNHLTAAAILGQDGSVWAQSAKFPQLKPQEIDGIKKDFEEPGFLA
SS: CCHHHHHHHHHCCCCCCCCHEEEEECCCCCEEEECCCCCCCCHHHHHHHHHCCCCCCCCC
CN: BBBBBBEBBEBBBBBBBBEEEEEEEEEEEEEEEEEBBBBBEEBBEBBBEEBEBBBBBBBB
# * * * * * *
AA: PTGLFLGGEKYMVIQGEQGAVIRGKKGPGGVTIKKTNQALVFGFYDEPMTGGQCNLVVER
SS: CCEEEECCCEEEEEECCCCEEEEECCCCCEEEEEECCCEEEEEEECCCCCCHHHHHHHHH
CN: EBEEBEBBEEEEEEEBBBBBBEEEEEBBBEEEEEEEEEEEEEEEBBBBBBBBEEEBEEEB
# *
AA: LGDYLIESEL
SS: HHHHHHHCCC
CN: EEEBEBBBBB
//

># AA : SS P_H P_E P_C : CN : RWCO
1 S : C 11 7 82 : B 14 : 840
2 W : C 23 10 67 : B 22 : 1221
3 Q : H 59 11 30 : B 18 : 864
4 S : H 79 8 12 : B 18 : 860
5 Y : H 86 6 7 : B 25 : 1199
6 V : H 89 5 6 : B 27 : 1276
7 D : H 90 5 6 : E 21 : 855
8 D : H 90 4 6 : B 17 : 728
9 H : H 89 5 6 : B 22 : 954
10 L : H 85 6 8 : E 30 : 1188
11 M : H 72 9 18 : B 24 : 850
12 C : C 44 11 46 : B 22 : 826
13 D : C 18 8 73 : B 18 : 669
14 V : C 10 7 83 : B 22 : 751
15 E : C 8 6 86 : B 17 : 593
16 G : C 8 7 85 : B 18 : 640
17 N : C 10 8 82 : B 17 : 696
18 H : C 16 11 73 : B 24 : 808
19 L : C 30 16 54 : E 32 : 1103
20 T : H 45 22 32 : E 26 : 962
21 A : E 37 43 20 : E 27 : 1017
22 A : E 15 75 10 : E 37 : 1265
23 A : E 7 88 5 : E 38 : 1286
24 I : E 6 89 5 : E 41 : 1341
.
.
.
.
---------sample output ends here--------

* 4. CONTACT INFORMATION

Akira Kinjo
Center for Information Biology and DNA Data Bank of Japan,
National Institute of Genetics,
Mishima, 411-8540, JAPAN
email: akinjo @ genes . nig . ac . jp

* Appendix A.
The average contact number of each residue type is listed below:
-------------------
25.430 , /* A */
21.038 , /* R */
20.093 , /* N */
18.594 , /* D */
29.647 , /* C */
20.206 , /* Q */
18.008 , /* E */
22.505 , /* G */
23.572 , /* H */
29.469 , /* I */
28.173 , /* L */
18.452 , /* K */
26.466 , /* M */
28.057 , /* F */
20.350 , /* P */
21.420 , /* S */
22.747 , /* T */
26.913 , /* W */
26.627 , /* Y */
28.656 , /* V */
-------------------

* Appendix B.
** Faster but less accurate predictions.
The default implementation of CRNPRED uses 5000 dimensional state vectors for
critical random networks. This makes the prediction process quite slow when
you use the program on old computers or when you predict large proteins.

If you want predictions quickly, there are two options:
(1) linear predictor or (2) 2000 dimensional state vectors.

*** Using linear predictor
The linear predictor as described in Ref. 2 is implemented as a separate
program named "lpredm" which is installed along with xpredm (CRNPRED).
Use it as follows:
${CRNPRED_DIR}/bin/lpredm test.prof > test.out

*** Using 2000 dimensional CRNPRED
To use CRNPRED with 2000 dimensional state vectors, you need to recompile the
program. Do it as follows:

cd ${CRNPRED_DIR}/src
make realclean
make NDIM=2000 install
cd ..
cp w2000/WMATS .
cp w2000/WMAT_ENS .

This produces the executable file "xpredm" just like before, but it now
uses 2000-dimensional state vectors.

*** Comparison of predictors
Here is a brief summary of speed and accuracy of the linear predictor (lpredm),
xpredm(2000), and xpredm (5000). The CPU times were measured for
the sample file "sample/d8abp__.prof" (305 AA) on Mac OS X (PPC G5, 2.5GHz).
The CPU time is (almost) linearly proportional to the protein length.

program speed accuracy note
--------------------------------------------------------------
xpredm very slow SS:Q3=81 default
(5000) 5min52s CN:Cor=0.75
RWCO:Cor=0.61

xpredm slow SS:Q3=79
(2000) 1min12s CN:Cor=0.74
RWCO:Cor=0.61

lpredm fast SS:Q3=76
0.558s CN:Cor=0.72
RWCO:Cor=0.59
--------------------------------------------------------------

Note that the accuracies are the average values based on a benchmark.
The difference between Q3=81 and Q3=79 may seem insignificant on average, but
there can be a big difference for individual predictions [e.g., an incorrectly
predicted alpha helix with xpred(2000) may be correctly predicted as a beta
strand with xpred(5000)].

# Local variables:
# mode: outline
# End:

=]

Restore

Edit SideBar \| Edit Page \| Page History	Page last modified on April 11, 2006, at 01:08 AM	Recent Changes (All) \| Admin