ViewVC Help
View File | Revision Log | Show Annotations | Root Listing
root/cd-hit/Algorithm.cd-hi
Revision: 1.1.1.1 (vendor branch)
Committed: Sat Feb 7 10:55:46 2004 UTC (17 years, 8 months ago) by dmb
Branch: main, MAIN
CVS Tags: start, HEAD
Changes since 1.1: +0 -0 lines
Log Message:
First import

Line File contents
1
2 CD-HI, Program algorithm and implementation
3 ===========================================
4
5 This is the detailed algorithm of cd-hi. The reference is:
6
7 Clustering of highly homologous sequences to reduce
8 the size of large protein databases by
9 Weizhong Li, Lukasz Jaroszewski & Adam Godzik
10 Bioinformatics (2000) accepted
11
12 Please first read above paper for background of this program.
13
14
15 1. Greedy incremental algorithm to cluster sequence databases
16 =============================================================
17
18 This algorithm has been used to create the PDB_select and nrdb90 by
19 Holm et al. First, sequences are sorted in order of decreasing length.
20 The longest one becomes the representative of the first cluster.
21 Then each remaining sequence is compared to the existing representatives.
22 If the sequence similarity with any representative is above a given threshold,
23 it is included into that cluster. Otherwise a new cluster starts
24 with it as representative.
25
26
27 2. short words filter and index table
28 ==============================
29
30 Here, we use pentapeptides as example of short words.
31
32 2.1 decomposition of sequences
33
34 The sequences can be decomposed into short words, for example,
35 ACDEFGHIKLMN has these
36 ACDEF CDEFG DEFGH EFGHI FGHIK GHIKL HIKLM and IKLMN 9 words
37
38 2.2 short word filter
39
40 If two proteins share certain sequence identity, they should have
41 at least a certain number of identical pentapeptide. For example,
42 two sequences having 85% identical residues over a 100-residue
43 window will have at least 25 pentapeptides.
44
45 So for a sequence of certain length, a threshold of sequence identity
46 is related with a threshold of the number of identical pentapeptide.
47
48 Before two sequences are aligned, they are first decomposed into
49 pentapeptides, if their identical pentapeptide is lower than required
50 number, they don't need to be aligned.
51
52 2.3 index table
53
54 For pentapeptide, since each residue can have 21 possibilities,
55 there are will be 21x21x21x21x21 (about 4 millon) of them.
56 An index table with each entry representing a unique pentapeptide
57 is generated, Each entry in the index table holds numbers of the
58 representative sequences containing this pentapeptide.
59
60 So an index table is like
61 pentapeptide content
62 index
63 -------------------------------------------------------------------------
64 AAAAA | sequence #2 has 1, sequence #20 has 2, and so on.
65 AAAAR | sequence #15 has 2, sequence #22 has 1, and so on.
66 ... | ...
67 ... | ...
68 ... | ...
69 WHCKL | no sequence contains this pentapeptide.
70 ... | ...
71 ... | ...
72 -------------------------------------------------------------------------
73
74
75 3. detailed implementation of this algorithm
76 ============================================
77
78 Let's say, we have a input database below: (but note the input database
79 for the program should be in fasta format)
80 1 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
81 2 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD
82 3 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP
83 4 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG
84 5 seq5 DGDSRKVNLGVGAYRTDEGQPWVH
85 6 seq6 PTGTDPTPDEWKQIAAVMKRRC
86 7 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM
87 8 seq8 LPRVDEIATDVDKTPHAWYFQQAGN
88
89
90 3.1 sort
91
92 The sequences are sorted by length, so database is arranged as following:
93 1 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP
94 2 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM
95 3 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG
96 4 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
97 5 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD
98 6 seq8 LPRVDEIATDVDKTPHAWYFQQAGN
99 7 seq5 DGDSRKVNLGVGAYRTDEGQPWVH
100 8 seq6 PTGTDPTPDEWKQIAAVMKRRC
101
102
103 3.2 Adding longest sequence,
104
105 The longest sequence, seq3, is selected as the representative of cluster 1.
106 Also it is decomposed into pentapeptides. For example, this sequence has 2
107 pentapeptides of PPPPP. These short words are added into the index table.
108 so in the entry of PPPPP, the program will record that seq3 has 2 PPPPPs.
109
110 3.3 Processing new sequences
111
112 For every new sequence, the program handles it in following steps.
113
114 3.3.1 according to the threshold and length of the new sequence, calculate
115 the required number of identical pentapeptides.
116 3.3.2 decompose it into pentapeptides.
117 3.3.3 scan the index table, check the entries of pentapeptides the new
118 sequence has. For every old representative, calculate the number
119 of identical pentapeptides between it and the new sequence.
120 3.3.4 If the number of identical pentapeptides between an old representative
121 and the new sequence is larger than the required, they are aligned to
122 calculated their real sequence identity. Otherwise their identity should
123 below the threshold.
124 3.3.5 If the sequence identity with any representative is above the threshold,
125 the new sequence is included into that cluster.
126 3.3.6 IF no conditions in 3.3.5 is true, that means the new sequence is
127 not similar to any old cluster, it becames the representative of a
128 new cluster. And the short words of it are added into the index table.
129
130 3.4 New database.
131
132 This precess from 3.3.1 to 3.3.6 is repeated until all the sequences
133 are processed. All the representatives form the new clustered database.
134
135 If the threshold is 90%, the output database will like
136 cluster1 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP
137 cluster2 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM
138 cluster3 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
139 cluster4 seq8 LPRVDEIATDVDKTPHAWYFQQAGN
140 cluster5 seq6 PTGTDPTPDEWKQIAAVMKRRC
141
142 The following sequences are deleted because
143 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD similar to seq1, cluster3
144 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG similar to seq3, cluster1
145 seq5 DGDSRKVNLGVGAYRTDEGQPWVH similar to seq3, cluster1
146
147
148 =========================================================================
149 Comments to the author
150 Weizhong Li at liwz@sdsc.edu
151 =========================================================================