Revision: 1.1.1.1 (vendor branch)
Committed: Sat Feb 7 10:55:46 2004 UTC by dmb
Branch: main, MAIN
CVS Tags: start, HEAD
Changes since 1.1: +0 -0 lines
Log Message:
First import

CD-HIT

Cluster Database at High Identity with Tolerance
http://bioinformatics.burnham-inst.org/cd-hi

================================================================================
This program is modified from CD-HI; you may want to read algorithm.cd-hi first.
================================================================================

The basic filter system of CD-HI states:

"If two proteins share a certain sequence identity, they should have
at least a certain number of identical pentapeptides. For example,
two sequences having 85% identical residues over a 100-residue
window will have at least 25 identical pentapeptides."

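The arithmetic behind the quoted figure can be sketched as follows. This is
only an illustration, not code from CD-HI: the function name and its
parameters are hypothetical, and it assumes an ungapped window in which each
mismatched residue can destroy at most k overlapping k-mers. The simple count
(window length minus k times the number of mismatches) reproduces the quoted
25. Counting only the k-mers that lie entirely inside an isolated window
gives a slightly lower bound.

```python
def min_shared_kmers(window: int, identical: int, k: int = 5,
                     strict: bool = False) -> int:
    """Worst-case number of identical k-mers in an ungapped window.

    window    -- number of aligned residues in the window
    identical -- number of identical residues in the window
    strict    -- if True, count only the (window - k + 1) k-mers that lie
                 entirely inside the window; if False, use the simpler
                 approximation of `window` k-mers (an assumption made here
                 to match the figure quoted in the text)
    """
    mismatches = window - identical
    total = window - k + 1 if strict else window
    # Each mismatch lies in at most k overlapping k-mers.
    return max(0, total - k * mismatches)


print(min_shared_kmers(100, 85))               # 25
print(min_shared_kmers(100, 85, strict=True))  # 21
print(min_shared_kmers(100, 80))               # 0
```

At 80% identity the bound already drops to zero, so below that level a
pentapeptide filter of this kind gives no strict guarantee.
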
Theoretically, two sequences with 80% identity need not share a single
identical pentapeptide. They can differ at every fifth residue, like

MSHHWGYGKHNGPEMWHKDFPIAKGERQS....
MSHH GYGK NGPE WHKD PIAK ERQS....
MSHHcGYGKdNGPEhWHKDiPIAKtERQS....

But this is very rare in real-world alignments. Even when an alignment is
at 60% identity, there are generally still some identical pentapeptides.
This is the basis of CD-HIT.

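The spaced-mismatch example above can be checked directly. The sketch below
is illustrative only and is not CD-HIT's implementation: the helper names are
hypothetical, the two sequences are the visible parts of the example (the
trailing "...." is left out), and shared words are counted as a simple
multiset intersection of k-mers.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 5) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int = 5) -> int:
    """Number of k-mers common to a and b (multiset intersection)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "MSHHWGYGKHNGPEMWHKDFPIAKGERQS"
# Lowercase letters mark the substituted residues in the example.
seq_b = "MSHHcGYGKdNGPEhWHKDiPIAKtERQS".upper()

print(f"identity: {identity(seq_a, seq_b):.1%}")               # identity: 82.8%
print("shared pentapeptides:", shared_kmers(seq_a, seq_b, 5))  # shared pentapeptides: 0
```

The visible fragment comes out slightly above 80% identity because it ends on
a matching block, yet the two sequences share no pentapeptide at all.
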
CD-HIT is based on statistical analysis of a large number of alignments.
It speeds up the program without losing much clustering quality.
