
                  CD-HI, Program algorithm and implementation
                  ===========================================

This is the detailed algorithm of cd-hi. The reference is:

  Clustering of highly homologous sequences to reduce
  the size of large protein databases by 
  Weizhong Li, Lukasz Jaroszewski & Adam Godzik
  Bioinformatics (2000) accepted

Please first read above paper for background of this program.


1. Greedy incremental algorithm to cluster sequence databases
=============================================================

  This algorithm has been used to create the PDB_select and nrdb90 by
Holm et al. First, sequences are sorted in order of decreasing length.
The longest one becomes the representative of the first cluster. 
Then each remaining sequence is compared to the existing representatives. 
If the sequence similarity with any representative is above a given threshold, 
it is included into that cluster. Otherwise a new cluster starts 
with it as representative.


2. short words filter and index table
==============================

  Here, we use pentapeptides as example of short words. 
  
2.1 decomposition of sequences

  The sequences can be decomposed into short words, for example,
ACDEFGHIKLMN has these 
ACDEF CDEFG DEFGH EFGHI FGHIK GHIKL HIKLM and IKLMN 9 words

2.2 short word filter 

  If two proteins share certain sequence identity, they should have 
at least a certain number of identical pentapeptide.  For example, 
two sequences having 85% identical residues over a 100-residue
window will have at least 25 pentapeptides.

  So for a sequence of certain length, a threshold of sequence identity
is related with a threshold of the number of identical pentapeptide.

  Before two sequences are aligned, they are first decomposed into
pentapeptides, if their identical pentapeptide is lower than required
number, they don't need to be aligned.

2.3 index table

  For pentapeptide, since each residue can have 21 possibilities, 
there are will be 21x21x21x21x21 (about 4 millon) of them.
An index table with each entry representing a unique pentapeptide
is generated, Each entry in the index table holds numbers of the 
representative sequences containing this pentapeptide.

So an index table is like
pentapeptide	content
index
-------------------------------------------------------------------------
AAAAA	|	sequence #2 has 1, sequence #20 has 2, and so on.
AAAAR	|	sequence #15 has 2, sequence #22 has 1, and so on.
...	|	...
...	|	...
...	|	...
WHCKL	|	no sequence contains this pentapeptide.
...	|	...
...	|	...
-------------------------------------------------------------------------


3. detailed implementation of this algorithm
============================================

  Let's say, we have a input database below: (but note the input database
for the program should be in fasta format)
1 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
2 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD
3 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP
4 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG
5 seq5 DGDSRKVNLGVGAYRTDEGQPWVH
6 seq6 PTGTDPTPDEWKQIAAVMKRRC
7 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM
8 seq8 LPRVDEIATDVDKTPHAWYFQQAGN


3.1 sort

  The sequences are sorted by length, so database is arranged as following:
1 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP
2 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM
3 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG
4 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
5 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD
6 seq8 LPRVDEIATDVDKTPHAWYFQQAGN
7 seq5 DGDSRKVNLGVGAYRTDEGQPWVH
8 seq6 PTGTDPTPDEWKQIAAVMKRRC


3.2 Adding longest sequence,

  The longest sequence, seq3, is selected as the representative of cluster 1.
Also it is decomposed into pentapeptides. For example, this sequence has 2
pentapeptides of PPPPP. These short words are added into the index table.
so in the entry of PPPPP, the program will record that seq3 has 2 PPPPPs.

3.3 Processing new sequences

  For every new sequence, the program handles it in following steps.

3.3.1 according to the threshold and length of the new sequence, calculate 
      the required number of identical pentapeptides.
3.3.2 decompose it into pentapeptides.
3.3.3 scan the index table, check the entries of pentapeptides the new
      sequence has. For every old representative,  calculate the number 
      of identical pentapeptides between it and the new sequence. 
3.3.4 If the number of identical pentapeptides between an old representative
      and the new sequence is larger than the required, they are aligned to
      calculated their real sequence identity. Otherwise their identity should
      below the threshold.
3.3.5 If the sequence identity with any representative is above the threshold,
      the new sequence is included into that cluster. 
3.3.6 IF no conditions in 3.3.5 is true, that means the new sequence is
      not similar to any old cluster, it becames the representative of a
      new cluster. And the short words of it are added into the index table.

3.4 New database.

  This precess from 3.3.1 to 3.3.6 is repeated until all the sequences
are processed. All the representatives form the new clustered database.

If the threshold is 90%, the output database will like
cluster1 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP
cluster2 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM
cluster3 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
cluster4 seq8 LPRVDEIATDVDKTPHAWYFQQAGN
cluster5 seq6 PTGTDPTPDEWKQIAAVMKRRC

The following sequences are deleted because
seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD        similar to seq1, cluster3
seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG similar to seq3, cluster1
seq5 DGDSRKVNLGVGAYRTDEGQPWVH              similar to seq3, cluster1


=========================================================================
                                             Comments to the author
                                             Weizhong Li at liwz@sdsc.edu
=========================================================================
