
                                  CD-HIT

               Cluster Database at High Identity with Tolerance
                  http://bioinformatics.burnham-inst.org/cd-hi

================================================================================
  1.  Summary
  2.  Reference
  3.  Compile
  4.  Usage
  5.  Advanced configuration
================================================================================


 1. Summary
 ==========

 CD-HI/CD-HIT cluster a protein sequence database at a high sequence-identity
 threshold.  These programs can remove high sequence redundancy efficiently.

 Program written by
                                      Weizhong Li
                                      UCSD, San Diego Supercomputer Center
                                      La Jolla, CA, 92093

                                then  The Burnham Institute
                                      La Jolla, CA, 92037
                                      Email liwz@sdsc.edu

                                       
                 at
                                      Adam Godzik's lab
                                      The Burnham Institute
                                      La Jolla, CA, 92037
                                      Email adam@burnham-inst.org

  CD-HI is my first version; CD-HIT is modified from CD-HI. CD-HIT is
  much faster than CD-HI, but users have to tolerate a very small amount
  of redundant sequence in the output database. Since the amount of
  redundancy is so small, I suggest users use CD-HIT for all
  applications. Another reason is that I am only maintaining CD-HIT.

  For your information, below is a performance comparison of CD-HIT and CD-HI.

            CPU time comparison of CD-HIT and CD-HI
     (Clustering NR of Sep 2000 on a Linux machine with 1 GB memory
      and a 1-GHz PIII processor)
================================================================================
                     CD-HIT         vs          CD-HI
Threshold
@90%                 55m            ||          55m
@80%                 29m            ||          30m
@75%                 30m            ||        1130m
@70%                 32m            ||        4151m
@65%                 90m            ||        not tested (estimated 1 week)
@60%                500m            ||
@50%                  5days         ||
*Note: below the 80% threshold, times are for clustering NR90 -> NRxx.


  2. Reference
  ============

  Please cite:

  Weizhong Li, Lukasz Jaroszewski & Adam Godzik
  "Clustering of highly homologous sequences to reduce the
   size of large protein databases",
  Bioinformatics, (2001) 17:282-283

  Weizhong Li, Lukasz Jaroszewski & Adam Godzik
  "Tolerating some redundancy significantly speeds up clustering
   of large protein databases",
  Bioinformatics, (2002) 18:77-82



  3. Compile
  ==========

  Just type "make", or edit the Makefile first if you want.
  After make, two executables, cd-hit and mcd-hit, will appear:
  cd-hit is the program suitable for word sizes > 3, and
  mcd-hit is good for word sizes 2 and 3 (see below to learn about word size).


  4. Usage
  ========

  4.1 options
  -----------
  Type ./cd-hit or ./mcd-hit to print options

      -i filename of input database in FASTA format, required!
      -o filename of output database, required!
      -c cluster identity threshold, default => 0.9
      -b max allowed gap length for alignment, default => 20
      -M available memory of your computer in MB, default => 400
      -n word size, default => 5 (see next section)
      -l length below which sequences are thrown away, default => 10
      -d length of description line in the .clstr file, default => 20
      -t tolerance for redundancy, default => 2 (see the papers, or ignore it)
      -u filename of an old dbname.clstr, for incremental update;
         if clustering an old NR at 90% yielded NR90 and NR90.clstr,
         then to cluster a newer, larger NR at 90%, use -u NR90.clstr
      -h print this help

    Example:
      cd-hit -i nr -o nr90 -M 480 -n 5
        cluster nr at the default 90% threshold, assuming the computer
        has 480M of memory.

      cd-hit -i pdbaa -o pdbaa80 -c 0.8
        cluster pdbaa at the 80% threshold with the default word size of 5.

      mcd-hit -i swiss_prot -o swiss_prot_75 -c 0.75 -n 3
        cluster swiss_prot at the 75% threshold with a word size of 3;
        here, mcd-hit is used instead of cd-hit.


  4.2 Optimized parameters
  ------------------------

  Threshold       parameters for CD-HIT           parameters for CD-HI

  90%             cd-hit -n 5 -c 0.90             cd-hi -n 5 -c 0.90
  85%             cd-hit -n 5 -c 0.85             cd-hi -n 5 -c 0.85
  80%             cd-hit -n 5 -c 0.80             cd-hi -n 4 -c 0.80
  75%             cd-hit -n 5 -c 0.75            mcd-hi -n 3 -c 0.75
  70%             cd-hit -n 5 -c 0.70            mcd-hi -n 3 -c 0.70
  65%             cd-hit -n 4 -c 0.65            mcd-hi -n 2 -c 0.65
  60%             cd-hit -n 4 -c 0.60
  55%             cd-hit -n 4 -c 0.55 or
                 mcd-hit -n 3 -c 0.55
  50%            mcd-hit -n 3 -c 0.50
  45%            mcd-hit -n 2 -c 0.45
  40%            mcd-hit -n 2 -c 0.40
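  The word size (-n) controls the short-word filter described in the papers:
  two sequences at identity p must share a minimum number of identical words
  of length n, so most candidate pairs can be rejected by cheap word counting
  before any alignment is attempted. A hedged Python sketch of such a filter
  follows; the bound used here (a sequence of length L has L-n+1 overlapping
  n-mers, and each mismatch destroys at most n of them) is the textbook form,
  and CD-HIT's exact bookkeeping may differ:

```python
def words(seq, n):
    """All distinct overlapping words (n-mers) in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def shares_enough_words(a, b, n, threshold):
    """Short-word filter: two sequences of length L at identity p must
    share at least L - n + 1 - n * floor(L * (1 - p)) common n-mers,
    so a pair sharing fewer words is rejected without alignment."""
    L = min(len(a), len(b))
    mismatches = int(L * (1 - threshold))
    required = max(1, L - n + 1 - n * mismatches)
    return len(words(a, n) & words(b, n)) >= required
```

  A larger word size makes the filter cheaper and more selective, but the
  bound above becomes vacuous at low thresholds, which is why low-threshold
  clustering needs the smaller word sizes listed in the table.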


  4.3 A trick for thresholds < 80%
  --------------------------------
  A two-step CD-HI or CD-HIT run can cluster a database faster than a
  single run. For example, to cluster NR at 65% with CD-HIT,

  cd-hit -i NR -o NR90 -n 5 -c 0.9   followed by
  cd-hit -i NR90 -o NR65 -n 4 -c 0.65
    is faster than the single run
  cd-hit -i NR -o NR65 -n 4 -c 0.65
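  The saving comes from the fact that the expensive low-threshold pass only
  has to scan the NR90 representatives, a much smaller database, while the
  cheap high-threshold pass absorbs the bulk of the sequences (compare the
  per-threshold times in the table of section 1). A toy cost model makes the
  trade-off concrete; all sizes and relative costs below are illustrative
  assumptions, not measurements:

```python
def pass_cost(n_input, n_reps, per_comparison):
    """Toy model of one greedy pass: every input sequence is checked
    against roughly n_reps cluster representatives, and each check
    costs per_comparison units (low thresholds cost more because the
    short-word filter rejects far fewer pairs there)."""
    return n_input * n_reps * per_comparison

# Illustrative sizes and relative costs (assumptions, not measured):
N, N90, N65 = 500_000, 200_000, 80_000   # NR, NR90, NR65 sizes
CHEAP, EXPENSIVE = 1, 20                 # relative cost at 90% vs 65%

one_step = pass_cost(N, N65, EXPENSIVE)
two_step = pass_cost(N, N90, CHEAP) + pass_cost(N90, N65, EXPENSIVE)
assert two_step < one_step
```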


  4.4 Incremental update
  ----------------------
  If you clustered NR at 90% last week using
  "cd-hit -i NR_last_week -o NR_last_week_90 -n 5 -c 0.9",
  then for this week's NR you can use
  "cd-hit -i NR_this_week -o NR_this_week_90 -n 5 -c 0.9 -u NR_last_week_90.clstr".

  5. Advanced configuration
  =========================
  There are several macro definitions in "cd-hi.h"; you may
  re-define them to fit your work.

  #define MAX_AA 23
    Size of the amino-acid alphabet.
    -- You don't need to change it.
  
  #define MAX_UAA 21
    Number of unique amino-acid letters.
    -- You don't need to change it.
  
  #define MAX_SEQ 65536
    Max length of sequences. Currently, the longest sequence in NR is
    about 27000, but that may change in the future. If the longest
    sequence is above 65535, please define LONG_SEQ; see the comments
    in the source code.
    -- Feel free to re-define it.
  
  #define MAX_DIAG 133000
    Used in dynamic programming; this number should be twice MAX_SEQ.
    -- Change it whenever you change MAX_SEQ.
  
  #define MAX_DES 300000
    Max length of the description line of each sequence. Some
    descriptions may be longer than 60000 characters.
    -- You don't need to change it.
  
  #define MAX_GAP 65536
    Max allowed gap length in the dynamic programming subroutine.
    The gap length given with option -b must be smaller than this.
    -- It is up to you.
  
  #define MAX_LINE_SIZE 300000
    Max allowed length of a single line in the input FASTA file.
    -- Feel free to re-define it.
  
  #define MAX_FILE_NAME 1280
    Max allowed length of a filename.
    -- Feel free to re-define it.
  
  #define MAX_SEG 50
    For a large database, the program divides it into several parts;
    this is the max allowed number of parts.
    -- Feel free to re-define it.


