
                                  CD-HI

                       Cluster Database at High Identity
                  http://bioinformatics.burnham-inst.org/cd-hi

================================================================================
  1.  Summary
  2.  Reference
  3.  Compile
  4.  Usage
  5.  Advanced configuration
================================================================================

 1. Summary
 ==========

 CD-HI clusters a protein sequence database at a high sequence identity
 threshold. The program removes highly redundant sequences efficiently.
 Here, high identity means 70% and up.

 program written by
                                      Weizhong Li
                                      UCSD, San Diego Supercomputer Center
                                      La Jolla, CA, 92093
                                      Email liwz@sdsc.edu

                 at
                                      Adam Godzik's lab
                                      The Burnham Institute
                                      La Jolla, CA, 92037
                                      Email adam@burnham-inst.org


  2. Reference
  ============

  Please cite:
  Weizhong Li, Lukasz Jaroszewski & Adam Godzik
  "Clustering of highly homologous sequences to reduce the 
   size of large protein database",
  Bioinformatics, (2001) 17:282-283


  3. Compile
  ==========

  Two programs are distributed, cd-hi.c++ and mcd-hi.c++. The latter is a
  modified version of cd-hi.c++ that is better suited to clustering large
  databases (like SWISS-PROT and NR) with a short word length of 2 or 3
  (see below).

  CD-HI needs no complicated Makefile, just a single command. I tested this
  program on Redhat Linux with the g++ compiler:

  Compile:
      g++ -o cd-hi -O cd-hi.c++   or
      g++ -o mcd-hi -O mcd-hi.c++

  !! Note: the -O option (or -O2, -O3) makes the program 2-3 times faster!
  Please check your compiler's documentation for -O or equivalent options.

  
  4. Usage
  ========

    cd-hi [options]

    Options:
      -i filename of input database in fasta format, required!

      -o filename of output database, required!

      -c cluster identity threshold, default => 0.9

      -b max allowed gap length for alignment, default => 20

      -M available memory of your computer in megabytes, default => 400

      -n word size, default => 4,
              The longer the word size, the faster the program runs,
              but the word size is restricted by the cluster threshold:
              threshold       allowed word size       good word size
              >=85%           3,4,5                   5
              >=80%           3,4                     4
              >=75%           3                       3
              >=70%           2,3                     3
              >=60%           2                       2
              if you are clustering nr at 90% use -n 5
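
      The table above can be read as a lookup from threshold to word size.
      A minimal C++ sketch, assuming the threshold is given as a fraction
      like the -c option. The names allowed_word_sizes and good_word_size
      are hypothetical helpers for illustration and are not part of cd-hi:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of cd-hi): word sizes the README table
// allows for a cluster threshold c, given as a fraction (0.9 means 90%).
std::vector<int> allowed_word_sizes(double c) {
    assert(c >= 0.0 && c <= 1.0);  // -c is a fraction, not a percentage
    if (c >= 0.85) return {3, 4, 5};
    if (c >= 0.80) return {3, 4};
    if (c >= 0.75) return {3};
    if (c >= 0.70) return {2, 3};
    if (c >= 0.60) return {2};
    return {};  // thresholds below 60% are not covered by the table
}

// Hypothetical helper: the "good word size" column of the same table.
// Returns 0 when the threshold is not covered.
int good_word_size(double c) {
    assert(c >= 0.0 && c <= 1.0);
    if (c >= 0.85) return 5;
    if (c >= 0.80) return 4;
    if (c >= 0.70) return 3;  // the >=75% and >=70% rows both list 3
    if (c >= 0.60) return 2;
    return 0;
}
```

      For example, good_word_size(0.9) returns 5, which matches the advice
      to use -n 5 when clustering nr at 90%.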

    Example:
      cd-hi -i nr -o nr90 -M 480 -n 5
        cluster nr at the default 90% threshold, assuming the computer has
        480M of memory.

      cd-hi -i pdbaa -o pdbaa80 -c 0.8
        cluster pdbaa at 80% using the default word length of 4

      mcd-hi -i swiss_prot -o swiss_prot_75 -c 0.75 -n 3
        cluster swiss_prot at 75% using word length of 3,
        here, mcd-hi is used instead of cd-hi

  5. Advanced configuration
  =========================

  There are several macro definitions at the top of the program; you may
  re-define them to fit your work.

  #define MAX_AA 23
    Max size of the amino acid alphabet.
    -- You don't need to change it.

  #define MAX_UAA 21
    Max number of unique letters in the amino acid alphabet.
    -- You don't need to change it.

  #define MAX_SEQ 30000
    Max length of sequences. Currently, the longest sequence in NR is
    about 27000. It may change in the future. If the longest sequence is
    above 65535, please define LONG_SEQ; see comments in the source code.
    -- Feel free to re-define it.

  #define MAX_DIAG 60000
    Used in dynamic programming; this number should be twice MAX_SEQ.
    -- Change it whenever you change MAX_SEQ.
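
    One way to keep the two values in step is to derive MAX_DIAG from
    MAX_SEQ instead of editing both by hand. This is only a sketch: the
    distributed source defines both as plain literals, and the #error
    guard on LONG_SEQ is an assumption about how you might want to be
    reminded, not something the program does itself:

```cpp
// Sketch only: the shipped source sets both values as plain literals.
#define MAX_SEQ 30000

// Derive MAX_DIAG so that it always stays at twice MAX_SEQ.
#define MAX_DIAG (2 * MAX_SEQ)

// Assumption: treat a MAX_SEQ above 65535 without LONG_SEQ as a build error.
#if MAX_SEQ > 65535 && !defined(LONG_SEQ)
#error "MAX_SEQ is above 65535: define LONG_SEQ (see comments in the source code)"
#endif

// Guard in case someone later replaces the derived value with a literal.
static_assert(MAX_DIAG == 2 * MAX_SEQ, "MAX_DIAG must be twice MAX_SEQ");
```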

  #define MAX_DES 60000
    Max length of the description line of each sequence. Some descriptions
    may be longer than 60000, but
    -- You don't need to change it.

  #define MAX_GAP 3000
    Max allowed gap length in the dynamic programming subroutine.
    The gap length set by the user with option -b should be smaller than this.
    -- It is up to you.

  #define MAX_LINE_SIZE 60000
    Max allowed length of a single line from the input FASTA file.
    -- Feel free to re-define it.

  #define MAX_FILE_NAME 1280
    Max allowed length of a filename.
    -- Feel free to re-define it.

  #define SEQ_NO 600000
    Max allowed No. of sequences. On Sep 20, 2000, nr had about
    560,000 sequences.
    -- Feel free to re-define it.

  #define MAX_SEG 50
    For a large database, the program divides it into several parts;
    this number is the max allowed No. of parts.
    -- Feel free to re-define it.

  #define BUFFER 100000
    The size of the buffer used by the program.
    -- You don't need to change it.
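
  To decide whether any of the limits above need raising for your database,
  you can measure the input first. The following is a rough C++ sketch and
  is not part of the distribution: FastaStats, scan_fasta, scan_fasta_text
  and fits_default_limits are hypothetical names, the limits are the default
  values listed in this section, and the strict "smaller than" comparison is
  a conservative assumption about how the program uses them:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical summary of a FASTA file (not a cd-hi data structure).
struct FastaStats {
    std::size_t num_seqs = 0;      // compare with SEQ_NO
    std::size_t longest_seq = 0;   // compare with MAX_SEQ
    std::size_t longest_des = 0;   // compare with MAX_DES
    std::size_t longest_line = 0;  // compare with MAX_LINE_SIZE
};

// Read FASTA text line by line. Lines starting with '>' are descriptions;
// all other lines are counted as residues of the current sequence
// (whitespace inside those lines is counted too, so this may overestimate).
FastaStats scan_fasta(std::istream& in) {
    FastaStats s;
    std::size_t cur = 0;  // length of the sequence being read
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.size() > s.longest_line) s.longest_line = line.size();
        if (!line.empty() && line[0] == '>') {
            if (cur > s.longest_seq) s.longest_seq = cur;
            cur = 0;
            ++s.num_seqs;
            if (line.size() > s.longest_des) s.longest_des = line.size();
        } else {
            cur += line.size();
        }
    }
    if (cur > s.longest_seq) s.longest_seq = cur;
    // Sanity check: a description line is itself a line of the file.
    assert(s.longest_des <= s.longest_line);
    return s;
}

// Convenience wrapper for in-memory text.
FastaStats scan_fasta_text(const std::string& text) {
    std::istringstream in(text);
    return scan_fasta(in);
}

// Defaults copied from this section; strict '<' is an assumption.
bool fits_default_limits(const FastaStats& s) {
    return s.num_seqs < 600000 && s.longest_seq < 30000 &&
           s.longest_des < 60000 && s.longest_line < 60000;
}
```

  To scan a file, open it with std::ifstream and pass the stream to
  scan_fasta. If fits_default_limits returns false, look at which field is
  too large and re-define the matching macro before compiling.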


