Welcome to the CD-HIT Project Main Page

News (September 2009): The CD-HIT web server is now available for running cd-hit or downloading pre-calculated clusters.

CD-HIT stands for Cluster Database at High Identity with Tolerance. The program (cd-hit) takes a FASTA-format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition, cd-hit outputs a cluster file that documents the sequence 'groupies' (the member sequences) for each nr representative. The idea is to reduce the overall size of the database without losing significant sequence information, by removing only 'redundant' (highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit produces a set of closely related protein families from a given FASTA sequence database.

CD-HIT uses a 'longest sequence first' list removal algorithm: sequences are processed from longest to shortest, and any sequence whose identity to an already retained representative is above a chosen threshold is removed as redundant. Additionally, the algorithm implements a very fast heuristic to find high-identity segments between sequences, so it can avoid many costly full alignments.
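The 'longest sequence first' idea can be illustrated with a minimal Python sketch. This is not the cd-hit implementation. The function names (identity, shares_short_word, greedy_cluster), the 0.9 threshold, and the word length k=3 are all assumptions made for illustration. The identity measure is a crude stand-in built on Python's difflib rather than a real alignment, and the shared-word check is only a simplified placeholder for the fast heuristic described above.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude stand-in for alignment identity (an assumption, not cd-hit's measure):
    characters in difflib matching blocks divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def shares_short_word(a, b, k=3):
    """Cheap pre-filter (illustrative): do the two sequences share any k-letter word?
    Pairs that share none skip the more expensive identity computation."""
    words = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in words for i in range(len(b) - k + 1))


def greedy_cluster(seqs, threshold=0.9):
    """Cluster sequences longest first.

    seqs maps sequence names to sequence strings. The result maps each
    representative name to the list of names in its cluster, with the
    representative itself listed first.
    """
    # Process sequences from longest to shortest.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    clusters = {}
    for name in order:
        seq = seqs[name]
        for rep in clusters:
            # Join the first representative that passes the filter and the threshold.
            if shares_short_word(seqs[rep], seq) and identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative is similar enough: this sequence starts a new cluster.
            clusters[name] = [name]
    return clusters
```

Calling greedy_cluster on a dictionary of named sequences returns the representatives as keys, which correspond to the nr set, and the members of each cluster as values, which correspond loosely to the information in the cluster file.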

With recent developments, the cd-hit package offers new programs for clustering DNA sequences and for comparing two databases. It also has many new options for controlling clustering.

CD-HIT was originally written by Weizhong Li and is now an open source project!

Bugs

There are a number of outstanding bugs in the current implementation. We are always looking for hard-working and enthusiastic volunteers (people like Luc Ducazu) to shoot these problems down.

Sub Projects

The CD-HIT project provides a number of opportunities for interesting research activities. If one of these sub-projects takes your interest, why not join up and take part? We are especially keen to work closely with bioinformatics MSc students on their projects.

MyCD-HIT. A CD-HIT implementation embedded in a MySQL UDF!
Clustering Benchmarks. Develop and implement benchmarks to test clustering behavior.
CD-HIT CGI. On-line access to the algorithm.

Related Resources

For related resources, please see (or update) the sequence clustering page.

Thanks

Many thanks are due.

Comments and suggestions
