[CD-HIT] CD-HIT - Unusual Clustering

Thu Nov 11 05:17:31 EST 2010

Hi Rob,

No, I would not say that this result is strictly 'expected', but if you
look at the name of CD-HIT, the T signifies Tolerance, i.e. the
heuristic that CD-HIT uses can occasionally split sequences with
identity greater than the threshold into separate clusters (that is
the trade-off you get for the incredible speed at which the heuristic
can generate clusters without doing pairwise alignments of all
sequences in the database).

http://www.ncbi.nlm.nih.gov/pubmed/11836214
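
To make that trade-off a bit more concrete, here is a minimal Python
sketch of greedy incremental clustering with a short-word filter. It
is a toy under stated assumptions, not CD-HIT's actual code: the
function names, the ungapped identity measure, and the k / min_shared
parameters are all hypothetical and chosen only for illustration.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Toy ungapped identity: matching positions / length of the shorter sequence.

    Assumption: a real tool would use a proper alignment here.
    """
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.9, k=3, min_shared=2):
    """Greedy incremental clustering, longest sequence first (illustrative only)."""
    clusters = []  # each entry: [representative, member, member, ...]
    for seq in sorted(seqs, key=len, reverse=True):
        words = kmers(seq, k)
        for cluster in clusters:
            rep = cluster[0]
            # Cheap word filter: skip the costly comparison when too few
            # words are shared with this representative.
            if len(words & kmers(rep, k)) < min_shared:
                continue
            if identity(rep, seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No representative accepted this sequence: it starts a new cluster.
            clusters.append([seq])
    return clusters
```

In this toy, greedy_cluster(["MKTAYIAKQR", "MKTAYIAKQ"]) puts both
sequences in one cluster, while the same call with an overly strict
filter such as min_shared=100 splits them, because the filter rejects
the pair before any identity is computed. Whether something similar
explains the example in this thread is not something the sketch can
tell us.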

I've included the cd-hit mailing list address in the reply to this
email. There may be people on that list who can give a much better
explanation than I can, and who can look into the example you
provided in more detail (I just help run the project page on
Bioinformatics.Org).

Thanks for providing feedback on CD-HIT!

All the best,
Dan.

On 10 November 2010 07:38, Rob Syme <rob.syme at gmail.com> wrote:
> Hi,
>
> I'm looking to cluster proteins from three fungal genomes. I've come across
> a curious result where two very similar sequences (attached) are not
> clustered:
>
> cd-hit -i Curious.fasta -o Curious.clusters
> cd-hit -i Curious.fasta -o Curious.clusters -c 0.7
>
> Both commands split the two sequences into two clusters, even though the
> alignment covers 93% of the longest protein with perfect identity.
>
> Is this the expected behaviour for CD-HIT?
> -r
>
> Rob Syme
> PhD Student
> ACNFP, Curtin University
> Western Australia