Hi Rob,<div><br></div><div>We have just made a development release at: <a href="http://cdhit.google.com">http://cdhit.google.com</a>. This release includes several improvements to the filtering process, the alignment band search, and the local band alignment computation. It should be more sensitive than previous releases. Perhaps you could try this release.</div>

<div><br></div><div>Best regards,</div><div><br></div><div>Limin</div><div><br></div><div><br><br><div class="gmail_quote">On Thu, Nov 11, 2010 at 2:17 AM, Dan Bolser <span dir="ltr">&lt;<a href="mailto:dmb@bioinformatics.org">dmb@bioinformatics.org</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi Rob,<br>

<br>

No, I would not say that this result is strictly 'expected', but if you<br>

look at the name of CD-HIT, the T signifies Tolerance, i.e. the<br>

heuristic that CD-HIT uses can, on rare occasions, split sequences with identity<br>

greater than the threshold into separate clusters (that is the<br>

trade-off you get for the incredible speed at which the heuristic can<br>

generate clusters without doing pairwise alignments of all sequences<br>

in the database).<br>

<br>

<a href="http://www.ncbi.nlm.nih.gov/pubmed/11836214" target="_blank">http://www.ncbi.nlm.nih.gov/pubmed/11836214</a><br>

<br>

<br>

I've included the cd-hit mailing list address in the reply to this<br>

email. There may be people on that list who can give a much better<br>

explanation than I can, and who can look into the example you<br>

provided in more detail (I just help run the project page on<br>

Bioinformatics.Org).<br>

<br>

<br>

Thanks for providing feedback on CD-HIT!<br>

<br>

All the best,<br>

Dan.<br>

<br>

<br>

On 10 November 2010 07:38, Rob Syme &lt;<a href="mailto:rob.syme@gmail.com">rob.syme@gmail.com</a>&gt; wrote:<br>

> Hi,<br>

><br>

> I'm looking to cluster proteins from three fungal genomes. I've come across<br>

> a curious result where two very similar sequences (attached) are not<br>

> clustered:<br>

><br>

> cd-hit -i Curious.fasta -o Curious.clusters<br>

> cd-hit -i Curious.fasta -o Curious.clusters -c 0.7<br>

><br>

> Both commands split the two sequences into two clusters, even though the<br>

> alignment covers 93% of the longest protein with perfect identity.<br>

><br>

> Is this the expected behaviour for CD-HIT?<br>

> -r<br>

><br>

> Rob Syme<br>

> PhD Student<br>

> ACNFP, Curtin University<br>

> Western Australia<br>

><br>

><br>

><br>

<br>


</blockquote></div><br></div>