[CD-HIT] clustering nt database

Sun Sep 13 23:36:35 EDT 2009

2009/9/14 Ryan Golhar <golharam at umdnj.edu>:
>
>>> I'm using cd-hit-est because the documentation says that's the only one
>>> that works on DNA sequences.  The documentation talks about protein
>>> sequences for the rest of the programs.  Is this not the case?
>>
>> Right. What a bad memory I have! It's been some years since I used
>> cd-hit, and I forgot that it is protein specific.
>>
>> It may be worth trying to run it anyway... I'd imagine that the k-mer
>> analysis is still sound on DNA strings.
>>
>>
>
> Here is what I am getting when I try to use cd-hit:
>
> [golharam@hydrogen cd-hit-2009-0427]$ ./cd-hit -i /tmp/nt.1000 -o /tmp/nt90 -c 0.9
> total seq: 1000
>
> Warning
> Some seqs longer than 65536, you may define LONG_SEQ
>
> It is not fatal, but may affect your results !!
>
> longest and shortest : 163353 and 21
> Total letters: 13083790
> Sequences have been sorted
>
> Fatal Error
> in diag_test_aapn, MAX_DIAG reached
>
> Program halted !!
>
> I did define LONG_SEQ in cd-hi.h, but still get the same error.  I suspect
> the sequences are just too long.

I guess this could be a problem. It's not a general solution, but IIRC
only a tiny fraction of protein sequences are longer than 65536
residues. Is this true of your data? Would it be hard to remove the
long ones?

If you have BioPerl installed, filtering by length should be
straightforward. However, if your dataset consists mostly of large
nucleotide sequences, I guess this isn't an option.
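
For illustration, here is a minimal sketch of that length filter. It
uses plain Python rather than BioPerl (a substitution on my part, not
something from this thread), and the names read_fasta and
filter_by_length are hypothetical helpers, not part of cd-hit or any
library. The 65536 threshold is taken from the cd-hit warning quoted
above. The function also reports how many sequences were kept and
dropped, which should show whether the long sequences really are only a
tiny fraction of the data.

```python
MAX_LEN = 65536  # threshold taken from the cd-hit warning quoted above


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_handle, out_handle, max_len=MAX_LEN, width=60):
    """Copy records no longer than max_len; return (kept, dropped) counts."""
    kept = dropped = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) > max_len:
            dropped += 1
            continue
        out_handle.write(">" + header + "\n")
        # Re-wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")
        kept += 1
    return kept, dropped
```

To use it, open the input and output files as text and call
filter_by_length(fin, fout); the paths are whatever you pass in, for
example the /tmp/nt.1000 file from the command above. If most of the
records end up in the dropped count, filtering is probably not the right
approach for this dataset.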

Dan.

> Ryan
>