[CD-HIT] clustering nt database
Ryan Golhar
golharam at umdnj.edu
Sun Sep 13 23:32:16 EDT 2009
>> I'm using cd-hit-est because the documentation says that's the only one that
>> works on DNA sequences. The documentation talks about protein sequences for
>> the rest of the programs. Is this not the case?
>
> Right. What a bad memory I have! It's been some years since I used
> cd-hit, and I forgot that it is protein specific.
>
> It may be worth trying to run it anyway... I'd imagine that the k-mer
> analysis is still sound on DNA strings.
>
>
Here is what I am getting when I try to use cd-hit:
[golharam@hydrogen cd-hit-2009-0427]$ ./cd-hit -i /tmp/nt.1000 -o /tmp/nt90 -c 0.9
total seq: 1000
Warning
Some seqs longer than 65536, you may define LONG_SEQ
It is not fatal, but may affect your results !!
longest and shortest : 163353 and 21
Total letters: 13083790
Sequences have been sorted
Fatal Error
in diag_test_aapn, MAX_DIAG reached
Program halted !!
I did define LONG_SEQ in cd-hi.h, but still get the same error. I
suspect the sequences are just too long.
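To check how many sequences are actually over the limit named in the warning, something like the following could be used. It is only a minimal sketch and not part of cd-hit: the function names are made up for illustration, the 65536 threshold is copied from the warning above, and it assumes plain FASTA input with ">" header lines.

```python
import sys

LIMIT = 65536  # threshold quoted in the cd-hit warning (assumed to be the relevant cutoff)


def fasta_lengths(lines):
    """Yield (header, length) for each record in an iterable of FASTA lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


def summarize(lines, limit=LIMIT):
    """Return (total, shortest, longest, n_over_limit) for the records."""
    lengths = [n for _, n in fasta_lengths(lines)]
    if not lengths:
        return 0, 0, 0, 0
    over = sum(1 for n in lengths if n > limit)
    return len(lengths), min(lengths), max(lengths), over


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        total, shortest, longest, over = summarize(handle)
    print(f"total seq: {total}")
    print(f"longest and shortest: {longest} and {shortest}")
    print(f"seqs longer than {LIMIT}: {over}")
```

Saved under a hypothetical name such as fasta_lengths.py, it would be run as "python fasta_lengths.py /tmp/nt.1000". If only a handful of records are over the limit, setting those aside before clustering might be one way to test whether sequence length is what triggers the error.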
Ryan