[BiO BB] redundant data
dmb at mrc-dunn.cam.ac.uk
Fri Jan 9 06:28:03 EST 2004
> hi everybody,
> i have a set for 200 sequences where the sequence similarity varies between
> 28-90%. i want to select a representative set from this bigger set so that i pick
> up sequences which are representative of the whole set. ie from this bigger set i
> want to remove the sequences that are very similar and represent them by just a
> single sequence. ie i want to have a non redundant set. can anyone please tell how
> thanx in advance
One of the very quickest (and also easiest) ways to do this is using the excellent
program cd-hit ...
It should run in a couple of seconds for 200 sequences.
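For what it is worth, here is a minimal sketch of how such a run might be scripted from Python. The flags -i, -o, -c and -n are cd-hit's input, output, identity threshold and word size options. The file names, the function names and the executable name "cd-hit" are assumptions and may differ on your install. The word-size ranges follow the suggestions in the cd-hit user guide.

```python
import subprocess


def word_size(identity):
    """Pick a cd-hit word size (-n) suited to the identity threshold (-c)."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not cluster below 40% identity")


def cdhit_command(fasta_in, out_prefix, identity=0.9, exe="cd-hit"):
    """Build the argument list for one cd-hit run (nothing is executed here)."""
    return [
        exe,
        "-i", fasta_in,        # input FASTA file (hypothetical name)
        "-o", out_prefix,      # output prefix; clusters go to <prefix>.clstr
        "-c", str(identity),   # sequence identity threshold
        "-n", str(word_size(identity)),
    ]


def run_cdhit(fasta_in, out_prefix, identity=0.9, exe="cd-hit"):
    """Run cd-hit and raise if it exits with an error."""
    subprocess.run(cdhit_command(fasta_in, out_prefix, identity, exe), check=True)
```

Calling run_cdhit("seqs.fasta", "nr90", 0.9) would then leave the representative sequences in nr90 and the cluster listing in nr90.clstr.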
I have some Perl scripts to parse the output into MySQL (tab delimited) for easy
cluster analysis, if you would like them.
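As a rough illustration of that kind of parsing (this is not the Perl scripts mentioned above, just a sketch in Python), the following turns a cd-hit .clstr listing into tab-delimited rows. It assumes the usual .clstr layout of ">Cluster N" headers followed by member lines, where "*" marks the representative. The function names and the column order are my own choices.

```python
import re

# Member lines are assumed to look like:
#   "0\t2799aa, >seqA... *"           (cluster representative)
#   "1\t2214aa, >seqB... at 80.00%"   (member, with identity to the representative)
MEMBER = re.compile(r"^(\d+)\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.*)$")


def parse_clstr(lines):
    """Return rows of (cluster, name, length, is_representative, identity)."""
    rows = []
    cluster = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match is None or cluster is None:
            continue  # skip blank or unrecognised lines
        _, length, name, tail = match.groups()
        is_rep = tail == "*"
        # Keep only the trailing percentage, e.g. "at 80.00%" -> "80.00"
        identity = "" if is_rep else tail.split()[-1].rstrip("%").split("/")[-1]
        rows.append((cluster, name, int(length), int(is_rep), identity))
    return rows


def to_tsv(rows):
    """Join rows into tab-delimited text, one sequence per line."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)
```

The resulting text can be written to a file and bulk-loaded into a database table with matching columns.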
There are a couple of small problems with this software which the author is aware of
but is too busy to fix. It would be nice to make this a project to develop the
Alternatively you can use blastclust, which does what its name suggests, but has an
extra 'coverage' parameter which is not explicitly present in cd-hit. It is slower,
but on 200 sequences it will still finish in around 1 min. Also, blastclust allows an
arbitrary sequence identity threshold for clustering, whereas cd-hit is limited to a
minimum of 40% identity.
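To show what the identity and coverage thresholds do, here is a small sketch of single-linkage clustering over pairwise hits. It is not blastclust's implementation. The hit tuples (name, name, percent identity, fraction of length covered) and the function name are hypothetical, and coverage is treated as a single number per pair for simplicity.

```python
def single_linkage(names, hits, min_identity, min_coverage):
    """Group sequences linked by any hit that passes both thresholds."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example: B matches A over most of its length, but C over only 30%.
names = ["A", "B", "C"]
hits = [("A", "B", 95.0, 0.9), ("B", "C", 92.0, 0.3)]
no_coverage_filter = single_linkage(names, hits, 90.0, 0.0)
with_coverage_filter = single_linkage(names, hits, 90.0, 0.8)
```

Without the coverage filter all three sequences end up in one cluster. With a coverage cutoff of 0.8 the short B-C match is ignored and C stays on its own.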
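On bigger sequence sets (>5,000) the fundamental differences between blastclust and
cd-hit (blastclust runs an all-against-all BLAST comparison, whereas cd-hit uses a
short-word filter to skip most pairwise alignments) make cd-hit a good choice.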
With all sequence clustering algorithms you have to worry about 'the domain
problem', but I am not sure which technique currently deals with this the best. I
know of one algorithm (DIVCLUS) which was explicitly designed to handle this
Park J, Teichmann SA.
DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains
in single- and multi-domain proteins. Bioinformatics. 1998;14(2):144-50.