[BiO BB] redundant data
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Fri Jan 9 06:28:03 EST 2004
++ Pankaj--
> hi everybody,
> i have a set for 200 sequences where the sequence similarity varies between
> 28-90%. i want to select a representative set from this bigger set so that i pick
> up sequences which are representative of the whole set. ie from this bigger set i
> want to remove the sequences that are very similar and represent them by just a
> single sequence. ie i want to have a non redundant set. can anyone please tell how
> thanx in advance
> pankaj
One of the quickest (and easiest) ways to do this is with the excellent program
cd-hit:
http://bioinformatics.ljcrf.edu/cd-hi/
It should run in a couple of seconds for 200 sequences.
I have some perl scripts to parse the output into mysql (tab delimited) for easy
cluster analysis if you like.
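
For anyone who wants to roll their own, here is a minimal Python sketch of that kind of parser (not my perl scripts). It assumes the usual cd-hit .clstr layout, i.e. a ">Cluster N" header followed by member lines such as "0<TAB>350aa, >seqA... *", where "*" marks the representative. The function names and the example sequence names are made up for illustration.

```python
import re

# Assumed .clstr member line layout:
#   "<index>\t<length>aa, ><name>... *"          (cluster representative)
#   "<index>\t<length>aa, ><name>... at NN.NN%"  (other members)
MEMBER_RE = re.compile(r"^(\d+)\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def clstr_to_rows(lines):
    """Yield (cluster_id, sequence_name, length, is_representative) tuples."""
    cluster_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER_RE.match(line)
        if match and cluster_id is not None:
            _, length, name, status = match.groups()
            yield cluster_id, name, int(length), status == "*"


def rows_to_tsv(rows):
    """Format rows as tab-delimited text, one member per line (1 = representative)."""
    return "\n".join(
        f"{cid}\t{name}\t{length}\t{int(is_rep)}"
        for cid, name, length, is_rep in rows
    )


# Hypothetical example input, not real cd-hit output.
example = [
    ">Cluster 0",
    "0\t350aa, >seqA... *",
    "1\t348aa, >seqB... at 92.50%",
    ">Cluster 1",
    "0\t120aa, >seqC... *",
]

rows = list(clstr_to_rows(example))
print(rows_to_tsv(rows))
```

The tab-delimited output can then be bulk-loaded into a database table, and keeping only the rows flagged as representative gives the non-redundant set.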
There are a couple of small problems with this software which the author is aware of
but is too busy to fix. It would be nice to make this a project to develop the
software here.
Alternatively you can use blastclust, which does what its name suggests, but has an
extra 'coverage' parameter which is not explicitly present in cd-hit. It is slower,
but on 200 sequences it will still finish in around 1 min. Also blastclust allows an
arbitrary sequence identity threshold for clustering, whereas cd-hit is limited to a
minimum of 40% identity.
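
To show how an identity threshold and a coverage threshold work together, here is a rough Python sketch. It assumes single-linkage clustering (which is how I understand blastclust to work) over a list of pairwise hits that have already been computed. The function name, the hit tuples, and the numbers are all made up for illustration; this is not blastclust's code or its command-line interface.

```python
def cluster_hits(names, hits, min_identity, min_coverage):
    """Single-linkage clustering of sequences from pairwise hits.

    hits: iterable of (name_a, name_b, percent_identity, coverage_fraction).
    Two sequences are linked only if BOTH thresholds are met.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, ties broken by member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


# Hypothetical data: s3/s4 are highly identical but only over 40% of their length.
names = ["s1", "s2", "s3", "s4"]
hits = [
    ("s1", "s2", 95.0, 0.98),
    ("s2", "s3", 91.0, 0.95),
    ("s3", "s4", 93.0, 0.40),
]

with_coverage = cluster_hits(names, hits, min_identity=90.0, min_coverage=0.9)
without_coverage = cluster_hits(names, hits, min_identity=90.0, min_coverage=0.0)
print(with_coverage)
print(without_coverage)
```

With the coverage cut-off in place s4 stays in its own cluster. Without it, the short s3/s4 match pulls all four sequences together.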
On bigger sequence sets (>5,000) the fundamental differences between blastclust and
cd-hit, chiefly speed, make cd-hit the better choice.
With all sequence clustering algorithms you have to worry about 'the domain
problem' (a multi-domain protein can link otherwise unrelated sequences into one
cluster through a shared domain), but I am not sure which technique currently deals
with this best. I know of one algorithm (DIVCLUS) which was explicitly designed to
handle this problem:
Park J, Teichmann SA.
DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains
in single- and multi-domain proteins. Bioinformatics. 1998;14(2):144-50.
http://bioinformatics.oupjournals.org/cgi/pmidlookup?view=reprint&pmid=9545446
Ta,
Dan.