[BiO BB] About clustering genes to gene family
dmb at mrc-dunn.cam.ac.uk
Fri Aug 8 07:02:06 EDT 2003
This method takes an all-against-all BLAST comparison as
input to the clustering. Can you really do that 'routinely'
with 500,000 sequences without dedicated hardware?
I guess once you have your initial 'pairs DB' you can then
add new sequences without much work, and I guess the
actual clustering is the 'efficient' part of the method.
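
To put rough numbers on the scale question, here is a back-of-the-envelope sketch in Python. It counts sequence pairs only, as a crude proxy for work; BLAST's indexing heuristics mean the real cost does not scale exactly this way. The function names and the figure of 1,000 newly added sequences are assumptions made up for illustration.

```python
def all_vs_all_pairs(n):
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


def incremental_pairs(n_old, n_new):
    """Extra pairs needed when n_new sequences join an existing pairs DB.

    Each new sequence is compared with every old one, and the new
    sequences are also compared among themselves.
    """
    return n_new * n_old + all_vs_all_pairs(n_new)


initial = all_vs_all_pairs(500_000)        # building the pairs DB from scratch
added = incremental_pairs(500_000, 1_000)  # hypothetical later update

print(f"initial build: {initial:,} pairs")
print(f"adding 1,000:  {added:,} pairs")
print(f"update is {added / initial:.2%} of the initial build")
```

Under this crude count the update costs well under 1% of the initial build, which is consistent with the guess that the all-against-all step dominates.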
The handling of multidomain proteins is interesting,
but I don't really see how it differs from demanding
a certain length of alignment within the family.
Although the technique is mathematically clean,
it is a bit hazy when it comes to the multidomain
issue. I.e., if we have protein 1 with domains ABC,
what happens to protein 2 with domains AB?
What happens to the 'families' of type 1 and 2
in this strategy?
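
A toy sketch of the alternative mentioned above (demanding a minimum aligned length) may make the question concrete. This is not TribeMCL; it is plain single-linkage grouping with an optional coverage cutoff. The protein names, the third protein carrying only domain C, the coverage fractions, and the 0.6 cutoff are all hypothetical values chosen for illustration.

```python
from collections import defaultdict


def single_linkage(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical hits: (query, subject, fraction of query aligned,
# fraction of subject aligned). Domains are assumed to be equal in length.
hits = [
    ("P1_ABC", "P2_AB", 0.67, 1.00),
    ("P1_ABC", "P3_C", 0.33, 1.00),
]
proteins = ["P1_ABC", "P2_AB", "P3_C"]


def cluster(min_coverage):
    """Keep a hit only if both sequences are covered at least min_coverage."""
    edges = [(q, s) for q, s, cq, cs in hits if min(cq, cs) >= min_coverage]
    return single_linkage(proteins, edges)


no_filter = cluster(0.0)    # every hit links its two proteins
with_filter = cluster(0.6)  # short partial alignments are dropped

print(no_filter)
print(with_filter)
```

Without the cutoff all three proteins are chained into one group through the multidomain protein; with it, the ABC and AB proteins stay together and the C-only protein is left on its own. Whether MCL behaves differently from such a cutoff on real data is exactly the open question.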
I love the extension of pairwise similarity to
group similarity using the network of BLAST
hits - that is really nice, but the biological
significance of the r factor (number of clusters)
is not investigated, which is a shame.
Anyone heard of BAG for domain decomposition
from such a network?
Thanks for the info,
Marcos Oliveira de Carvalho wrote:
>I use TribeMCL software with good results.
>Here is the URL -> http://www.ebi.ac.uk/research/cgg/tribe/
>And here is the abstract of the paper about TribeMCL:
>TribeMCL is a method for clustering proteins into related groups, which
>are termed 'protein families'. This clustering is achieved by analysing
>similarity patterns between proteins in a given dataset, and using these
>patterns to assign proteins into related groups. In many cases, proteins
>in the same protein family will have similar functional properties.
>TribeMCL uses a novel clustering method (Markov Clustering or MCL) which
>solves problems which normally hinder protein sequence clustering. These
>problems include: multi-domain proteins, peptide fragments and proteins
>which possess domains which are very widespread (promiscuous domains). The
>efficiency of the method makes it applicable to the clustering of very
>large datasets. We routinely use the algorithm to cluster datasets as
>large as 500,000 peptides.
>On Thu, 7 Aug 2003, Zheng Fu wrote:
>>Does anyone know how to cluster genes into a gene family based on the
>>For two genes, we can define a threshold to separate homologs from
>>non-homologs. But for three or more genes, how do we define the homologs? (For
>>example, if gene A and gene B have a high alignment score, and A and C also have
>>a high score, but B and C don't, can we say A, B, and C are homologs?)