[BiO BB] Clustering

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Wed Sep 3 12:46:56 EDT 2003


> > What packages support clustering of points
> > with a similarity matrix?
> 
> I don't think I quite understand the question, can you elaborate on that?

Yup... I am always finding that I have some similarities between things,
and I would like to be able to do a simple clustering of the points,
but I am not familiar with the algorithms, so I would just like to play
around a bit.

I know you can do phylogenetic analysis on any similarity matrix, but
I don't need the high resolution (many similar points closely linked to
one short branch). I would like to generally see what 'blobs' of data
I have without investing too much time into the analysis (or the
computation!).

For example I might have the AA composition of 1000 sequences, and we
may suspect that the composition is biased across these sequences (not
uniform). So we think: maybe I should break them up by secondary structure,
maybe by families, maybe I should perform chi-squared tests between every
possible combination of groups of the 1000 to find subpopulations within
which the composition isn't biased...
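To make the chi-squared idea concrete, here is a minimal sketch of the test
statistic for two groups of sequences, treating the pooled residue counts as
a 2 x 20 contingency table. The names aa_counts and chi_squared are
hypothetical helpers made up for this illustration, and restricting the
alphabet to the 20 standard residues is an assumption.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_counts(sequences):
    """Pool residue counts over a group of sequences (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    return [counts[aa] for aa in AMINO_ACIDS]


def chi_squared(counts_a, counts_b):
    """Pearson chi-squared statistic for a 2 x k table of counts.

    Returns (statistic, degrees_of_freedom). Columns that are empty in
    both groups are skipped and do not count towards the degrees of freedom.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs at least one counted residue")
    grand = total_a + total_b
    stat, used = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue
        # Expected counts if both groups shared one composition.
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
        used += 1
    return stat, max(used - 1, 0)
```

A large statistic relative to the degrees of freedom suggests the two groups
differ in composition. Converting it to a p-value needs the chi-squared
distribution, which this sketch leaves out.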

If I take each protein and compare its composition to every other, I have
an N**2/2 similarity matrix (N*(N-1)/2 distinct pairs), which I would like
to cluster, just to see
if any protein families, structural classes or taxonomic groups have a
particular bias in terms of AA composition, but this is a long complicated
analysis (I think to myself), so I don't bother.
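A rough sketch of the quick "blob" clustering described above, assuming
Manhattan distance between composition vectors as the (dis)similarity and a
simple rule that links any two proteins closer than a cutoff. That rule is
single-linkage clustering cut at one threshold; it is only one of many
possible choices. The function names and the cutoff value are made up for
illustration.

```python
from collections import Counter
from itertools import combinations

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fractional amino-acid composition of one sequence."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def distance(p, q):
    """Manhattan distance between compositions (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def threshold_clusters(sequences, cutoff):
    """Group sequence indices whose compositions are chained by distance <= cutoff."""
    comps = [composition(s) for s in sequences]
    parent = list(range(len(comps)))  # union-find forest over indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Visit each of the N*(N-1)/2 unordered pairs once.
    for i, j in combinations(range(len(comps)), 2):
        if distance(comps[i], comps[j]) <= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(comps)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = ["AAAAAG", "AAAAGA", "WWWWWC", "WWWCWW"]
    print(threshold_clusters(toy, cutoff=0.5))
```

For 1000 sequences this is about half a million comparisons, which is cheap.
The cutoff has to be chosen by eye, and single linkage tends to chain
loosely related points together, so the result is a first look, not a
finished analysis.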

Now I ask: I am sure there are thousands of clustering toolkits out there,
and I should just Google. Does anyone have any recommendations?


> > How can I derive the similarity of two matrices?
> > 
> 
> If you mean that you would like to check how "close" two similarity 
> matrices (e.g. BLOSUM, PAM) are to each other, then one method is to 
> compare the amino-acid pair frequency distributions used to construct 
> these matrices. 

You mean the similarity of two distributions? Sounds interesting...

> Look to the following paper (fig 4, and the last 
> paragraph in the "methods" section) for one example on how to do this, 
> although other methods of comparing distributions may be used just as 
> effectively:
> 
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=11790845&dopt=Abstract
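As one example of "other methods of comparing distributions", here is a
minimal sketch using the Jensen-Shannon divergence between two frequency
distributions, such as two sets of amino-acid pair frequencies. This is an
assumption chosen for illustration and is not claimed to be the method used
in the paper above. The inputs are assumed to be non-negative weights listed
in the same order.

```python
import math


def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    p, q = normalise(p), normalise(q)
    # The midpoint distribution is positive wherever p or q is positive,
    # so both KL terms below are well defined.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric and
stays finite when one distribution has zeros where the other does not, which
makes it convenient for filling in a matrix of pairwise comparisons.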

Thanks very much,
Dan.


> ./I