Table of Contents

Module: kMeans Bio/Tools/Clustering/kMeans.py

This module provides code for k-Means clustering of data.

k-Means is an algorithm for unsupervised clustering of data.

Glossary:
clusters - A group of closely related data.
centroids - A vector "in the middle" of a cluster.

Functions:
cluster - Cluster a list of data points.

Imported modules   
from Bio.Tools import listfns, distance
import ckMeans
import random
Functions   
_find_closest_centroid
cluster
first_k_points_as_centroids
random_centroids
  _find_closest_centroid 
_find_closest_centroid (
        vector,
        centroids,
        distance_fn,
        )

_find_closest_centroid(vector, centroids, distance_fn) -> index of closest centroid
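The lookup described above can be sketched as follows. This is a minimal illustration and not the module's own code: the names euclidean and find_closest_centroid are hypothetical stand-ins, and breaking ties in favor of the lowest index is an assumption.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (stand-in helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_closest_centroid(vector, centroids, distance_fn=euclidean):
    """Return the index of the centroid nearest to vector.

    Assumption: ties go to the lowest index, because list.index returns
    the first match.
    """
    distances = [distance_fn(vector, c) for c in centroids]
    return distances.index(min(distances))

centroids = [[0.0, 0.0], [10.0, 10.0]]
closest = find_closest_centroid([1.0, 2.0], centroids)
```

Any callable that takes two vectors and returns a number can be passed as distance_fn.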

  cluster 
cluster (
        data,
        k,
        distance_fn=distance.euclidean,
        init_centroids_fn=random_centroids,
        max_iterations=1000,
        update_fn=None,
        )

cluster(data, k[, distance_fn][, init_centroids_fn][, max_iterations][, update_fn]) -> (centroids, clusters) or None

Organize data into k clusters. Returns the centroids together with a list of cluster assignments in the range 0 to k-1, where each item in the list corresponds to the data point at the same position. If the algorithm does not converge within max_iterations (default 1000), returns None.

data is a list of data points, which are vectors of numbers. distance_fn is a callback function that calculates the distance between two vectors; by default, the Euclidean distance is used. init_centroids_fn is a callback function that chooses the initial centroids; by default, random_centroids is used. If update_fn is specified, it is called at the beginning of every iteration and passed the iteration number, the cluster centroids, and the current cluster assignments.
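To make the iteration and the update_fn callback concrete, here is a self-contained sketch of a k-means loop with the same general shape. It is an illustration under stated assumptions, not the module's implementation: kmeans_sketch and euclidean are hypothetical names, the starting centroids are simply the first k points so the run is repeatable, centroids are recomputed as per-dimension means, and the assignments passed to update_fn are None on the first iteration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (stand-in helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_sketch(data, k, distance_fn=euclidean, max_iterations=1000,
                  update_fn=None):
    """Return (centroids, clusters), or None if assignments never stabilize."""
    # Assumption: start from the first k points to keep the example deterministic.
    centroids = [list(point) for point in data[:k]]
    clusters = None
    for iteration in range(max_iterations):
        if update_fn is not None:
            update_fn(iteration, centroids, clusters)
        # Assign every point to the index of its nearest centroid.
        new_clusters = [
            min(range(k), key=lambda i: distance_fn(point, centroids[i]))
            for point in data
        ]
        if new_clusters == clusters:
            return centroids, clusters  # assignments unchanged: converged
        clusters = new_clusters
        # Move each centroid to the mean of its members; empty clusters stay put.
        for i in range(k):
            members = [p for p, c in zip(data, clusters) if c == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return None

seen_iterations = []

def record(iteration, centroids, clusters):
    """Example update_fn that only records the iteration number."""
    seen_iterations.append(iteration)

data = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [8.8, 9.2]]
result = kmeans_sketch(data, 2, update_fn=record)
```

On convergence, result unpacks as centroids, clusters = result. Callers should check for None before unpacking, since a run that exhausts max_iterations returns no tuple.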

Exceptions   
ValueError, "All data should have the same dimensionality."
ValueError, "Please pass in some data."
ValueError, "Please specify a positive number of clusters."
ValueError, "Please specify more data points than clusters."
ValueError, "You should have at least one iteraction."
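The messages above suggest a set of argument checks made before clustering starts. The sketch below shows one plausible arrangement. The function name check_arguments, the exact conditions, and the order of the checks are assumptions; only the message strings are taken from the list above and are reproduced verbatim.

```python
def check_arguments(data, k, max_iterations):
    """Raise ValueError for inputs that cannot be clustered (assumed checks)."""
    if not data:
        raise ValueError("Please pass in some data.")
    if k < 1:
        raise ValueError("Please specify a positive number of clusters.")
    if len(data) <= k:
        raise ValueError("Please specify more data points than clusters.")
    if max_iterations < 1:
        raise ValueError("You should have at least one iteraction.")
    ndims = len(data[0])
    if any(len(point) != ndims for point in data):
        raise ValueError("All data should have the same dimensionality.")

# A two-dimensional point mixed with a one-dimensional point is rejected.
try:
    check_arguments([[1.0, 2.0], [3.0]], 1, 10)
except ValueError as err:
    message = str(err)
```

Callers can catch ValueError around the clustering call and report the message to the user.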
  first_k_points_as_centroids 
first_k_points_as_centroids ( data,  k )

first_k_points_as_centroids(data, k) -> list of centroids

Picks the first k points as the initial centroids. This is not a good method unless the data is randomized, but it does provide determinism that is useful for debugging.
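A minimal sketch of this initializer, along with one way to randomize the data beforehand without losing repeatability. The name first_k_points is a hypothetical stand-in, and copying each point is an assumption made so that later centroid updates cannot alter the input data.

```python
import random

def first_k_points(data, k):
    """Return copies of the first k data points."""
    return [list(point) for point in data[:k]]

data = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [8.8, 9.2]]
initial = first_k_points(data, 2)

# Shuffling a copy first removes any dependence on input order, while the
# fixed seed keeps the run repeatable.
shuffled = list(data)
random.Random(0).shuffle(shuffled)
initial_shuffled = first_k_points(shuffled, 2)
```

In the unshuffled case both starting centroids come from the same tight group of points, which illustrates why ordered input makes this a poor default.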

  random_centroids 
random_centroids ( data,  k )

random_centroids(data, k) -> list of centroids

Return a list of data points to serve as the initial centroids. These are k randomly chosen data points. Tries to avoid repeated centroids, if possible.

Exceptions   
ValueError, "k is larger than the number of data points"
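One way to pick k random data points while avoiding repeats is sketched below. It is not the module's code: pick_random_centroids and the rng parameter are hypothetical, and sampling from the de-duplicated points whenever enough of them exist is an assumed reading of "if possible".

```python
import random

def pick_random_centroids(data, k, rng=random):
    """Return k randomly chosen data points, preferring distinct values."""
    if k > len(data):
        raise ValueError("k is larger than the number of data points")
    # Collect distinct points while preserving their original order.
    unique = []
    for point in data:
        if point not in unique:
            unique.append(point)
    # Fall back to the full data set only when too few distinct points exist.
    pool = unique if len(unique) >= k else data
    return [list(point) for point in rng.sample(pool, k)]

data = [[1.0], [1.0], [2.0], [3.0]]
picks = pick_random_centroids(data, 3, rng=random.Random(0))

try:
    pick_random_centroids(data, 5)
except ValueError as err:
    message = str(err)
```

Passing a seeded random.Random instance as rng makes the choice repeatable. Leaving rng at its default uses the shared module-level generator.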


This document was automatically generated on Mon Jul 1 12:03:05 2002 by HappyDoc version 2.0.1