Module: kMeans

Bio/Tools/Clustering/kMeans.py

This provides code for doing k-Means clustering of data.

k-Means is an algorithm for unsupervised clustering of data.

Glossary:
    clusters - A group of closely related data.
    centroids - A vector "in the middle" of a cluster.

Functions:
    cluster - Cluster a list of data points.

Imported modules

from Bio.Tools import listfns, distance
import ckMeans
import random

Functions

_find_closest_centroid
cluster
first_k_points_as_centroids
random_centroids

_find_closest_centroid

_find_closest_centroid (
        vector,
        centroids,
        distance_fn,
        )

_find_closest_centroid(vector, centroids, distance_fn) -> index of closest centroid
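The lookup described by this signature can be sketched as a linear scan over the centroids. This is a minimal illustration, not the module's actual source: the names find_closest_centroid and euclidean are hypothetical stand-ins, and the tie-breaking rule (the lowest index wins) is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_closest_centroid(vector, centroids, distance_fn):
    """Return the index of the centroid nearest to `vector`.

    Hypothetical stand-in for _find_closest_centroid; on a tie the
    lowest index is kept (an assumption, not documented behaviour).
    """
    closest_index = 0
    closest_dist = distance_fn(vector, centroids[0])
    for i in range(1, len(centroids)):
        dist = distance_fn(vector, centroids[i])
        if dist < closest_dist:
            closest_index, closest_dist = i, dist
    return closest_index


centroids = [[0.0, 0.0], [10.0, 10.0]]
print(find_closest_centroid([1.0, 2.0], centroids, euclidean))  # prints 0
print(find_closest_centroid([9.0, 8.0], centroids, euclidean))  # prints 1
```

Any callable that takes two vectors and returns a comparable number can be passed as distance_fn.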

cluster

cluster (
        data,
        k,
        distance_fn=distance.euclidean,
        init_centroids_fn=random_centroids,
        max_iterations=1000,
        update_fn=None,
        )

cluster(data, k[, distance_fn][, init_centroids_fn][, max_iterations][, update_fn]) -> (centroids, clusters) or None

Organize data into k clusters. On convergence, returns (centroids, clusters), where clusters is a list of cluster assignments in the range 0 to k-1 whose items correspond, in order, to the list of data points. If the algorithm does not converge within max_iterations (default is 1000), returns None.

data is a list of data points, which are vectors of numbers. distance_fn is a callback function that calculates the distance between two vectors. By default, the Euclidean distance will be used. init_centroids_fn is a callback function that takes data and k and returns the initial centroids. By default, random_centroids will be used. If update_fn is specified, it is called at the beginning of every iteration and passed the iteration number, the cluster centroids, and the current cluster assignments.

Exceptions

ValueError, "All data should have the same dimensionality."
ValueError, "Please pass in some data."
ValueError, "Please specify a positive number of clusters."
ValueError, "Please specify more data points than clusters."
ValueError, "You should have at least one iteraction."
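The behaviour described for cluster can be illustrated with a small standalone sketch. It is not the module's implementation: kmeans_sketch and euclidean are hypothetical names, the exact conditions that trigger each ValueError are inferred from the messages, the centroids are seeded from the first k points to keep the run deterministic, and a centroid that loses all its points is simply left where it was.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_sketch(data, k, distance_fn=euclidean,
                  max_iterations=1000, update_fn=None):
    """Return (centroids, clusters), or None if there is no convergence."""
    # Input checks; the conditions are assumptions inferred from the
    # documented messages, which are copied verbatim.
    if not data:
        raise ValueError("Please pass in some data.")
    if k < 1:
        raise ValueError("Please specify a positive number of clusters.")
    if len(data) <= k:
        raise ValueError("Please specify more data points than clusters.")
    if max_iterations < 1:
        raise ValueError("You should have at least one iteraction.")
    ndims = len(data[0])
    if any(len(point) != ndims for point in data):
        raise ValueError("All data should have the same dimensionality.")

    # Assumption: seed with the first k points so the sketch is repeatable.
    centroids = [list(point) for point in data[:k]]
    clusters = [None] * len(data)

    for iteration in range(max_iterations):
        # The callback runs at the beginning of every iteration.
        if update_fn is not None:
            update_fn(iteration, centroids, clusters)

        # Assignment step: each point goes to its nearest centroid.
        new_clusters = [
            min(range(k), key=lambda i: distance_fn(point, centroids[i]))
            for point in data
        ]
        if new_clusters == clusters:
            return centroids, clusters  # assignments stable: converged
        clusters = new_clusters

        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, c in zip(data, clusters) if c == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return None


data = [[1.0, 1.0], [9.0, 9.0], [1.5, 2.0],
        [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]
seen_iterations = []
result = kmeans_sketch(
    data, 2,
    update_fn=lambda n, cents, assign: seen_iterations.append(n))
if result is not None:
    centroids, clusters = result
    print(clusters)  # prints [0, 1, 0, 1, 0, 1]
print(seen_iterations)  # prints [0, 1]
```

Because the function may return None, callers should check the result before unpacking it, as the usage lines do.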

first_k_points_as_centroids

first_k_points_as_centroids ( data,  k )

first_k_points_as_centroids(data, k) -> list of centroids

Picks the first k points as the initial centroids. This is not a good method unless the data is in random order, but it is deterministic, which is useful for debugging.
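A minimal sketch of this strategy follows, under the assumption that the centroids are plain copies of the leading data points. The name first_k_points_sketch is hypothetical. The second call shows one way to randomize the order first, using a seeded shuffle so the run stays repeatable.

```python
import random


def first_k_points_sketch(data, k):
    """Return copies of the first k data points as starting centroids."""
    return [list(point) for point in data[:k]]


data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]

# The same input always yields the same starting centroids.
print(first_k_points_sketch(data, 2))  # prints [[0.0, 0.0], [0.1, 0.2]]

# With ordered data the two seeds above are close neighbours. Shuffling
# a copy first (seeded here for repeatability) reduces that risk.
shuffled = list(data)
random.Random(0).shuffle(shuffled)
print(first_k_points_sketch(shuffled, 2))
```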

random_centroids

random_centroids ( data,  k )

random_centroids(data, k) -> list of centroids

Return a list of data points to serve as the initial centroids. These are k randomly chosen data points. Tries to avoid repeated centroids, if possible.

Exceptions

ValueError, "k is larger than the number of data points"
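One way to pick k random data points while avoiding repeats where possible is sketched below. This is an illustration only: random_centroids_sketch and its rng parameter are hypothetical, and the fallback used when the data holds fewer than k distinct points (pad with random repeats) is an assumption about what "if possible" means.

```python
import random


def random_centroids_sketch(data, k, rng=random):
    """Return k data points chosen at random, distinct where possible."""
    if k > len(data):
        raise ValueError("k is larger than the number of data points")

    # Collect the distinct points, preserving first-seen order.
    unique = []
    for point in data:
        if point not in unique:
            unique.append(point)

    if len(unique) >= k:
        # Enough distinct points: sample without replacement.
        return [list(p) for p in rng.sample(unique, k)]

    # Too few distinct points: use them all, then pad with repeats.
    extra = [rng.choice(data) for _ in range(k - len(unique))]
    return [list(p) for p in unique + extra]


data = [[1, 1], [1, 1], [2, 2], [3, 3]]
chosen = random_centroids_sketch(data, 3, random.Random(42))
print(sorted(chosen))  # prints [[1, 1], [2, 2], [3, 3]]
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which can help when comparing clustering runs.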

This document was automatically generated on Mon Jul 1 12:03:05 2002 by HappyDoc version 2.0.1