This module provides code for k-Means clustering of data.
k-Means is an unsupervised algorithm that partitions data points into k clusters.
Glossary:
clusters - Groups of closely related data points.
centroids - Vectors "in the middle" of the clusters, one per cluster.
Functions:
cluster - Cluster a list of data points.
Imported modules
from Bio.Tools import listfns, distance
import ckMeans
import random
Functions
_find_closest_centroid
cluster
first_k_points_as_centroids
random_centroids
_find_closest_centroid
_find_closest_centroid (
vector,
centroids,
distance_fn,
)
_find_closest_centroid(vector, centroids, distance_fn) ->
index of closest centroid
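The lookup this helper performs can be sketched as follows. This is a minimal illustration, not the module's actual source: find_closest_centroid and euclidean are stand-in names defined here, and ties are assumed to resolve to the lowest index.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (stand-in helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_closest_centroid(vector, centroids, distance_fn=euclidean):
    """Return the index of the centroid nearest to vector (hypothetical sketch)."""
    # min() keeps the first minimal element, so ties go to the lowest index.
    return min(range(len(centroids)),
               key=lambda i: distance_fn(vector, centroids[i]))


centroids = [[0.0, 0.0], [10.0, 10.0]]
print(find_closest_centroid([1.0, 2.0], centroids))  # prints 0
```

Any callable that takes two vectors and returns a number can be passed as distance_fn in this sketch.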
cluster
cluster (
data,
k,
distance_fn=distance.euclidean,
init_centroids_fn=random_centroids,
max_iterations=1000,
update_fn=None,
)
cluster(data, k[, distance_fn][, init_centroids_fn][, max_iterations][, update_fn]) ->
(centroids, clusters) or None
Organize data into k clusters. Returns (centroids, clusters), where
clusters is a list of cluster assignments in the range 0 to k-1 whose
items correspond, in order, to the list of data points. If the
algorithm does not converge by max_iterations (default is 1000),
returns None. data is a list of data points, which are vectors of
numbers. distance_fn is a callback function that calculates the
distance between two vectors. By default, the Euclidean distance will
be used. init_centroids_fn is a callback that takes data and k and
returns the initial centroids. By default, random_centroids is used.
If update_fn is specified, it is called at the beginning of every
iteration and passed the iteration number, cluster centroids, and
current cluster assignments.
Exceptions
ValueError, "All data should have the same dimensionality."
ValueError, "Please pass in some data."
ValueError, "Please specify a positive number of clusters."
ValueError, "Please specify more data points than clusters."
ValueError, "You should have at least one iteraction."
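The contract described for cluster (argument checks, a per-iteration update_fn callback, and a (centroids, clusters) or None result) can be illustrated with a minimal, self-contained sketch. It is not this module's implementation: kmeans_sketch and euclidean are hypothetical names, the error messages are only modeled on the ones listed, the initial centroids are assumed to be k randomly sampled data points, an empty cluster is assumed to keep its previous centroid, and update_fn is assumed to receive None as the assignments on the first iteration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (stand-in helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_sketch(data, k, distance_fn=euclidean, max_iterations=1000,
                  update_fn=None):
    """Return (centroids, clusters), or None if max_iterations is exhausted."""
    # Argument checks modeled on the documented ValueErrors.
    if not data:
        raise ValueError("Please pass in some data.")
    if k < 1:
        raise ValueError("Please specify a positive number of clusters.")
    if len(data) <= k:
        raise ValueError("Please specify more data points than clusters.")
    if max_iterations < 1:
        raise ValueError("You should have at least one iteration.")
    if len({len(v) for v in data}) != 1:
        raise ValueError("All data should have the same dimensionality.")

    # Assumption: start from k randomly chosen data points.
    centroids = [list(v) for v in random.sample(data, k)]
    clusters = None
    for iteration in range(max_iterations):
        if update_fn is not None:
            update_fn(iteration, centroids, clusters)
        # Assignment step: label each point with its nearest centroid.
        new_clusters = [
            min(range(k), key=lambda i: distance_fn(v, centroids[i]))
            for v in data
        ]
        if new_clusters == clusters:
            return centroids, clusters  # assignments stopped changing
        clusters = new_clusters
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, c in zip(data, clusters) if c == i]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return None


data = [[0.0], [1.0], [10.0], [11.0]]
result = kmeans_sketch(data, 2)
if result is not None:
    centroids, clusters = result
    print(sorted(centroids))  # prints [[0.5], [10.5]]
```

Because the starting centroids are random, the label given to each group (0 or 1) can differ between runs, which is why the usage example prints the sorted centroids rather than the assignments.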
first_k_points_as_centroids
first_k_points_as_centroids ( data, k )
first_k_points_as_centroids(data, k) -> list of centroids
Picks the first k points as the initial centroids. This is a poor
choice unless the data is already in random order, but it is
deterministic, which is useful for debugging.
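A minimal sketch of this strategy, under the assumption that the centroids are simply copies of the leading data points. The name first_k_points is a stand-in, and the seeded shuffle is an optional addition shown only to illustrate how randomized input can be combined with reproducible runs.

```python
import random


def first_k_points(data, k):
    """Return copies of the first k data points (hypothetical stand-in)."""
    return [list(point) for point in data[:k]]


data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(first_k_points(data, 2))  # prints [[1.0, 2.0], [3.0, 4.0]]

# Shuffling a copy with a fixed seed keeps runs reproducible while
# removing any ordering bias present in the input.
shuffled = list(data)
random.Random(42).shuffle(shuffled)
print(first_k_points(shuffled, 2))
```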
random_centroids
random_centroids ( data, k )
random_centroids(data, k) -> list of centroids
Return a list of data points to serve as the initial centroids.
These are k randomly chosen data points. Tries to avoid repeated
centroids, if possible.
Exceptions
ValueError, "k is larger than the number of data points"
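One way to pick k random data points while avoiding repeated centroids where possible is sketched below. This is an illustrative assumption about the approach, not the module's source: pick_random_centroids and its rng parameter are hypothetical, and repeats are only used when the data holds fewer than k distinct points.

```python
import random


def pick_random_centroids(data, k, rng=random):
    """Return k data points chosen at random, preferring distinct ones."""
    if k > len(data):
        raise ValueError("k is larger than the number of data points")
    # Collect the distinct points, preserving their original order.
    unique = []
    seen = set()
    for point in data:
        key = tuple(point)
        if key not in seen:
            seen.add(key)
            unique.append(list(point))
    if len(unique) >= k:
        return rng.sample(unique, k)
    # Too few distinct points: use them all, then fill up with repeats.
    extra = rng.sample([list(p) for p in data], k - len(unique))
    return unique + extra


data = [[1, 1], [1, 1], [2, 2], [3, 3]]
print(sorted(pick_random_centroids(data, 3)))  # prints [[1, 1], [2, 2], [3, 3]]
```

Passing a seeded random.Random instance as rng makes the selection reproducible in this sketch.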