K-Means Clustering

Posted on Jun 14, 2018 in Notes • 10 min read

An unsupervised clustering algorithm: it separates unlabelled samples into groups by assigning each sample to the nearest of k centroids.

parameters

  • n_clusters: number of centroids/clusters to generate.
  • max_iter: maximum number of iterations for a single KMeans run, in case it takes too long to converge.
  • n_init: number of times KMeans runs with different centroid seeds.
    The best outcome is kept, i.e. the run with the lowest inertia
    (sum of squared distances between each sample and its nearest centroid).
  • init: method for centroid initialisation. {'k-means++', 'random', custom ndarray}
    • 'k-means++': spreads the initial centroids apart, which speeds up convergence (the default)
    • 'random': picks n_clusters samples at random as the initial centroids
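A minimal sketch of these parameters in action, using synthetic blob data (the data and parameter values here are illustrative assumptions, not from the original post):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in [(0, 0), (5, 5), (0, 5)]
])

# n_init runs with different centroid seeds; the fit with the lowest
# inertia is kept. 'k-means++' spreads the initial centroids out.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300,
            random_state=42)
km.fit(X)
print(km.inertia_)
```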

attributes

  • cluster_centers_: coordinates of centroids
  • labels_: labels of each point
  • inertia_: sum of squared distances of samples to their closest cluster centroid
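The three attributes fit together: indexing cluster_centers_ by labels_ gives each sample's assigned centroid, and summing the squared distances reproduces inertia_. A small check (random data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# inertia_ = sum of squared distances from each sample to the
# centroid of the cluster it was assigned to (labels_)
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()
print(np.isclose(manual_inertia, km.inertia_))
```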

GOTCHAS

  • fast in practice, even though each fit runs n_init times under the hood
  • assumes your features are on comparable scales, i.e. it is sensitive to feature scaling; standardise features first
  • assumes clusters are roughly spherical and of similar size
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
kmeans.fit(df)
# repr shows the defaults, e.g.:
# KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10,
#        n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

labels = kmeans.predict(df)          # cluster label for each row of df
centroids = kmeans.cluster_centers_  # one centroid per cluster, shape (5, n_features)
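The feature-scaling gotcha above can be demonstrated directly: give one feature a much larger scale than the one carrying the real cluster structure, and KMeans clusters on the noise until the data is standardised. (The synthetic data and StandardScaler choice here are illustrative assumptions.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters separated only along the first feature; the second
# feature is pure noise with a huge scale that drowns the structure
X = np.vstack([
    np.column_stack([rng.normal(0, 0.1, 100), rng.normal(0, 1000, 100)]),
    np.column_stack([rng.normal(5, 0.1, 100), rng.normal(0, 1000, 100)]),
])
true = np.repeat([0, 1], 100)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

def agreement(pred):
    # cluster labels are arbitrary, so take the better of the two matchings
    return max((pred == true).mean(), (pred != true).mean())

print(f"raw: {agreement(raw):.2f}, scaled: {agreement(scaled):.2f}")
```

On the raw data, agreement hovers near chance because the split follows the large-scale noise feature; after standardising, the true clusters are recovered.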