K-Means Clustering

Posted on Jun 14, 2018 in Notes • 10 min read

An unsupervised clustering algorithm: it separates unlabelled samples into groups by assigning each sample to the nearest of k centroids.

parameters

  • n_clusters: number of centroids/clusters to generate.
  • max_iter: maximum number of iterations for a single KMeans run, in case it takes too long to converge.
  • n_init: number of times KMeans runs with different centroid seeds.
    The best outcome is kept, i.e. the run with the lowest inertia
    (sum of squared distances between each sample and its nearest centroid).
  • init: method for centroid initialisation. {'k-means++', 'random', custom ndarray}
    • 'k-means++': spreads the initial centroids apart, which speeds up convergence (the default)
    • 'random': picks n_clusters samples at random as the initial centroids
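A minimal sketch of these parameters in action, using synthetic blob data (the data and parameter values here are illustrative assumptions, not from the original post):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in [(0, 0), (5, 5), (0, 5)]
])

# n_init runs with different centroid seeds; the fit with the lowest
# inertia is kept. 'k-means++' spreads the initial centroids out.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300,
            random_state=42)
km.fit(X)
print(km.inertia_)
```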

attributes

  • cluster_centers_: coordinates of centroids
  • labels_: labels of each point
  • inertia_: sum of squared distances of samples to their closest cluster centroid
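The three attributes fit together: indexing cluster_centers_ by labels_ gives each sample's assigned centroid, and summing the squared distances reproduces inertia_. A small check (random data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# inertia_ = sum of squared distances from each sample to the
# centroid of the cluster it was assigned to (labels_)
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()
print(np.isclose(manual_inertia, km.inertia_))
```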

GOTCHAS

  • fast in practice, even though each fit runs n_init times under the hood
  • assumes your features are on comparable scales, i.e. it is sensitive to feature scaling; standardise features first
  • assumes clusters are roughly spherical and of similar size
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
kmeans.fit(df)
# repr shows the defaults, e.g.:
# KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10,
#        n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

labels = kmeans.predict(df)          # cluster label for each row of df
centroids = kmeans.cluster_centers_  # one centroid per cluster, shape (5, n_features)
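The feature-scaling gotcha above can be demonstrated directly: give one feature a much larger scale than the one carrying the real cluster structure, and KMeans clusters on the noise until the data is standardised. (The synthetic data and StandardScaler choice here are illustrative assumptions.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters separated only along the first feature; the second
# feature is pure noise with a huge scale that drowns the structure
X = np.vstack([
    np.column_stack([rng.normal(0, 0.1, 100), rng.normal(0, 1000, 100)]),
    np.column_stack([rng.normal(5, 0.1, 100), rng.normal(0, 1000, 100)]),
])
true = np.repeat([0, 1], 100)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

def agreement(pred):
    # cluster labels are arbitrary, so take the better of the two matchings
    return max((pred == true).mean(), (pred != true).mean())

print(f"raw: {agreement(raw):.2f}, scaled: {agreement(scaled):.2f}")
```

On the raw data, agreement hovers near chance because the split follows the large-scale noise feature; after standardising, the true clusters are recovered.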