K-Means Clustering
Posted on Jun 14, 2018 in Notes • 10 min read
K-means is an unsupervised clustering algorithm: it separates samples into groups (clusters) of similar points.
parameters

- n_clusters: number of centroids/clusters to generate.
- max_iter: cap on the iterations of a single KMeans run, in case it takes too long to converge.
- n_init: number of times KMeans runs with different initial centroids. Returns the best outcome, i.e. the one with the least inertia (sum of squared distances between each sample and its respective centroid).
- init: method for centroid initialisation, one of {'k-means++', 'random', custom ndarray}. 'k-means++' spreads the initial centroids apart, which speeds up convergence; 'random' picks random samples as the initial centroids.
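A minimal sketch of how n_init and init interact: run KMeans with each init strategy and compare the resulting inertia. The make_blobs data is illustrative, not from the post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 4 well-separated blobs in 2D
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for init in ("k-means++", "random"):
    # n_init=10: run 10 times, keep the run with the lowest inertia
    km = KMeans(n_clusters=4, init=init, n_init=10, random_state=0).fit(X)
    print(init, round(km.inertia_, 2))
```

On easy data like this both strategies usually land on the same solution; 'k-means++' tends to need fewer iterations to get there.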
attributes

- cluster_centers_: coordinates of the centroids
- labels_: cluster label of each point
- inertia_: sum of squared distances of samples to their closest centroid
GOTCHAS

- very fast, even though it runs n_init times
- sensitive to feature scaling: assumes your features are normalised, so scale them first
- assumes clusters are roughly spherical and of similar size
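The feature-scaling gotcha in practice: a sketch (toy data, not from the post) that standardises features with StandardScaler before clustering, so no single large-scale feature dominates the distance computation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on wildly different scales (e.g. metres vs millimetres);
# without scaling, the second feature would dominate the euclidean distances.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```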
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)  # other parameters keep their defaults:
# KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10,
#        n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
kmeans.fit(df)

labels = kmeans.predict(df)          # cluster label for each row of df
centroids = kmeans.cluster_centers_  # coordinates of the 5 centroids
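Since inertia_ came up above, a common companion technique (not from the post) is the "elbow" heuristic for choosing n_clusters: fit KMeans for a range of k and look for the bend where inertia stops dropping sharply. Sketched on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 5 true clusters
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}

for k, v in inertias.items():
    print(k, round(v, 1))  # inertia always shrinks as k grows; look for the elbow
```

The drop in inertia should flatten out noticeably after k=5 here, which is the "elbow".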