K-Neighbors Classification

Posted on Jun 14, 2018 in Notes • 11 min read

K-Neighbors is a supervised classification algorithm. Classification is the process of assigning samples to groups (classes).

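It is a non-parametric approach to classification: it assumes no particular functional form for the decision boundary and predicts directly from the stored training samples. So, for example, you don't have to worry about whether your data is linearly separable.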
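To illustrate the point about linear separability, here is a minimal sketch on synthetic XOR-style data. The data, the seed, and the comparison against `LogisticRegression` are assumptions chosen for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# XOR-style data: four tight clusters, opposite corners share a class,
# so no single straight line separates the two classes.
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
labels = np.array([0, 0, 1, 1])
X = np.vstack([c + 0.05 * rng.randn(50, 2) for c in centers])
y = np.repeat(labels, 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
linear = LogisticRegression().fit(X, y)

# Accuracy on the same data each model was fit on, just to compare
# what shape of decision boundary each one can represent.
knn_acc = knn.score(X, y)
linear_acc = linear.score(X, y)
print("KNN accuracy:", knn_acc)
print("Logistic regression accuracy:", linear_acc)
```

A linear model has to misclassify at least one whole cluster here, while the neighbor vote only looks at the local structure around each sample.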

__init__(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1)

parameters

  • n_neighbors: (int) The number of neighbors to consider. Keep it odd when doing binary classification, particularly when you use uniform weighting, so the vote cannot tie.
  • weights: (str or callable) How to count the votes from the neighbors: 'uniform' gives every neighbor an equal vote, 'distance' weights each vote by the inverse of its distance, and a callable lets you supply your own weighting.
  • algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'} The method used to search through your training data to find the nearest neighbors.
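As a rough sketch of how the `weights` parameter changes the vote, the toy example below uses five hand-made one-dimensional samples (an assumption for illustration) and asks both weighting schemes about the same query point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One feature, two classes: class 0 sits near 0, class 1 sits near 10.
X = np.array([[0.0], [1.0], [9.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1, 1])

query = [[2.0]]  # much closer to the class-0 samples

preds = {}
for weights in ("uniform", "distance"):
    # n_neighbors=5 means every training sample gets a vote.
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X, y)
    preds[weights] = int(knn.predict(query)[0])
    print(weights, preds[weights], knn.predict_proba(query))
```

With uniform weighting the three distant class-1 samples outvote the two nearby class-0 samples; with distance weighting the nearby samples count for more and the prediction flips to class 0.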

GOTCHAS

  • Use an odd "K" for binary classification so the vote cannot tie.
  • Sensitive to feature scaling: features with larger numeric ranges dominate the distance calculation.
  • Sensitive to perturbations and the local structure of your dataset, particularly at lower "K" values.
  • Data needs to be measurable: you need a metric for computing the distance between samples in feature space.
  • With large "K" values, you have to be more cautious of the overall class distribution of your samples. Classes that make up an especially high percentage of the samples can be unfairly favored in the vote.
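Two of the gotchas above, feature scaling and large "K" values, can be sketched as follows. The synthetic data, the seed, and the choice of `StandardScaler` in a pipeline are assumptions for illustration; other scalers may suit real data better.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# --- Gotcha: feature scaling ---
# Column 0 carries the class signal on a small scale,
# column 1 is pure noise on a much larger scale.
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 200)
informative = y + 0.1 * rng.randn(400)
noise = 1000.0 * rng.randn(400)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled.fit(X_train, y_train)

raw_acc = raw.score(X_test, y_test)
scaled_acc = scaled.score(X_test, y_test)
print("unscaled accuracy:", raw_acc)
print("scaled accuracy:  ", scaled_acc)

# --- Gotcha: large K and class distribution ---
# Nine class-0 samples near 0, three class-1 samples near 100.
X_imb = np.array([[v] for v in [0, 1, 2, 3, 4, 5, 6, 7, 8, 100, 101, 102]],
                 dtype=float)
y_imb = np.array([0] * 9 + [1] * 3)
query = [[101.0]]  # sits in the middle of the class-1 samples

small_k = KNeighborsClassifier(n_neighbors=3).fit(X_imb, y_imb)
large_k = KNeighborsClassifier(n_neighbors=12).fit(X_imb, y_imb)
small_k_pred = small_k.predict(query)[0]
large_k_pred = large_k.predict(query)[0]
print("K=3: ", small_k_pred)
print("K=12:", large_k_pred)
```

In the second half, K=12 lets every training sample vote, so the majority class wins even for a query that sits right on top of the minority class.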

DOCUMENTATION

Separating out the 'labels' of the data

In [ ]:
# Process:
import pandas as pd

# Load a dataset into a dataframe
X = pd.read_csv('data.set', index_col=0)

# Do basic wrangling, but no transformations
# ...

# Immediately copy out the classification / label / class / answer column
y = X['classification'].copy()
X.drop(labels=['classification'], inplace=True, axis=1)

# Feature scaling as necessary
# ...

# Machine Learning
# ...

# Evaluation
# ...
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
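knn.fit(X, y)

# You can pass in a dataframe or an ndarray. Each row needs one value per
# feature, so these single-value rows assume X has exactly one feature column.
knn.predict([[1.1]])

# Class membership probabilities for each sample
knn.predict_proba([[0.9]])

# Returns the mean accuracy on the given data and labels. Scoring the
# training data, as here, is optimistic; prefer held-out test data.
knn.score(X, y)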
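Putting the process together, here is one possible end-to-end sketch. Because 'data.set' is not available, it swaps in scikit-learn's bundled iris data and adds a 'classification' column to mirror the notes; the train/test split, the `StandardScaler` step, and all parameter values are assumptions rather than the original workflow.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a dataset into a dataframe (iris stands in for 'data.set')
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X['classification'] = iris.target

# Immediately copy out the label column, then drop it from the features
y = X['classification'].copy()
X = X.drop(columns=['classification'])

# Hold out a test set so evaluation uses samples the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Feature scaling and the classifier in one pipeline, so the scaler
# is fit on the training data only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Evaluation: predictions, class probabilities, and mean accuracy
print(model.predict(X_test.iloc[:3]))
probabilities = model.predict_proba(X_test.iloc[:3])
print(probabilities)
test_accuracy = model.score(X_test, y_test)
print("held-out accuracy:", test_accuracy)
```

Scoring on `X_test` rather than the training data gives a less optimistic picture of how the classifier handles new samples.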