PCA Dimensionality Reduction
Posted on Jun 14, 2018 in Notes • 12 min read
PCA Dimensionality Reduction¶
An unsupervised learning technique that reduces the dimensionality of your dataset.
It models a linear subspace of your data by capturing its greatest variability.
It assumes a linear relationship between features.
Goal: capture your dataset's most variant directions, i.e. rank feature importance by variance.
Under the hood: it accesses your dataset's covariance structure directly, using matrix calculations and eigenvectors to compute a new set of orthogonal features that best describe your samples.
- Be sure to use feature scaling (SciKit-Learn's StandardScaler is a good fit for scaling your data before performing dimensionality reduction).
- Linear transformation only: it rotates and translates the feature space of your samples, but will not skew them.
- All column header names are lost after .transform(), which returns a plain NumPy array (see the sketch below).
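A minimal sketch of that workflow on a made-up toy DataFrame (column names and values are invented for illustration): scale first with StandardScaler, then reduce with PCA, and note that the transformed result is an unlabeled NumPy array.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three correlated features
toy = pd.DataFrame({
    'height': [1.60, 1.72, 1.85, 1.58, 1.91],
    'weight': [55.0, 70.0, 90.0, 52.0, 95.0],
    'shoe':   [37, 41, 45, 36, 46],
})

scaled = StandardScaler().fit_transform(toy)    # scale before PCA
reduced = PCA(n_components=2).fit_transform(scaled)

print(type(reduced))      # numpy.ndarray -- the column names are gone
print(reduced.shape)      # (5, 2)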
In [1]:
# General import
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from sklearn import preprocessing
plt.style.use('ggplot')
Feature scaling¶
Set scaleFeatures = True to activate feature scaling.
In [2]:
def scaleFeaturesDF(df):
    # Feature scaling is a type of transformation that only changes the
    # scale, but not the number of features. Because of this, we can still
    # use the original dataset's column names... so long as we keep in
    # mind that the _units_ have been altered:
    scaled = preprocessing.StandardScaler().fit_transform(df)
    scaled = pd.DataFrame(scaled, columns=df.columns)
    print("New Variances:\n", scaled.var())
    print("New Describe:\n", scaled.describe())
    return scaled

scaleFeatures = False
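For context, a quick usage sketch of scaleFeaturesDF on a small made-up DataFrame (names and values invented for illustration); it reuses the imports from the cell above.

demo = pd.DataFrame({'bgr': [120, 150, 98, 210], 'wc': [7800, 6000, 9100, 11000]})
demo_scaled = scaleFeaturesDF(demo)   # prints the new variances and describe()
# StandardScaler standardizes each column to mean 0 and unit (population) variance;
# pandas .var() uses ddof=1, so the printed variances come out to n/(n-1) = 1.33 here.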
Boilerplate code for visualisation¶
In [3]:
def drawVectors(transformed_features, components_, columns, plt, scaled):
    if not scaled:
        return plt.axes()  # No cheating ;-)

    num_columns = len(columns)

    # This function will project your *original* features (columns)
    # onto your principal component feature-space, so that you can
    # visualize how "important" each one was in the
    # multi-dimensional scaling

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:, 0])
    yvector = components_[1] * max(transformed_features[:, 1])

    ## visualize projections

    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i]: math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)

    ax = plt.axes()
    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax
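The "Features by importance" ranking inside drawVectors can also be checked without plotting. A minimal sketch on assumed random toy data (feature names invented), reusing scaleFeaturesDF from above: each original feature's loadings on the first two components are scaled by the maximum transformed value, and the resulting vector length is its "importance".

rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['feat_a', 'feat_b', 'feat_c'])
toy_scaled = scaleFeaturesDF(toy)

from sklearn.decomposition import PCA
toy_pca = PCA(n_components=2).fit(toy_scaled)
toy_T = toy_pca.transform(toy_scaled)

# Loadings of each original feature on PC1/PC2, scaled by the max transformed value
xvec = toy_pca.components_[0] * max(toy_T[:, 0])
yvec = toy_pca.components_[1] * max(toy_T[:, 1])
ranking = sorted(((math.sqrt(xvec[i]**2 + yvec[i]**2), col)
                  for i, col in enumerate(toy_scaled.columns)), reverse=True)
print(ranking)   # longer vector => more "important" original feature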
PCA concept¶
In [4]:
def computePCA(input_data):
    # Center the data: subtract each feature's mean from that feature
    M = (input_data - input_data.T.mean(axis=1)).T
    # Compute the covariance matrix (M is features x samples here)
    C = np.cov(M)
    # Compute eigenvalues + eigenvectors of the covariance matrix
    latent, coeff = np.linalg.eig(C)
    # Sort eigenvalues in descending order (greatest variance first)
    index = np.argsort(latent)[::-1]
    # Sort eigenvectors by their sorted eigenvalues
    coeff = coeff[:, index]
    latent = latent[index]
    return latent
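A quick cross-check of this hand-rolled version against SciKit-Learn, on assumed random data: the sorted eigenvalues should match PCA's explained_variance_, since both are based on the unbiased 1/(n-1) covariance estimate.

from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 3))           # assumed random data: 200 samples x 3 features

eigvals = computePCA(X)                 # eigenvalues of the covariance matrix, descending
sk = PCA(n_components=3).fit(X)

print(np.round(eigvals, 6))
print(np.round(sk.explained_variance_, 6))   # should agree with the eigenvalues above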
SciKit-Learn PCA¶
- n_components: the number of dimensions you wish to keep. It should be <= the number of original features.
- svd_solver: dictates whether a full singular value decomposition should be performed on your data, or a randomized, truncated one (see the sketch after this list).
  - ='full': razor-sharp, machine-precision matrix operations
  - ='randomized': sacrifices a bit of accuracy for computational efficiency
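A small sketch (assumed random data) comparing the two solvers; on data this size both give essentially the same explained-variance ratios, with 'randomized' mainly paying off on large, high-dimensional datasets.

from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.normal(size=(500, 10))          # assumed random data

full_pca = PCA(n_components=2, svd_solver='full').fit(X)
rand_pca = PCA(n_components=2, svd_solver='randomized', random_state=1).fit(X)

print(full_pca.explained_variance_ratio_)
print(rand_pca.explained_variance_ratio_)   # very close to the 'full' result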
In [9]:
# Load and prep data
df = pd.read_csv('~/dat210x/DAT210x/Module4/Datasets/kidney_disease.csv', index_col='id')
df = df.dropna(axis=0)

# Colour labels for plotting: red = chronic kidney disease, green = not
labels = ['red' if i=='ckd' else 'green' for i in df.classification]

# Keep only the numeric features of interest
selected_col = ['bgr','wc','rc']
df[selected_col] = df[selected_col].apply(pd.to_numeric, errors='coerce', axis=1)
df = df[selected_col]
In [16]:
from sklearn.decomposition import PCA

# Create PCA instance
pca = PCA(n_components=2, svd_solver='full')

# One-liner equivalent of the three commented-out steps below
T = pca.fit_transform(df)
#pca.fit(df)
#PCA(copy=True, n_components=2, whiten=False)
#T = pca.transform(df)
Attributes of the PCA model after training with .fit():
- components_: principal component vectors that are linear combinations of the original features. As such, they exist within the feature space of your original dataset.
- explained_variance_: calculated amount of variance that exists in each newly computed principal component.
- explained_variance_ratio_: normalized version of explained_variance_, useful when you want the proportion of total variance explained rather than the raw amount.
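Inspecting these attributes on the pca fitted above (with 3 input features and n_components=2, components_ has one row per component and one column per original feature; the actual values depend on the data):

print(pca.components_.shape)            # (2, 3)
print(pca.explained_variance_)          # variance captured by each principal component
print(pca.explained_variance_ratio_)    # same, as a fraction of the total variance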
Plotting transformed data¶
pca.transform returns a NumPy ndarray, so there are two options:
- plot the NumPy array with Matplotlib
- convert the NumPy array back to a Pandas DataFrame and plot directly through Pandas
In [ ]:
type(T)
In [14]:
# 1. With Matplotlib
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_title('Plot with Matplotlib')
ax.scatter(T[:,0], T[:,1], c='blue', marker='.', alpha=0.75)
plt.show()
In [15]:
# 2. Convert back to Pandas DataFrame
if scaleFeatures: df = scaleFeaturesDF(df)
ax = drawVectors(T, pca.components_, df.columns.values, plt, scaleFeatures)
T = pd.DataFrame(T)
# Plot through Pandas
T.columns = ['component1', 'component2']
T.plot.scatter(x='component1', y='component2', marker='o', c=labels, alpha=0.75, ax=ax)
plt.show()