PCA Dimensionality Reduction

Posted on Jun 14, 2018 in Notes • 12 min read

An unsupervised learning technique that reduces the dimensionality of your datasets.
It models a linear subspace of your data by capturing its greatest variability.
It assumes a linear relationship between features.
Goal: capture your dataset's most variant directions, i.e. feature importance ranked by variance.

Under the hood: it accesses your dataset's covariance structure directly, using matrix calculations and eigenvectors to compute the best unique features that describe your samples.

  • be sure to use feature scaling (SciKit-Learn's StandardScaler is a good fit for scaling your data before performing dimensionality reduction)
  • linear transformation only: it rotates and translates the feature space of your samples, but will not skew them
  • all column header names are lost after .transform() (see the sketch below)
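
As a quick illustration of the last point, a minimal sketch (the toy DataFrame and its column names are made up purely for demonstration) showing how to reattach readable names after .transform():

In [ ]:
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data, only to demonstrate the name-loss pattern
toy = pd.DataFrame({'height': [1.6, 1.7, 1.8, 1.9],
                    'weight': [60, 72, 80, 95],
                    'age':    [25, 32, 41, 38]})

T = PCA(n_components=2).fit_transform(toy)   # plain NumPy array: column names are gone

# Wrap it back into a DataFrame with explicit component names
T = pd.DataFrame(T, columns=['component1', 'component2'])
print(T.head())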
In [1]:
# General import
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from sklearn import preprocessing

plt.style.use('ggplot')

Feature scaling

Set scaleFeatures = True to activate feature scaling.

In [2]:
def scaleFeaturesDF(df):
    # Feature scaling is a type of transformation that only changes the
    # scale, but not number of features. Because of this, we can still
    # use the original dataset's column names... so long as we keep in
    # mind that the _units_ have been altered:

    scaled = preprocessing.StandardScaler().fit_transform(df)
    scaled = pd.DataFrame(scaled, columns=df.columns)
    
    print("New Variances:\n", scaled.var())
    print("New Describe:\n", scaled.describe())
    return scaled
scaleFeatures = False
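
A quick sanity check of the helper on a tiny made-up DataFrame (the values are arbitrary); after scaling, every column should have mean 0 and variance close to 1 (not exactly 1, because pandas' .var() uses ddof=1 while StandardScaler normalises with ddof=0):

In [ ]:
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'b': [10.0, 200.0, 30.0, 4000.0]})

# Columns end up on the same scale regardless of their original units
scaled_demo = scaleFeaturesDF(demo)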

Boilerplate Code for visualisation

In [3]:
def drawVectors(transformed_features, components_, columns, plt, scaled):
    if not scaled:
        return plt.axes() # No cheating ;-)

    num_columns = len(columns)

    # This function will project your *original* features (columns)
    # onto your principal component feature-space, so that you can
    # visualize how "important" each one was in the
    # multi-dimensional scaling

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    ## visualize projections

    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)

    ax = plt.axes()

    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax

PCA concept

In [4]:
def computePCA(input_data):
    # Centre the data: subtract each feature's mean, then transpose so
    # rows are features (the layout np.cov expects by default)
    M = (input_data - input_data.T.mean(axis=1)).T
    
    # Compute the covariance matrix of the features
    C = np.cov(M)
    
    # Compute eigen values + vectors of the matrix
    latent, coeff = np.linalg.eig(C)
    
    # Sort eigenvalues in descending order of importance
    index = np.argsort(latent)[::-1]
    
    # Sort eigenvectors by their sorted eigenvalues
    coeff = coeff[:, index]
    latent = latent[index]
    
    # Return the sorted eigenvectors (principal components) and eigenvalues
    return coeff, latent
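
To sanity-check the concept code, the eigenvalues it returns can be compared with SciKit-Learn's explained_variance_ on the same data. A sketch with a hypothetical random matrix; the two should agree up to floating-point error, since np.cov uses the same (n - 1) normalisation:

In [ ]:
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                      # made-up data: 100 samples, 3 features

coeff, latent = computePCA(X)             # coeff holds the eigenvectors, one per column
skl = PCA(n_components=3).fit(X)

print(latent)                             # eigenvalues, largest first
print(skl.explained_variance_)            # should match up to numerical precision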

SciKit-Learn PCA

  • n_components: the number of dimensions you wish to keep. It should be <= the number of original features.
  • svd_solver: dictates whether a full singular value decomposition is performed on your data, or a randomized, truncated one (sketched below).
    • ='full': razor-sharp, machine-precision matrix operations
    • ='randomized': sacrifices a bit of accuracy for computational efficiency
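
A rough sketch of the solver trade-off on a larger, purely synthetic matrix (the shapes and data here are illustrative assumptions, not the notebook's dataset):

In [ ]:
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(5000, 300)                   # hypothetical wide dataset

pca_full = PCA(n_components=10, svd_solver='full').fit(X)
pca_rand = PCA(n_components=10, svd_solver='randomized',
               random_state=0).fit(X)     # approximate, but faster on large data

# The leading explained-variance ratios should be nearly identical
print(pca_full.explained_variance_ratio_[:3])
print(pca_rand.explained_variance_ratio_[:3])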
In [9]:
# Load and prep data
df = pd.read_csv('~/dat210x/DAT210x/Module4/Datasets/kidney_disease.csv', index_col='id')
df = df.dropna(axis=0)
labels = ['red' if i=='ckd' else 'green' for i in df.classification]
selected_col = ['bgr','wc','rc']
df[selected_col] = df[selected_col].apply(pd.to_numeric, errors='coerce', axis=1)
df = df[selected_col]
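
One caveat: since errors='coerce' runs after the dropna() above, any non-numeric strings left in these columns become fresh NaNs, which PCA cannot handle. A quick check before fitting:

In [ ]:
# Any non-zero count here means more rows (and their matching labels) need dropping
print(df.isnull().sum())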
In [16]:
from sklearn.decomposition import PCA

# Scale features (if enabled) before fitting PCA, so the components
# are computed on the standardised data
if scaleFeatures: df = scaleFeaturesDF(df)

# Create PCA instance
pca = PCA(n_components=2, svd_solver='full')

# One-liner for the following three lines
T = pca.fit_transform(df)

#pca.fit(df)
#PCA(copy=True, n_components=2, whiten=False)
#T = pca.transform(df)

Attributes of the PCA model after training with .fit()

  • components_: Principal component vectors that are linear combinations of the original features. As such, they exist within the feature space of your original dataset.
  • explained_variance_: Calculated amount of variance which exists in the newly computed principal components.
  • explained_variance_ratio_: Normalized version of explained_variance_, useful when you care about the proportion of total variance each component explains (see the sketch below).
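
For the model fitted above, these attributes can be inspected directly (the actual values depend on the data, so nothing specific is claimed here):

In [ ]:
print(pca.components_)                  # shape (2, 3): one row per principal component
print(pca.explained_variance_)          # variance captured by each component
print(pca.explained_variance_ratio_)    # proportions of total variance (sums to <= 1 here)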

Plotting transformed data

pca.transform() returns a NumPy ndarray. There are two ways to plot it:

  1. plot the NumPy array with Matplotlib
  2. convert the NumPy array back to a Pandas DataFrame and plot directly through Pandas
In [ ]:
type(T)
In [14]:
# 1. With Matplotlib
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_title('Plot with Matplotlib')
ax.scatter(T[:,0], T[:,1], c='blue', marker='.', alpha=0.75)
plt.show()
In [15]:
# 2. Convert back to Pandas DataFrame
ax = drawVectors(T, pca.components_, df.columns.values, plt, scaleFeatures)
T  = pd.DataFrame(T)

# Plot through Pandas
T.columns = ['component1', 'component2']
T.plot.scatter(x='component1', y='component2', marker='o', c=labels, alpha=0.75, ax=ax)

plt.show()