Data Visualisation

Posted on Jun 14, 2018 in Notes • 12 min read

Visualising data is a good way to explore a dataset and build an understanding of it in the preliminary phase of an analysis. It is also essential for checking and presenting results. Here are notes on some of the more complicated graphs.

In [35]:
# General pkgs
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
In [36]:
# FOR SOME STYLE
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead

3D Scatter plot

In [37]:
# 3D scatter plot
# Importing Axes3D registers the '3d' projection (needed on older Matplotlib versions)
from mpl_toolkits.mplot3d import Axes3D

# Load the wheat dataset
df_wheat = pd.read_csv("~/dat210x/DAT210x/Module3/Datasets/wheat.data")

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('area')
ax.set_ylabel('perimeter')
ax.set_zlabel('asymmetry')

ax.scatter(df_wheat.area, df_wheat.perimeter, df_wheat.asymmetry, c='red', marker='.')
plt.show()

Higher Dimensionality Visualisations

  • Parallel Coordinates
  • Andrews Curves
  • imshow()
In [38]:
# Examples for Higher Dimensionality visualisations
from sklearn.datasets import load_iris

# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead

# Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
df_iris = pd.DataFrame(data.data, columns=data.feature_names) 

df_iris['target_names'] = [data.target_names[i] for i in data.target]

Parallel Coordinates

In [43]:
# Parallel Coordinates
from pandas.plotting import parallel_coordinates

plt.figure()
parallel_coordinates(df_iris, 'target_names')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x113adf2b0>

Andrews Curves

In [44]:
# Andrews Curves
from pandas.plotting import andrews_curves
plt.figure()
andrews_curves(df_iris, 'target_names')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a173bb208>
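
As a rough sketch of what the plot above draws: an Andrews curve turns each row of features (x1, x2, x3, ...) into the function x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..., evaluated for t between -pi and pi. Rows with similar features give similar curves. The helper name `andrews_curve` below is hypothetical and is not the pandas implementation; it only evaluates that formula with NumPy. The sample row is just an illustrative set of four measurements.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation `row` at angles `t`."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term: x1 / sqrt(2)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    # Remaining features alternate sin/cos with harmonics 1, 1, 2, 2, 3, 3, ...
    for i, value in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2
        if i % 2 == 1:
            result += value * np.sin(harmonic * t)
        else:
            result += value * np.cos(harmonic * t)
    return result

# Illustrative row with four features, evaluated on 200 points in [-pi, pi]
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
```

Plotting `curve` against `t` for every row, coloured by class, gives a picture of the same kind as the pandas helper.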

imshow()

Matplotlib's .imshow() method generates an image from the values stored in a matrix, i.e. a rectangular array of numbers (typically float64s). Scalar values are normalised to the colormap's range before being drawn.

e.g. Visualise a high-dimensional correlation matrix, covariance matrix, or confusion matrix.

The properties of the generated image will depend on the dimensions and contents of the array passed in:

  • [X, Y] shaped array is treated as scalar data and drawn through a colormap; pass cmap='gray' to get a grayscale image
  • [X, Y, 3] shaped array results in a full-color image: 1 channel for red, 1 for green, and 1 for blue
  • [X, Y, 4] shaped array results in a full-color image as before with an extra channel for alpha
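
A minimal sketch of the three array shapes listed above, using small random arrays. The array sizes, the seed, and the variable names `gray`, `rgb` and `rgba` are assumptions made for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

gray = rng.random((4, 6))                      # [X, Y]: scalar values, coloured via a colormap
rgb = rng.random((4, 6, 3))                    # [X, Y, 3]: red, green, blue, floats in [0, 1]
rgba = np.dstack([rgb, np.full((4, 6), 0.5)])  # [X, Y, 4]: same colours plus a 50% alpha channel

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, (title, image) in zip(axes, [("gray", gray), ("rgb", rgb), ("rgba", rgba)]):
    # cmap only affects the 2D case; it is ignored for RGB(A) input
    ax.imshow(image, cmap="gray", interpolation="nearest")
    ax.set_title(f"{title}: shape {image.shape}")
```

For float input, the colour channels of RGB(A) arrays are expected to lie between 0 and 1.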
In [45]:
# Example dataset
df_rand = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
df_rand.corr()
Out[45]:
          a         b         c         d         e
a  1.000000 -0.015695  0.059968 -0.028391  0.003533
b -0.015695  1.000000  0.018520  0.006865  0.001792
c  0.059968  0.018520  1.000000  0.022543 -0.051418
d -0.028391  0.006865  0.022543  1.000000 -0.009500
e  0.003533  0.001792 -0.051418 -0.009500  1.000000
In [46]:
# Visualise this correlation matrix with imshow()
plt.imshow(df_rand.corr(), cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()
tick_marks = np.arange(len(df_rand.columns))
plt.xticks(tick_marks, df_rand.columns, rotation='vertical')
plt.yticks(tick_marks, df_rand.columns)

2D Decision Boundary for Classification Models

In [47]:
# Plot the 2D decision boundary of a fitted classification model.
# X must be a NumPy array with exactly two feature columns; y holds the class labels.

def plotDecisionBoundary(model, X, y):
    print("Plotting...")

    fig = plt.figure()
    ax = fig.add_subplot(111)

    padding = 0.1
    resolution = 0.1

    # One colour per class label (2 for benign, 4 for malignant);
    # extend this dict if y contains other labels
    colors = {2: 'royalblue', 4: 'lightsalmon'}

    # Calculate the boundaries
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding

    # Create a 2D grid matrix. The values stored in the matrix
    # are the predictions of the class at each grid location
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))

    # What class does the classifier say?
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contour map
    plt.contourf(xx, yy, Z, cmap=plt.cm.seismic)
    plt.axis('tight')

    # Plot your testing points as well...
    for label in np.unique(y):
        indices = np.where(y == label)
        plt.scatter(X[indices, 0], X[indices, 1], c=colors[label], alpha=0.8)

    # The title assumes a k-nearest-neighbours model that exposes 'n_neighbors'
    p = model.get_params()
    plt.title('K = ' + str(p['n_neighbors']))
    plt.show()
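
The grid step in the function above can be hard to picture, so here is a small sketch of it on a 3 x 2 grid. `toy_predict` is a hypothetical stand-in for `model.predict`, not a real classifier; it only shows how the grid is flattened into (x, y) pairs and how the predictions are reshaped back for the contour plot.

```python
import numpy as np

# Tiny grid: 3 x-values (0, 1, 2) and 2 y-values (0, 1)
xx, yy = np.meshgrid(np.arange(0.0, 3.0, 1.0),
                     np.arange(0.0, 2.0, 1.0))
# xx and yy both have shape (2, 3): one entry per grid location

# Flatten both grids and pair them up: one (x, y) row per grid location
grid_points = np.c_[xx.ravel(), yy.ravel()]    # shape (6, 2)

def toy_predict(points):
    """Hypothetical stand-in for model.predict: class 1 where x > y, else 0."""
    return (points[:, 0] > points[:, 1]).astype(int)

# Predict a class for every grid location, then restore the grid shape
Z = toy_predict(grid_points).reshape(xx.shape)
print(Z)
# [[0 1 1]
#  [0 0 1]]
```

Z then has the same shape as xx and yy, which is what plt.contourf expects.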