Data Visualisation
Posted on Jun 14, 2018 in Notes • 12 min read
Visualising data in the preliminary phase is a good way to explore a dataset and build an understanding of it. It is just as important for checking and presenting results. Here are notes on some of the more complicated graphs.
In [35]:
# General pkgs
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
In [36]:
# FOR SOME STYLE
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead
3D Scatter plot
In [37]:
# 3D scatter plot
# Importing Axes3D registers the '3d' projection on older Matplotlib versions
from mpl_toolkits.mplot3d import Axes3D
df_wheat = pd.read_csv("~/dat210x/DAT210x/Module3/Datasets/wheat.data")
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('area')
ax.set_ylabel('perimeter')
ax.set_zlabel('asymmetry')
ax.scatter(df_wheat.area, df_wheat.perimeter, df_wheat.asymmetry, c='red', marker='.')
plt.show()
Higher Dimensionality Visualisations
- Parallel Coordinates
- Andrews Curves
- imshow()
In [38]:
# Examples for Higher Dimensionality visualisations
from sklearn.datasets import load_iris
# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead
# Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
df_iris = pd.DataFrame(data.data, columns=data.feature_names)
df_iris['target_names'] = [data.target_names[i] for i in data.target]
Parallel Coordinates
In [43]:
# Parallel Coordinates
from pandas.plotting import parallel_coordinates
plt.figure()
parallel_coordinates(df_iris, 'target_names')
Andrews Curves
In [44]:
# Andrews Curves
from pandas.plotting import andrews_curves
plt.figure()
andrews_curves(df_iris, 'target_names')
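For reference, an Andrews curve turns each row of features into a finite Fourier series over t in [-pi, pi]: x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ... The sketch below computes one such curve by hand with NumPy. The helper name `andrews_curve` and the sample row are illustrative assumptions, not part of pandas.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation at the points t.

    Hypothetical helper for illustration only.
    """
    row = np.asarray(row, dtype=float)
    # Constant term: x1 / sqrt(2)
    result = np.full_like(t, row[0] / np.sqrt(2.0))
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
# Four made-up feature values standing in for one row of a dataset
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
```

Rows with similar feature values give similar curves, which is why observations of the same class tend to bunch together in the plot.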
imshow()
Matplotlib's .imshow() method generates an image from the values stored in a matrix, or rectangular array, typically of float64s.
e.g. Visualise a high-dimensional correlation matrix, covariance matrix or confusion matrix.
The properties of the generated image depend on the dimensions and contents of the array passed in:
- An [X, Y] shaped array holds one scalar per pixel, which is normalized and mapped through a colormap; with cmap='gray' this gives a grayscale image
- An [X, Y, 3] shaped array results in a full-color image: 1 channel for red, 1 for green, and 1 for blue
- An [X, Y, 4] shaped array results in a full-color image as before, with an extra channel for alpha
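To make the three array shapes concrete, here is a minimal sketch that passes one array of each shape to imshow(). The random 8x8 arrays and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

gray = rng.random((8, 8))      # [X, Y]: one scalar per pixel, mapped through a colormap
rgb = rng.random((8, 8, 3))    # [X, Y, 3]: red, green and blue channels in [0, 1]
rgba = rng.random((8, 8, 4))   # [X, Y, 4]: RGB plus an alpha channel

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
images = [
    axes[0].imshow(gray, cmap="gray", interpolation="nearest"),
    axes[1].imshow(rgb, interpolation="nearest"),
    axes[2].imshow(rgba, interpolation="nearest"),
]
for ax, title in zip(axes, ["[X, Y]", "[X, Y, 3]", "[X, Y, 4]"]):
    ax.set_title(title)
    ax.axis("off")
```

Call plt.show() to display the figure. The cmap argument only affects the first panel, because the RGB and RGBA arrays already carry their own colors.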
In [45]:
# Example dataset
df_rand = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
df_rand.corr()
In [46]:
# Visualise this correlation matrix with imshow()
plt.imshow(df_rand.corr(), cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()
tick_marks = [i for i in range(len(df_rand.columns))]
plt.xticks(tick_marks, df_rand.columns, rotation='vertical')
plt.yticks(tick_marks, df_rand.columns)
2D Decision Boundary for Classification Models
In [47]:
# Plot 2D decision boundary for classification model
def plotDecisionBoundary(model, X, y):
    print("Plotting...")
    fig = plt.figure()
    ax = fig.add_subplot(111)
    padding = 0.1
    resolution = 0.1
    # Class label -> colour (2 for benign, 4 for malignant)
    colors = {2: 'royalblue', 4: 'lightsalmon'}
    # Calculate the boundaries
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_range = x_max - x_min
    y_range = y_max - y_min
    x_min -= x_range * padding
    y_min -= y_range * padding
    x_max += x_range * padding
    y_max += y_range * padding
    # Create a 2D grid matrix. The values stored in the matrix
    # are the predictions of the class at said location
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                         np.arange(y_min, y_max, resolution))
    # What class does the classifier say?
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour map
    plt.contourf(xx, yy, Z, cmap=plt.cm.seismic)
    plt.axis('tight')
    # Plot your testing points as well...
    for label in np.unique(y):
        indices = np.where(y == label)
        plt.scatter(X[indices, 0], X[indices, 1], c=colors[label], alpha=0.8)
    # The title assumes a K-nearest-neighbours model (it reads 'n_neighbors')
    p = model.get_params()
    plt.title('K = ' + str(p['n_neighbors']))
    plt.show()
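The function above hard-codes colours for the labels 2 and 4 and reads n_neighbors for the title, so it expects a K-nearest-neighbours model and that particular labelling. As a rough, self-contained sketch of the same grid-prediction idea, the example below fits a KNeighborsClassifier on the first two Iris features. The choice of dataset, the helper name `predict_on_grid` and the parameter values are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, :2]   # keep two features so the boundary can be drawn in 2D
y = iris.target

model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

def predict_on_grid(model, X, padding=0.1, resolution=0.05):
    """Predict a class for every point of a padded grid covering X.

    Hypothetical helper for illustration only.
    """
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    x_pad = (x_max - x_min) * padding
    y_pad = (y_max - y_min) * padding
    xx, yy = np.meshgrid(
        np.arange(x_min - x_pad, x_max + x_pad, resolution),
        np.arange(y_min - y_pad, y_max + y_pad, resolution),
    )
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, Z

xx, yy, Z = predict_on_grid(model, X)

fig, ax = plt.subplots()
# Filled contours show the predicted class of each grid point
ax.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.viridis)
# Overlay the data points, coloured by their true class
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.viridis, edgecolor="k", s=20)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_title("K = " + str(model.get_params()["n_neighbors"]))
```

Call plt.show() to display the plot. Colouring the points with c=y instead of a fixed label-to-colour dictionary keeps the sketch independent of how the classes are encoded.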