Data Evaluation

Posted on Jul 16, 2018 in Notes • 15 min read

Choosing ML Algorithms

  • The dimensionality of your data
  • The geometric nature of your data
  • The types of features used to represent your data
  • The number of training samples you have at your disposal
  • The required training and prediction speeds needed for your purposes
  • The predictive accuracy level desired
  • How configurable you need your model to be

Scikit learn

Scikit-learn ML Cheat Sheet

title

Microsoft Azure

Microsoft Azure ML Algorithm Cheat Sheet Article

title

Evaluating Algorithm Performance

Confusion Matrix

A confusion matrix displays your model's predicted (testing set) outputs against the true observational values.
Traditionally, the predicted targets are aligned on the X-axis of the matrix, and the true values are aligned on the Y-axis.

In [5]:
import sklearn.metrics as metrics
y_true = [1, 1, 2, 2, 3, 3]  # Actual, observed testing dataset values
y_pred = [1, 1, 1, 3, 2, 3]  # Values predicted 
metrics.confusion_matrix(y_true, y_pred)
Out[5]:
array([[2, 0, 0],
       [1, 0, 1],
       [0, 1, 1]])

title

Diagonals from the top left to the bottom right: correctly classified labels
Sum up the values in a row: true count of data
Sum up the values in a column: predicted count of model

In [8]:
# Visulisation of the confusion matrix
import matplotlib.pyplot as plt

columns = ['Cat', 'Dog', 'Monkey']
confusion = metrics.confusion_matrix(y_true, y_pred)

plt.imshow(confusion, cmap=plt.cm.Blues, interpolation='nearest')
plt.xticks([0,1,2], columns, rotation='vertical')
plt.yticks([0,1,2], columns)
plt.colorbar()

plt.show()

Scoring Metric

  • Condition Positive (P): Actual positive samples
  • Condition Negative (N): Actual negative samples
  • True Positive (TP/hit): Positives correctly predicted
  • True Negative (TN): Negatives correctly predicted
  • False Postive (FP/false alarm): Negatives predicted as positives
  • Fales Negative (FN/miss): Positives predicted as negatives
  1. True Positive Rate/Sensitivity/Model Recall/Hit Rate: TP/P
  2. True Negative Rate/Specificity: TN/N
  3. Recall Score: TP/(TP+FN)
  4. Precision: TP/(TP+FP)
  5. F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
  6. Positive Predictive Value/Precision: TP/(TN+FP)
  7. Negative Predictive Value: TN/(TN+FP)
In [9]:
# Same as `model.score()`
metrics.accuracy_score(y_true, y_pred)
Out[9]:
0.5
In [10]:
# Recall Score
metrics.recall_score(y_true, y_pred, average='weighted')
Out[10]:
0.5
In [11]:
# Precision Score
metrics.precision_score(y_true, y_pred, average='weighted')
Out[11]:
0.38888888888888884
In [12]:
# F1 Score
metrics.f1_score(y_true, y_pred, average='weighted')
Out[12]:
0.43333333333333335
In [15]:
# Full Report on a per label basis
target_names = ['Cat', 'Dog', 'Monkey']
print(metrics.classification_report(y_true, y_pred, target_names=target_names)) # Must be printed for formmated result
             precision    recall  f1-score   support

        Cat       0.67      1.00      0.80         2
        Dog       0.00      0.00      0.00         2
     Monkey       0.50      0.50      0.50         2

avg / total       0.39      0.50      0.43         6

Cross Validation

Cross validation allows you to use the same training data to both fit and score your model without the need for an additional validation set.
It allows you to use all the data you provide as both training and testing.
It simplifies the overall process.

Problems with train_test_split()

  • Without a deterministic selection of training data and testing data, you might train using the best subset of data but test on outliers, or some permutation in-between.
  • By withholding data from training, you essentially lose some of your training data.
  • Some information of testing set leaked into your training set during iterations over the configurable parameters.

cross_val_score()

  1. input: model, training set, 'K' (K-fold cross validations)
  2. training set cut into 'K' sets
  3. model duplicated into 'K' versions
  4. each version of the model trained with 'K-1' sets of training data
  5. each version of the model evaluated with the out-of-bag set

cv parameter

  • (None) uses the default 3-fold cross validation
  • (int) number of folds in a (Stratified)K-Fold
  • (str) an object to be used as a cross-validation generator
    1. Leave-One-Out: ideally each having all samples except one.
    2. K-Fold: ideally of equal size.
    3. Stratified K-Fold: ideally each group having the same proportion of target classes.
    4. Label K-Fold: ideally the same target never appearing in both testing and training groups simultaneously.

if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
DOCUMENTATION

In [ ]:
# 10-Fold Cross Validation on your training data   
from sklearn.model_selection import cross_val_score   
cross_val_score(model, X_train, y_train, cv=10)   # returns an array of cval scores for each version of the model
cross_val_score(model, X_train, y_train, cv=10).mean()   # return the mean score of all versions of the model

Process

  1. Split your data into training, validation, and testing sets.
  2. Setup a model, and fit it with your training set
  3. Access the accuracy of its output using your validation set
  4. Fine tune this accuracy by adjusting the hyper-parameters of your model
  5. when you're comfortable with its accuracy, finally evaluate your model with the testing set

OR

  1. Split your data into training and testing sets.
  2. Setup a model with cross validation and fit / score it with your training set
  3. Fine tune this accuracy by adjusting the hyper-parameters of your model
  4. When you're comfortable with its accuracy, finally evaluate your model with the testing set

Power Tuning

GridSearchCV

Scikit-learn's systematic way of tuning parameters with end-to-end cross validation.
Explicitly define the parameters you want tested.
Pass in an estimator, a grid of parameters you want optimized, and your cv split value.

DOCUMENTATION

In [17]:
from sklearn import svm, grid_search, datasets
iris = datasets.load_iris()   
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 5, 10]}   
model = svm.SVC() 
classifier = grid_search.GridSearchCV(model, parameters)   
classifier.fit(iris.data, iris.target)    
Out[17]:
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [1, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

RandomizedSearchCV

Randomized parameter optimization.
Pass in your parameters as a single dictionary that holds either possible, discrete parameter values or distribution over them.
SciPy's Statistics module have many such functions you can use to create continuous, discrete, and multivariate type distributions, such as expon, gamma, uniform, randint and many more

DOCUMENTATION

In [19]:
# Create dictionary of distributions of parameters
from scipy import stats

parameter_dist = {  
  'C': stats.expon(scale=100),   
  'kernel': ['linear'],   
  'gamma': stats.expon(scale=.1),   
}   

classifier = grid_search.RandomizedSearchCV(model, parameter_dist)   
classifier.fit(iris.data, iris.target)  
Out[19]:
RandomizedSearchCV(cv=None, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11692e128>, 'kernel': ['linear'], 'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11692e278>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)

Pipelining

A Scikit-learn class that wraps around your entire data analysis pipeline from start to finish, and allows you to interact with the pipeline as if it were a single white-box, configurable estimator. Cnce your pipeline has been built, since the pipeline inherits from the estimator base class, you can use it pretty much anywhere you'd use regular estimators—including in your cross validator method. Doing so, you can simultaneously fine tune the parameters of each of the estimators and predictors that comprise your data-analysis pipeline.

Usage

  • Every intermediary model within the pipeline must be a transformer, i.e. its class must implement both the .fit() and the .transform() methods.
  • The very last step in your analysis pipeline only needs to implement the .fit() method, since it will not be feeding data into another step.
  • Two underscores after estimator names and before parameters
  • The pipeline class only has a single attribute called .named_steps, which is a dictionary containing the estimator names you specified as keys.

DOCUMENTATION

In [25]:
# Pipeline example
from sklearn.pipeline import Pipeline
from sklearn.decomposition import RandomizedPCA

svc = svm.SVC(kernel='linear')
pca = RandomizedPCA()

pipeline = Pipeline([
  ('pca', pca),
  ('svc', svc)
])

pipeline.set_params(pca__n_components=5, svc__C=1, svc__gamma=0.0001)
pipeline.fit(X, y)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:58: DeprecationWarning: Class RandomizedPCA is deprecated; RandomizedPCA was deprecated in 0.18 and will be removed in 0.20. Use PCA(svd_solver='randomized') instead. The new implementation DOES NOT store whiten ``components_``. Apply transform to get them.
  warnings.warn(msg, category=DeprecationWarning)
Out[25]:
Pipeline(memory=None,
     steps=[('pca', RandomizedPCA(copy=True, iterated_power=2, n_components=5, random_state=None,
       whiten=False)), ('svc', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

IMPORTANT: Many of the predictors don't actually implement .transform()! Due to this, by default, you won't be able to use SVC, Linear Regression, or Decision Trees, etc. as intermediary steps within your pipeline. A very nifty hack you should be aware of to circumvent this is by writing your own transformer class, which simply wraps a predictor and masks it as a transformer:

In [26]:
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # This is the magic =)
        return DataFrame(self.model.predict(X))