Random Forest Classification

Posted on Jul 16, 2018 in Notes • 12 min read

Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Instead of relying on a single tree, random forests rely on a forest of cleverly grown decision trees.
Each decision tree in the forest is trained on a random sample drawn from the overall training set. By doing so, each tree sees a slightly different slice of the data, which limits how much the trees resemble one another. This technique is known as tree bagging, or bootstrap aggregating.
This increases accuracy because, while an individual decision tree can become hypersensitive to outliers and localized features, averaging the results of many trees blurs those fringe results out. Using a random forest instead of a single tree therefore decreases the variance of your classification results without increasing the bias the way KNeighborsClassifier does when K is set too high. This only holds if the individual trees are not correlated: if they were all trained on the same training set, they would merely reinforce each other's decisions. Bootstrapping the samples each tree is trained on takes care of that; a rough comparison is sketched below.
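
As a rough illustration (not from the original notes; the synthetic dataset from make_classification is just a placeholder), the cell below compares the cross-validated accuracy of a single decision tree with that of a 100-tree forest. On most runs the forest scores higher and with less spread.
In [ ]:
# Sketch: single decision tree vs. a bagged forest on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
print("tree  :", cross_val_score(tree, X_demo, y_demo, cv=5).mean())
print("forest:", cross_val_score(forest, X_demo, y_demo, cv=5).mean())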

Since each tree within the forest is trained on only a subset of the overall training set, the forest ensemble can error-test itself. It does this by scoring each tree's predictions against that tree's out-of-bag samples. The resulting error is the mean prediction error for each training sample, computed using only the trees that did not have that sample in their bootstrap.
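
As a minimal sketch (again on placeholder data), fitting with oob_score=True exposes that self-test as oob_score_, and oob_decision_function_ holds each training sample's averaged prediction from only the trees that left it out.
In [ ]:
# Sketch of the out-of-bag self-test on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)

oob_model = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   random_state=0)
oob_model.fit(X_demo, y_demo)

# Accuracy estimated from samples each tree never saw during training.
print(oob_model.oob_score_)

# Per-sample class probabilities averaged over only the "left-out" trees.
print(oob_model.oob_decision_function_.shape)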

SciKit-Learn gives you access to an array containing each of the trained trees in the forest, in case you'd like to inspect them individually.
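
For instance, using the oob_model fitted in the sketch above, each entry in estimators_ is an ordinary DecisionTreeClassifier that can be inspected on its own:
In [ ]:
# Each fitted tree in the forest is a standalone DecisionTreeClassifier.
for i, tree in enumerate(oob_model.estimators_[:3]):
    print(i, type(tree).__name__, "depth:", tree.tree_.max_depth)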

SciKit-Learn also supports doing regression with decision trees and random forests. (The target needs to be a continuous variable, and the prediction is calculated as the average of the individual trees' outputs.)
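
A minimal regression sketch (RandomForestRegressor on synthetic placeholder data, not from the original notes):
In [ ]:
# Sketch: random forest regression on placeholder data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.5,
                               random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_reg, y_reg)

# Each prediction is the average of the individual trees' outputs.
print(reg.predict(X_reg[:3]))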

Bootstrap / Bagging

Every trained tree is grown from its own bootstrap sample: a random draw, with replacement, from your input data. Training samples that do not appear in a given tree's bootstrap sample are considered out-of-bag for that one tree.

Out-of-bag

The out-of-bag samples for a particular tree are those training samples that were withheld during the training of that particular tree.
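
To make the mechanics concrete, here is an illustrative numpy sketch (not scikit-learn's actual implementation) of one bootstrap draw and the samples it leaves out-of-bag; roughly a third of the training set ends up out-of-bag for any given tree.
In [ ]:
# Illustrative only: one bootstrap draw and its out-of-bag samples.
import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000

# Draw n_samples indices with replacement (a bootstrap sample).
in_bag = rng.randint(0, n_samples, size=n_samples)

# Samples never drawn are out-of-bag for this tree (about 37% on average).
oob = np.setdiff1d(np.arange(n_samples), in_bag)
print(len(oob) / n_samples)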

__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

parameters

  • n_estimators: (int) number of trees
  • bootstrap: (bool) whether to use bootstrap samples when building trees

attributes

  • estimators_: the list of individual fitted decision trees
  • oob_score_: score of the training data obtained using out-of-bag estimates

GOTCHAS

  • All the advantages of decision tree classifiers
  • Almost always boosts your scoring accuracy above that of a single decision tree
  • Both training and prediction are considerably slower than for a single decision tree (roughly in proportion to the number of trees)
  • You lose the ability to easily inspect the structure of the resulting classifier
In [ ]:
from sklearn.ensemble import RandomForestClassifier

# X and y are assumed to be the training features and labels defined earlier.
model = RandomForestClassifier(n_estimators=10, oob_score=True)
model.fit(X, y)
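
Since the model was fit with oob_score=True, a natural follow-up (assuming the fit above succeeded) is to read back the out-of-bag estimate alongside ordinary predictions:
In [ ]:
# Out-of-bag accuracy estimate computed during fit, plus in-sample predictions.
print(model.oob_score_)
predictions = model.predict(X)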