Random Forest Classification

Posted on Jul 16, 2018 in Notes • 12 min read

Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Instead of relying on a single tree, random forests rely on a forest of cleverly grown decision trees.
Each decision tree in the forest is trained on a random sample drawn from the overall training set. By doing so, each tree sees a slightly different slice of the data, which limits how much the trees resemble one another. This technique is known as tree bagging, or bootstrap aggregating.
This increases accuracy because, while an individual decision tree can become hypersensitive to outliers and localized features, averaging the results of many trees blurs those fringe results out. Using a random forest instead of a single tree therefore decreases the variance of your classification results without increasing the bias the way KNeighborsClassifier does when K is set too high. This only holds if the individual trees are not correlated: if they were all trained on the same training set, they would merely reinforce each other's decisions. Bootstrapping the samples each tree is trained on takes care of that; a rough comparison is sketched below.
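
As a rough illustration (not from the original notes; the synthetic dataset from make_classification is just a placeholder), the cell below compares the cross-validated accuracy of a single decision tree with that of a 100-tree forest. On most runs the forest scores higher and with less spread.
In [ ]:
# Sketch: single decision tree vs. a bagged forest on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
print("tree  :", cross_val_score(tree, X_demo, y_demo, cv=5).mean())
print("forest:", cross_val_score(forest, X_demo, y_demo, cv=5).mean())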

Since each tree within the forest is trained on only a subset of the overall training set, the forest ensemble can error-test itself. It does this by scoring each tree's predictions against that tree's out-of-bag samples. The resulting error is the mean prediction error for each training sample, computed using only the trees that did not have that sample in their bootstrap.
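
As a minimal sketch (again on placeholder data), fitting with oob_score=True exposes that self-test as oob_score_, and oob_decision_function_ holds each training sample's averaged prediction from only the trees that left it out.
In [ ]:
# Sketch of the out-of-bag self-test on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=1000, random_state=0)

oob_model = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   random_state=0)
oob_model.fit(X_demo, y_demo)

# Accuracy estimated from samples each tree never saw during training.
print(oob_model.oob_score_)

# Per-sample class probabilities averaged over only the "left-out" trees.
print(oob_model.oob_decision_function_.shape)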

SciKit-Learn gives you access to an array containing each of the trained trees in the forest, in case you'd like to inspect them individually.
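
For instance, using the oob_model fitted in the sketch above, each entry in estimators_ is an ordinary DecisionTreeClassifier that can be inspected on its own:
In [ ]:
# Each fitted tree in the forest is a standalone DecisionTreeClassifier.
for i, tree in enumerate(oob_model.estimators_[:3]):
    print(i, type(tree).__name__, "depth:", tree.tree_.max_depth)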

SciKit-Learn also supports doing regression with decision trees and random forests. (The target needs to be a continuous variable, and the prediction is calculated as the average of the individual trees' outputs.)
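
A minimal regression sketch (RandomForestRegressor on synthetic placeholder data, not from the original notes):
In [ ]:
# Sketch: random forest regression on placeholder data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=0.5,
                               random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_reg, y_reg)

# Each prediction is the average of the individual trees' outputs.
print(reg.predict(X_reg[:3]))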

Bootstrap / Bagging

Every trained tree is grown from its own bootstrap sample: a random draw, with replacement, from your input data. Training samples that do not appear in a given tree's bootstrap sample are considered out-of-bag for that one tree.

Out-of-bag

The out-of-bag samples for a particular tree are those training samples that were withheld during the training of that particular tree.
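
To make the mechanics concrete, here is an illustrative numpy sketch (not scikit-learn's actual implementation) of one bootstrap draw and the samples it leaves out-of-bag; roughly a third of the training set ends up out-of-bag for any given tree.
In [ ]:
# Illustrative only: one bootstrap draw and its out-of-bag samples.
import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000

# Draw n_samples indices with replacement (a bootstrap sample).
in_bag = rng.randint(0, n_samples, size=n_samples)

# Samples never drawn are out-of-bag for this tree (about 37% on average).
oob = np.setdiff1d(np.arange(n_samples), in_bag)
print(len(oob) / n_samples)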

__init__(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

parameters

  • n_estimators: (int) number of trees
  • bootstrap: (bool) whether to use bootstrap samples when building trees

attributes

  • estimators_: the list of individual fitted decision trees
  • oob_score_: score of the training data obtained using out-of-bag estimates

GOTCHAS

  • All the advantages of decision tree classifiers
  • Almost always boosts your scoring accuracy above that of a single decision tree
  • Both training and prediction are considerably slower than for a single decision tree (roughly in proportion to the number of trees)
  • You lose the ability to easily inspect the structure of the resulting classifier
In [ ]:
from sklearn.ensemble import RandomForestClassifier

# X and y are assumed to be the training features and labels defined earlier.
model = RandomForestClassifier(n_estimators=10, oob_score=True)
model.fit(X, y)
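
Since the model was fit with oob_score=True, a natural follow-up (assuming the fit above succeeded) is to read back the out-of-bag estimate alongside ordinary predictions:
In [ ]:
# Out-of-bag accuracy estimate computed during fit, plus in-sample predictions.
print(model.oob_score_)
predictions = model.predict(X)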