Data Splitting

Posted on Jun 14, 2018 in Notes • 10 min read

Holding out part of the data lets you evaluate a trained model on examples it has not seen, which is how overfitting is detected. It is therefore important to split the data into a training set and a testing set with care.

scikit-learn Implementation

train_test_split

parameters

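  • test_size: a float between 0 and 1 is the fraction of the dataset assigned to the test set; an int is the absolute number of test samples. Defaults to 0.25 when train_size is also unset
  • random_state: set to a fixed integer to make the shuffle, and therefore the results, reproducible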
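
A minimal sketch of how these two parameters behave, assuming a toy list of ten samples (the variable names are illustrative and not part of the original notes). A float test_size is read as a fraction, an int as an absolute count, and a fixed random_state makes the shuffle repeatable:

```python
from sklearn.model_selection import train_test_split

samples = list(range(10))  # toy data, for illustration only

# Float test_size: fraction of the dataset held out for testing.
train_a, test_a = train_test_split(samples, test_size=0.2, random_state=7)

# Integer test_size: absolute number of test samples.
train_b, test_b = train_test_split(samples, test_size=2, random_state=7)

print(len(train_a), len(test_a))  # 8 2

# Same random_state and same split sizes give the same shuffle.
print(test_a == test_b)  # True
```

Leaving random_state unset gives a different shuffle on every run, which is fine for exploration but makes results hard to compare.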

DOCUMENTATION

In [3]:
# Splitting data with sklearn 
from sklearn.model_selection import train_test_split
data   = [0,1,2,3,4, 5,6,7,8,9]  # input samples (a plain list here)
labels = [0,0,0,0,0, 1,1,1,1,1]  # the function we're learning is "x > 4"

data_train, data_test, label_train, label_test = train_test_split(data, labels, test_size=0.5, random_state=7)
Out[3]:
[1, 1, 0, 1, 0]

Type of y (label/answer)

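The feature matrix X MUST be a DataFrame or a 2D array. For y, most scikit-learn estimators expect a Series or 1D array of shape (n_samples,). A single-column DataFrame or 2D array is also accepted, although some classifiers then emit a DataConversionWarning.
When manually slicing

  • y_train = X[['WhiteMale']][X.Year < 1986] returns a DataFrame (2D, one column)
  • y_train = X.WhiteMale[X.Year < 1986] returns a Series (1D); convert it only if a 2D column is needed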
In [ ]:
# Transforms a Series into a 2D column array
# (a pandas Series has no reshape method, so convert to NumPy first)
y_train = y_train.to_numpy().reshape(-1, 1)
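
The shape difference between the two slicing styles can be checked directly. This is a sketch under assumptions: the DataFrame below is a hypothetical stand-in for X with made-up values, and only the column names Year and WhiteMale come from the notes above.

```python
import pandas as pd

# Hypothetical stand-in for X; the values are made up for illustration.
X = pd.DataFrame({
    "Year": [1984, 1985, 1986, 1987],
    "WhiteMale": [10.0, 11.0, 12.0, 13.0],
})

mask = X.Year < 1986

y_frame = X[["WhiteMale"]][mask]  # double brackets -> DataFrame
y_series = X.WhiteMale[mask]      # attribute access -> Series

print(y_frame.shape)   # (2, 1)
print(y_series.shape)  # (2,)

# Convert the Series to a 2D column array.
y_column = y_series.to_numpy().reshape(-1, 1)
print(y_column.shape)  # (2, 1)
```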

Evaluating Model

After training the model on the training data (data_train, label_train), test it on the held-out testing data (data_test, label_test).

In [ ]:
from sklearn.metrics import accuracy_score

# Returns an array of predictions (my_model is any fitted estimator):
predictions = my_model.predict(data_test)
In [ ]:
# The actual answers:
label_test
In [ ]:
# Get accuracy of the model
accuracy_score(label_test, predictions)
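
Putting the steps together, an end-to-end run might look like the sketch below. The notes do not say which estimator my_model is, so DecisionTreeClassifier is used here purely as an assumption. Each sample is also wrapped in a list, because scikit-learn estimators expect a 2D feature matrix.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same toy problem as above, with features reshaped to (n_samples, 1).
data = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

data_train, data_test, label_train, label_test = train_test_split(
    data, labels, test_size=0.5, random_state=7
)

# Assumption: a decision tree stands in for my_model.
my_model = DecisionTreeClassifier(random_state=7)
my_model.fit(data_train, label_train)

# Predict on the held-out data and compare with the true labels.
predictions = my_model.predict(data_test)
accuracy = accuracy_score(label_test, predictions)
print(accuracy)

# accuracy_score is the fraction of matching labels: 4 of 5 match here.
print(accuracy_score([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]))  # 0.8
```

With only five test samples the score moves in steps of 0.2, so a real evaluation needs a much larger test set.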