Data Splitting
Posted on Jun 14, 2018 in Notes • 10 min read
To evaluate a trained model and to detect overfitting, split the data into a training set and a testing set, and keep the testing set out of training entirely.
SciKit-Learn Implementation¶
train_test_split
parameters
test_size
: fraction of the dataset (a float between 0 and 1) to assign to the test set, default=0.25
random_state
: set to a fixed integer to reproduce the same split on every run
In [3]:
# Splitting data with sklearn
from sklearn.model_selection import train_test_split
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]    # input samples
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # the function we're training is "> 4"
data_train, data_test, label_train, label_test = train_test_split(data, labels, test_size=0.5, random_state=7)
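As a quick check of what the two parameters do, the sketch below splits the same toy data twice with the same random_state and once more with test_size left unset. The variable names (first, second, default_train, default_test) are chosen here for illustration only.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Same random_state -> the same shuffled split every time.
first = train_test_split(data, labels, test_size=0.5, random_state=7)
second = train_test_split(data, labels, test_size=0.5, random_state=7)
same_split = first == second

data_train, data_test, label_train, label_test = first

# Leaving test_size unset falls back to the default of 0.25.
default_train, default_test = train_test_split(data, random_state=7)

print(same_split)
print(len(data_train), len(data_test))
print(len(default_train), len(default_test))
```

Each sample keeps its own label after the shuffle, because train_test_split applies the same random indices to every sequence passed in.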
Type of y (label/answer)¶
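The features X MUST be 2-D (a DataFrame or a 2DArray); most scikit-learn estimators accept y as a 1-D Series or array, and a 2-D y is needed only when the estimator or your own code expects a column
When manually slicing, check which type you get:
y_train = X[['WhiteMale']][X.Year < 1986]
returns a DataFrame (2-D) --> OK where a 2-D target is expected
y_train = X.WhiteMale[X.Year < 1986]
returns a Series (1DArray) --> REQUIRES TRANSFORMATION if a 2DArray is needed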
In [ ]:
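# Transforms a Series into a 2DArray with one column.
# A pandas Series has no reshape method in current pandas, so convert to a NumPy array first.
y_train = y_train.to_numpy().reshape(-1, 1)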
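To make the difference between the two slicing styles concrete, here is a small sketch that prints the resulting shapes. The DataFrame contents are made-up values for illustration; only the column names Year and WhiteMale come from the snippets above.

```python
import pandas as pd

# Made-up values, for illustration only.
X = pd.DataFrame({
    "Year": [1980, 1984, 1985, 1986, 1990],
    "WhiteMale": [0.61, 0.58, 0.57, 0.55, 0.52],
})

y_frame = X[["WhiteMale"]][X.Year < 1986]   # DataFrame, shape (3, 1)
y_series = X.WhiteMale[X.Year < 1986]       # Series, shape (3,)

# Series -> 2-D column array
y_column = y_series.to_numpy().reshape(-1, 1)

# DataFrame -> flat 1-D array, for estimators that expect a 1-D target
y_flat = y_frame.to_numpy().ravel()

print(y_frame.shape, y_series.shape, y_column.shape, y_flat.shape)
```

Going through to_numpy() works in both directions, so either slicing style can be converted to whatever shape a given estimator asks for.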
Evaluating Model¶
After training the model on the training data (data_train, label_train), test it on the held-out testing data (data_test, label_test) and compare its predictions with the actual labels.
In [ ]:
from sklearn.metrics import accuracy_score
# my_model is a placeholder for any estimator already fitted on (data_train, label_train).
# predict returns an array of predictions:
predictions = my_model.predict(data_test)
In [ ]:
# The actual answers:
label_test
In [ ]:
# Get accuracy of the model
accuracy_score(label_test, predictions)
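Putting the steps together, a minimal end-to-end sketch might look like the following. DecisionTreeClassifier is an arbitrary stand-in for my_model (the notes do not say which estimator was used), and the 100-sample dataset is invented so that the test set is not tiny. Each sample is wrapped in a one-element list because estimators expect 2-D features.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Features must be 2-D: one row per sample, one column per feature.
data = [[x] for x in range(100)]
labels = [int(x > 49) for x in range(100)]

data_train, data_test, label_train, label_test = train_test_split(
    data, labels, test_size=0.25, random_state=7
)

# DecisionTreeClassifier is an arbitrary stand-in for my_model.
my_model = DecisionTreeClassifier(random_state=0)
my_model.fit(data_train, label_train)

predictions = my_model.predict(data_test)
accuracy = accuracy_score(label_test, predictions)

# accuracy_score is the fraction of predictions that match the labels.
manual = sum(int(p == t) for p, t in zip(predictions, label_test)) / len(label_test)
print(accuracy, manual)
```

The model is fitted only on the training split, so the accuracy on data_test estimates how it behaves on samples it has not seen.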