Reducing Generalization Error

Author

DSCI 571 - Supervised Learning I

Purpose

Fundamental goal of ML: To generalize beyond what we see in the training samples

We often have access to only a limited amount of training data, but we want to learn a mapping function that predicts the target reasonably well beyond that training data

However, it’s impossible to measure the generalization error directly in practice!

Solution (a common approach): approximate, and eventually reduce, this error via data splitting
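
For concreteness, one common way to write the two quantities (the loss \(\ell\) and the data distribution \(\mathcal{D}\) are generic notation introduced here, not defined elsewhere in these notes): the generalization error is an expectation over the unknown data-generating distribution, while the training error only averages over the \(n\) examples we actually observe.

\[
E_{\text{gen}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell\big(f(x), y\big)\right]
\qquad \text{vs.} \qquad
E_{\text{train}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
\]

Because \(\mathcal{D}\) is unknown, \(E_{\text{gen}}\) cannot be computed directly; scoring on a held-out split gives an estimate of it instead.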

How?

80%-20% train-test split

prepare df
import pandas as pd

df = pd.read_csv("data/canada_usa_cities.csv")
df
longitude latitude country
0 -130.0437 55.9773 USA
1 -134.4197 58.3019 USA
2 -123.0780 48.9854 USA
3 -122.7436 48.9881 USA
4 -122.2691 48.9951 USA
... ... ... ...
204 -72.7218 45.3990 Canada
205 -66.6458 45.9664 Canada
206 -79.2506 42.9931 Canada
207 -72.9406 45.6275 Canada
208 -79.4608 46.3092 Canada

209 rows × 3 columns

from sklearn.model_selection import train_test_split
df_1 = df.copy()
y, X = df_1.pop("country"), df_1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
alternative method (useful when performing exploratory data analysis or visualization on df_train)
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=123
)
# # or, via train_size,
# df_train, df_test = train_test_split(
#     df, train_size=0.8, random_state=123
# )

y_train, X_train = df_train.pop("country"), df_train
y_test, X_test = df_test.pop("country"), df_test
visualize training data
import mglearn
import matplotlib.pyplot as plt

mglearn.discrete_scatter(X_train['longitude'], X_train['latitude'], y_train, s=12)
plt.xlabel('longitude')
plt.ylabel('latitude')
[Scatter plot of the training data: longitude vs. latitude, colored by country]

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

print(f"Train accuracy: {round(clf.score(X_train, y_train), 3)}")
print(f"Test accuracy: {round(clf.score(X_test, y_test), 3)}") # ~ generalization error
Train accuracy: 1.0
Test accuracy: 0.738

Cons

  • We have approximated the generalization error, but we have not reduced it yet.
  • We built a perfect model on the training data, but it does not generalize well to the test data!

Train-validation-test split

  • We train our model using the train split
  • And score it using the validation split
  • If the score is not good, we train another model (e.g., with different hyperparameters) using the train split and score it again on the validation split
  • Repeat this process until reaching a satisfactory score on the validation split (i.e., hyperparameter tuning)
  • Score the model on the test split once to estimate generalization
             fit     score    predict
Train        ✔️       ✔️       ✔️
Validation            ✔️       ✔️
Test                  once     once
Deployment                     ✔️
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier # example

# data split
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=123
)
df_train, df_validation = train_test_split(
    df_train, test_size=0.25, random_state=123  # 0.25 of the 80% train portion = 20% of all data -> 60/20/20 split
)

y_train, X_train = df_train.pop("country"), df_train
y_validation, X_validation = df_validation.pop("country"), df_validation
y_test, X_test = df_test.pop("country"), df_test

# (iteratively) train model and score
for depth in range(1, 7):
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)

    #print(f"Train accuracy (max_depth={depth}): {round(clf.score(X_train, y_train), 3)}")
    print(f"Validation accuracy (max_depth={depth}): {round(clf.score(X_validation, y_validation), 3)}")

# test model
clf = DecisionTreeClassifier(max_depth=5)  # best validation score above
clf.fit(X_train, y_train)
print(f"Test accuracy: {round(clf.score(X_test, y_test), 3)}") # ~ generalization error
Validation accuracy (max_depth=1): 0.81
Validation accuracy (max_depth=2): 0.81
Validation accuracy (max_depth=3): 0.833
Validation accuracy (max_depth=4): 0.833
Validation accuracy (max_depth=5): 0.905
Validation accuracy (max_depth=6): 0.881
Test accuracy: 0.762

We typically expect \(E_{\text{train}} < E_{\text{validation}} < E_{\text{test}} < E_{\text{deployment}}\).

Pros

  • Able to both reduce and approximate the generalization error
  • Much better than the train-test split, where we only examined training accuracy before looking at the test set.

Cons

  • If the data set is small, the validation set will be tiny and may not be representative of the test set.

Cross-validation

  • Split the training data into \(k\) folds
  • Each fold takes a turn as the validation set (see the sketch below)
  • Validation score statistics = mean/standard deviation of the scores across folds (one score per fold)
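
A minimal sketch of what this looks like under the hood, assuming the X_train/y_train from the split above and building the folds explicitly with scikit-learn's KFold (the variable names are illustrative, not from the lecture code):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# build k=5 folds; each iteration holds out a different fold for validation
kf = KFold(n_splits=5, shuffle=True, random_state=123)

fold_scores = []
for train_idx, valid_idx in kf.split(X_train):
    clf = DecisionTreeClassifier(max_depth=4)
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])  # fit on the other k-1 folds
    fold_scores.append(
        clf.score(X_train.iloc[valid_idx], y_train.iloc[valid_idx])  # score on the held-out fold
    )

print(f"Scores across folds: {np.round(fold_scores, 3)}")
print(f"Mean: {np.mean(fold_scores):.3f}, SD: {np.std(fold_scores):.3f}")

This is roughly what cross_validate does for us below; note that for classifiers scikit-learn actually defaults to stratified folds when cv is an integer, so the exact fold assignments will differ.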

training and validation

from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier # example

# data split
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=123
)
y_train, X_train = df_train.pop("country"), df_train
y_test, X_test = df_test.pop("country"), df_test

# train model and score
clf = DecisionTreeClassifier(max_depth=4)

scores = cross_validate(clf, X_train, y_train, cv=10, return_train_score=True) # is a dictionary
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.002524 0.002084 0.764706 0.913333
1 0.002855 0.002135 0.823529 0.906667
2 0.001225 0.000673 0.705882 0.906667
3 0.001533 0.000713 0.941176 0.900000
4 0.000889 0.000631 0.823529 0.906667
5 0.000849 0.000594 0.823529 0.913333
6 0.001619 0.000639 0.705882 0.920000
7 0.001016 0.000566 0.937500 0.900662
8 0.000744 0.000459 0.937500 0.900662
9 0.000704 0.000463 0.937500 0.900662
print(f"Average cv scores: {round(scores['test_score'].mean(), 2)}")
print(f"SD of cv scores: {round(scores['test_score'].var()**0.5, 2)}")
Average cv scores: 0.84
SD of cv scores: 0.09

training, validation and testing

from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier # example

# data split
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=123
)
y_train, X_train = df_train.pop("country"), df_train
y_test, X_test = df_test.pop("country"), df_test

# (iteratively) train model and score
for depth in range(1, 7):
    clf = DecisionTreeClassifier(max_depth=depth)
    
    scores = cross_validate(clf, X_train, y_train, cv=10, return_train_score=True) # is a dictionary

    print(f"Average cv scores (max_depth={depth}): {round(scores['test_score'].mean(), 3)}")
    print(f"SD of cv scores (max_depth={depth}): {round(scores['test_score'].var()**0.5, 3)}\n")

# test model
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
print(f"Test accuracy: {round(clf.score(X_test, y_test), 3)}") # ~ generalization error, comparable to cv error
Average cv scores (max_depth=1): 0.81
SD of cv scores (max_depth=1): 0.085

Average cv scores (max_depth=2): 0.804
SD of cv scores (max_depth=2): 0.086

Average cv scores (max_depth=3): 0.804
SD of cv scores (max_depth=3): 0.09

Average cv scores (max_depth=4): 0.84
SD of cv scores (max_depth=4): 0.09

Average cv scores (max_depth=5): 0.846
SD of cv scores (max_depth=5): 0.083

Average cv scores (max_depth=6): 0.815
SD of cv scores (max_depth=6): 0.061

Test accuracy: 0.833

Pros

  • More powerful: it works even for small data sets!
  • Able to examine the variation in the scores across folds
  • Gives a more robust estimate of the error on unseen data

Remarks

  • We use test error to approximate generalization error (or deployment error)
  • If the test error is “reasonable”, we will deploy the model
  • We typically expect \(E_{\text{train}} < E_{\text{validation}} < E_{\text{test}} < E_{\text{deployment}}\). Below, \(E_{\text{best}}\) denotes the error of the best possible model (the irreducible error).

What is underfitting?

  • The model is too simple
  • Both training and validation errors are similarly high
  • \(E_{\text{best}} < E_{\text{train}} \lesssim E_{\text{validation}}\)

What is overfitting?

  • The model is too complex and overly specialized to the training data
  • Training error is very low, and there is a large gap between training and validation error
  • \(E_{\text{train}} < E_{\text{best}} < E_{\text{validation}}\)

What is Bias vs. Variance tradeoff?

  • A fundamental tradeoff in supervised learning
    • Complexity \(\uparrow\) \(\Rightarrow\) \(E_\text{train} \downarrow\) but \((E_\text{validation} - E_\text{train}) \uparrow\) (see the sketch below)
  • Bias \(\Leftrightarrow\) Underfitting: the tendency to consistently learn the same wrong thing
  • Variance \(\Leftrightarrow\) Overfitting: the tendency to learn random things irrespective of real signals
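
A minimal sketch of this tradeoff on the same data, assuming the X_train/y_train from the cross-validation section and scikit-learn's validation_curve helper (not used elsewhere in these notes); we would expect training accuracy to keep climbing with max_depth while cross-validation accuracy peaks and then drops as the gap widens:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

depths = np.arange(1, 16)

# train/cross-validation accuracy for each max_depth (rows) across folds (columns)
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(), X_train, y_train,
    param_name="max_depth", param_range=depths, cv=10
)

plt.plot(depths, train_scores.mean(axis=1), label="train")
plt.plot(depths, cv_scores.mean(axis=1), label="cross-validation")
plt.xlabel("max_depth (model complexity)")
plt.ylabel("accuracy")
plt.legend()

The depth range here is arbitrary; any hyperparameter that controls model complexity would show the same qualitative picture.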


Golden rule: The test data cannot influence the training phase in any way

  • To avoid breaking it, we keep our test set in an imaginary vault from the moment we split the data, and only open it once at the very end (see the recap sketch below)
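
As an illustrative recap under that rule (the variable names and the max_depth grid are mine, not from the lecture): every tuning decision uses cross-validation on the training split only, and the test set appears exactly once, at the end.

import pandas as pd
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data/canada_usa_cities.csv")

# split once, then put the test portion "in the vault"
df_train, df_test = train_test_split(df, test_size=0.2, random_state=123)
y_train, X_train = df_train.pop("country"), df_train
y_test, X_test = df_test.pop("country"), df_test

# all model selection happens on the training split only
cv_means = {}
for depth in range(1, 7):
    clf = DecisionTreeClassifier(max_depth=depth)
    cv_means[depth] = cross_validate(clf, X_train, y_train, cv=10)["test_score"].mean()

best_depth = max(cv_means, key=cv_means.get)  # ties go to the smallest depth

# the test set is touched exactly once, after all decisions are made
final_clf = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
print(f"Best max_depth by cross-validation: {best_depth}")
print(f"Test accuracy: {final_clf.score(X_test, y_test):.3f}")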