Dummy Classifier/Regressor

Author

DSCI 571 - Supervised Learning I

Use case

Serves as a baseline: a trivial model that predicts from simple rules of thumb and ignores the input features. A real model is only useful if it beats this baseline.

What is it?

  • For classification: predict the mode (most frequent class) of y_train for every example in the test set
  • For regression: predict the mean or median of y_train, or a user-supplied constant, for every example in the test set
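
The two rules above can be sketched by hand, without scikit-learn. The targets below are made up for illustration and are not taken from the course data; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training targets, made up for illustration only
y_train_labels = pd.Series(["A+", "not A+", "not A+", "A+", "not A+"])
y_train_scores = pd.Series([90.0, 80.0, 85.0, 95.0, 70.0])
n_test = 3  # number of test rows to predict for

# Classification baseline: predict the most frequent training label everywhere
mode_label = y_train_labels.mode()[0]
class_preds = np.full(n_test, mode_label)

# Regression baseline: predict the training mean (or median) everywhere
mean_preds = np.full(n_test, y_train_scores.mean())
median_preds = np.full(n_test, y_train_scores.median())

print(class_preds)
print(mean_preds)
print(median_preds)
```

Note that the test features are never consulted: only the number of test rows matters.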

How?

Classification

import pandas as pd

# Prepare data
df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
df
   ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1
0              1                 1    92    93    84    91     92
1              1                 0    94    90    80    83     91
2              0                 0    78    85    83    80     80
3              0                 1    91    94    92    91     89
4              0                 1    77    83    90    92     85
from sklearn.dummy import DummyClassifier
# Separate the target column (quiz2) from the feature columns
X = df.drop(columns=["quiz2"])
y = df["quiz2"]

clf = DummyClassifier(strategy="most_frequent")
clf.fit(X, y)
clf.predict(X)
array(['not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+'], dtype='<U6')
clf.score(X, y) # accuracy
0.5238095238095238
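
With strategy="most_frequent", the accuracy equals the share of the majority class in the data being scored. A score of 0.5238... over 21 rows is consistent with 11 "not A+" rows out of 21. The sketch below assumes that 11/10 split (it is not read from the CSV) to show the relationship.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 11 "not A+" and 10 "A+" (an assumed split, not read from the CSV)
y = np.array(["not A+"] * 11 + ["A+"] * 10)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

clf = DummyClassifier(strategy="most_frequent")
clf.fit(X, y)

accuracy = clf.score(X, y)
majority_share = np.mean(y == "not A+")  # 11 / 21

print(accuracy, majority_share)
```

On an imbalanced dataset this baseline accuracy can be high, which is why it is worth computing before judging a real model.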

Regression

import pandas as pd

df = pd.read_csv("data/quiz2-grade-toy-regression.csv")
df
   ml_experience  class_attendance  lab1  lab2  lab3  lab4  quiz1
0              1                 1    92    93    84    91     92
1              1                 0    94    90    80    83     91
2              0                 0    78    85    83    80     80
3              0                 1    91    94    92    91     89
4              0                 1    77    83    90    92     85
from sklearn.dummy import DummyRegressor
# Separate the target column (quiz2) from the feature columns
X = df.drop(columns=["quiz2"])
y = df["quiz2"]

reg = DummyRegressor(strategy="mean")
reg.fit(X, y)
reg.predict(X)
array([86.28571429, 86.28571429, 86.28571429, 86.28571429, 86.28571429,
       86.28571429, 86.28571429])
reg.score(X, y) # R^2
0.0
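
The R^2 of 0.0 is expected when a mean predictor is scored on its own training data: R^2 = 1 - SS_res / SS_tot, and predicting the mean makes SS_res equal to SS_tot. A minimal sketch with made-up scores (not the course data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical targets, made up for illustration only
y = np.array([70.0, 80.0, 85.0, 90.0, 95.0])
X = np.zeros((len(y), 1))  # placeholder features; DummyRegressor ignores them

reg = DummyRegressor(strategy="mean").fit(X, y)
y_pred = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = reg.score(X, y)

print(r2_manual, r2_sklearn)
```

On held-out data the training mean usually differs from the test mean, so the score there is typically slightly negative rather than exactly 0.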

Hyperparameters

  • strategy
    • (DummyClassifier) {"most_frequent", "prior", "stratified", "uniform", "constant"}. Default: "prior"
    • (DummyRegressor) {"mean", "median", "quantile", "constant"}. Default: "mean"
  • constant
    • required when strategy="constant"; this is the value predicted for every example
    • for DummyClassifier, the constant must be a class label that appears in the training y
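
A minimal sketch of the constant and quantile strategies, using made-up data (not the course CSVs). The quantile strategy also takes a quantile argument; 0.5 corresponds to the median.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Hypothetical data, made up for illustration only
X = np.zeros((5, 1))  # placeholder features; dummy estimators ignore them
y_labels = np.array(["A+", "not A+", "not A+", "A+", "not A+"])
y_scores = np.array([70.0, 80.0, 85.0, 90.0, 95.0])

# Always predict a chosen class label (it must appear in y_labels)
const_clf = DummyClassifier(strategy="constant", constant="A+").fit(X, y_labels)
const_clf_preds = const_clf.predict(X)

# Always predict a chosen number
const_reg = DummyRegressor(strategy="constant", constant=50.0).fit(X, y_scores)
const_reg_preds = const_reg.predict(X)

# Always predict a chosen quantile of the training targets (0.5 = median)
q_reg = DummyRegressor(strategy="quantile", quantile=0.5).fit(X, y_scores)
q_preds = q_reg.predict(X)

# A label that never occurs in the training targets is rejected when fitting
try:
    DummyClassifier(strategy="constant", constant="B").fit(X, y_labels)
    unseen_label_rejected = False
except ValueError:
    unseen_label_rejected = True

print(const_clf_preds, const_reg_preds, q_preds, unseen_label_rejected)
```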