Dummy Classifier/Regressor

Author

DSCI 571 - Supervised Learning I

Use case

Serves as a baseline: a simple model that predicts from simple rules of thumb and ignores the input features. Any real model should be compared against it.

What is it?

For classification: predict the mode (most frequent class) of y_train for every test example
For regression: predict the mean, the median, or a constant value derived from y_train for every test example
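
The two rules above can be checked by hand. The following is a minimal sketch on made-up toy data (the arrays and variable names are assumptions for illustration, not the course datasets). It compares the dummy models' predictions with the mode and mean computed directly from y_train.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Hypothetical toy data; dummy models ignore the feature values entirely
X_train = np.zeros((5, 1))
X_test = np.zeros((3, 1))
y_train_clf = np.array(["A+", "not A+", "not A+", "A+", "not A+"])
y_train_reg = np.array([90.0, 80.0, 85.0, 95.0, 100.0])

clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train_clf)
reg = DummyRegressor(strategy="mean").fit(X_train, y_train_reg)

# Compute the same rules of thumb by hand
labels, counts = np.unique(y_train_clf, return_counts=True)
manual_mode = labels[np.argmax(counts)]
manual_mean = y_train_reg.mean()

clf_pred = clf.predict(X_test)  # every entry equals manual_mode
reg_pred = reg.predict(X_test)  # every entry equals manual_mean
print(clf_pred, reg_pred)
```

Because the features are never used, the predictions are the same for every row of X_test.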

How?

Classification

Read the data into a DataFrame

import pandas as pd

# Prepare data
df = pd.read_csv("data/quiz2-grade-toy-classification.csv")
df

	ml_experience	class_attendance	lab1	lab2	lab3	lab4	quiz1
0	1	1	92	93	84	91	92
1	1	0	94	90	80	83	91
2	0	0	78	85	83	80	80
3	0	1	91	94	92	91	89
4	0	1	77	83	90	92	85

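from sklearn.dummy import DummyClassifier

# Split into features (X) and target (y) without mutating df
X = df.drop(columns=["quiz2"])
y = df["quiz2"]

# Always predict the most frequent class seen during fit
clf = DummyClassifier(strategy="most_frequent")
clf.fit(X, y)
clf.predict(X)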

array(['not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+'], dtype='<U6')

clf.score(X, y)  # accuracy on the training data

0.5238095238095238
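
With strategy="most_frequent", the accuracy is simply the share of the majority class. The score above equals 11/21, which is consistent with 11 of the 21 rows being "not A+". A minimal sketch, assuming a hypothetical label vector with that 11-versus-10 split (the real split of the remaining rows is not shown in these notes):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 11 "not A+" and 10 "A+" (assumed split, 21 rows in total)
y = np.array(["not A+"] * 11 + ["A+"] * 10)
X = np.zeros((len(y), 1))  # features are ignored

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
accuracy = clf.score(X, y)

# Share of the most frequent class, computed by hand
labels, counts = np.unique(y, return_counts=True)
majority_share = counts.max() / counts.sum()
print(accuracy, majority_share)
```

Any model worth keeping should beat this number on the same data.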

Regression

Read the data into a DataFrame

import pandas as pd

df = pd.read_csv("data/quiz2-grade-toy-regression.csv")
df

	ml_experience	class_attendance	lab1	lab2	lab3	lab4	quiz1
0	1	1	92	93	84	91	92
1	1	0	94	90	80	83	91
2	0	0	78	85	83	80	80
3	0	1	91	94	92	91	89
4	0	1	77	83	90	92	85

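from sklearn.dummy import DummyRegressor

# Split into features (X) and target (y) without mutating df
X = df.drop(columns=["quiz2"])
y = df["quiz2"]

# Always predict the mean of the target seen during fit
reg = DummyRegressor(strategy="mean")
reg.fit(X, y)
reg.predict(X)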

array([86.28571429, 86.28571429, 86.28571429, 86.28571429, 86.28571429,
       86.28571429, 86.28571429])

reg.score(X, y)  # R^2 on the training data

0.0
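
An R^2 of 0.0 is expected here. R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean, and a model that always predicts the training mean makes those two sums equal on the training data. A minimal sketch on made-up targets (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical targets; features are ignored
y = np.array([90.0, 80.0, 85.0, 95.0, 100.0])
X = np.zeros((len(y), 1))

reg = DummyRegressor(strategy="mean").fit(X, y)
y_pred = reg.predict(X)

ss_res = np.sum((y - y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = reg.score(X, y)
print(r2_manual, r2_sklearn)
```

On data the model was not fit on, the same baseline can score slightly below 0, because the training mean generally differs from the mean of the new targets.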

Hyperparameters

strategy
- (DummyClassifier) {"most_frequent", "prior", "stratified", "uniform", "constant"}. Default: "prior"
- (DummyRegressor) {"mean", "median", "quantile", "constant"}. Default: "mean"
constant
- required when strategy = "constant"; it is the value that will always be predicted
- for DummyClassifier, the constant must be one of the class labels present in the y passed to fit
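
The hyperparameters above can be tried out as follows. This is a minimal sketch on made-up data (arrays, labels, and variable names are assumptions for illustration). It also uses the `quantile` argument of DummyRegressor, which is not listed above but is needed when strategy="quantile".

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Hypothetical toy data; features are ignored
X = np.zeros((4, 1))
y_clf = np.array(["A+", "not A+", "not A+", "A+"])
y_reg = np.array([70.0, 80.0, 90.0, 100.0])

# Always predict a user-chosen label or value
const_clf = DummyClassifier(strategy="constant", constant="A+").fit(X, y_clf)
const_reg = DummyRegressor(strategy="constant", constant=75.0).fit(X, y_reg)

# quantile=0.5 gives the median of y_reg
quant_reg = DummyRegressor(strategy="quantile", quantile=0.5).fit(X, y_reg)

# A constant that is not among the training labels is rejected at fit time
try:
    DummyClassifier(strategy="constant", constant="B").fit(X, y_clf)
    raised = False
except ValueError:
    raised = True

print(const_clf.predict(X), const_reg.predict(X), quant_reg.predict(X), raised)
```

The constant strategy is useful when the metric of interest depends on predicting a particular (often minority) class, which "most_frequent" would never predict.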