Data Preprocessing


DSCI 571 - Supervised Learning I

Imputation

Purpose: tackling missing values

  • Most models cannot handle missing values (NaNs)
    • See below for an example
  • Possible solutions
    • Delete the rows
      • Cons: not ideal if the dataset is small, since we throw away information
    • Imputation
      • for categorical features: fill with a constant such as “missing”, or with the mode of the training data
      • for numeric features: fill with the mean or median of the training data
# X_train.info()  # inspect the data: some columns contain NaN values


# Fitting a model on data containing NaNs raises an error:
# from sklearn.neighbors import KNeighborsRegressor
# knn = KNeighborsRegressor()
# knn.fit(X_train, y_train)
# ValueError: ...

How?

Prepare the data frame:
import pandas as pd

df = pd.read_csv("data/canada_usa_cities.csv")
df

     longitude  latitude country
0    -130.0437   55.9773     USA
1    -134.4197   58.3019     USA
2    -123.0780   48.9854     USA
3    -122.7436   48.9881     USA
4    -122.2691   48.9951     USA
..         ...       ...     ...
204   -72.7218   45.3990  Canada
205   -66.6458   45.9664  Canada
206   -79.2506   42.9931  Canada
207   -72.9406   45.6275  Canada
208   -79.4608   46.3092  Canada

209 rows × 3 columns

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")   # learn the median of each column from the training data
imputer.fit(X_train)

X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)       # reuse the training medians on the test data

# To inspect the result: pd.DataFrame(X_train_imp, columns=X_train.columns)
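For categorical features, a similar sketch works (the column set X_train_cat below is hypothetical, used only for illustration): fill missing entries with the mode of the training data or with an explicit constant label.

from sklearn.impute import SimpleImputer

# strategy="most_frequent" fills with the training mode;
# strategy="constant" with fill_value="missing" adds an explicit "missing" category
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
cat_imputer.fit(X_train_cat)                      # X_train_cat: hypothetical categorical columns
X_train_cat_imp = cat_imputer.transform(X_train_cat)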

Scaling

Purpose: putting numeric features on comparable scales

  • Features on very different scales are a huge problem for \(k\)-NN and SVM
    • Distances are dominated by the features with larger values
    • Features with smaller values are effectively ignored, even though they can be highly informative!
    • e.g. a salary feature in the tens of thousands will swamp a years-of-experience feature in single digits (see the sketch after this list)
    • Though not a problem for DecisionTree and Dummy
      • DecisionTree looks at features one at a time
      • Dummy only looks at the target y
  • Our models should not be sensitive to feature scales
  • Possible solutions
    • Normalization: rescale each feature to the range [0, 1]
      • (value - min) / (max - min)
    • Standardization: standardize the values so that each feature has sample (mean, sd) = (0, 1)
      • (value - sample_mean) / sd
    • FIXME: Other two
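A minimal sketch of this scale-dominance problem (toy numbers and made-up feature names, not the course dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales, e.g. [salary, years_of_experience]
X = np.array([[20_000.0, 1.0],
              [21_000.0, 9.0],
              [40_000.0, 1.5]])

# Raw Euclidean distances from the first row: dominated entirely by the salary column
print(np.linalg.norm(X - X[0], axis=1))                  # roughly [0, 1000, 20000]

# After standardization, both features contribute to the distances
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled - X_scaled[0], axis=1))    # roughly [0, 2.2, 2.2]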

How?

# FIXME: load data

Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                          # learn the mean and sd of each feature from the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuse the training statistics on the test data

# To inspect the result: pd.DataFrame(X_train_scaled, columns=X_train.columns)

Normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                          # learn the min and max of each feature from the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# To inspect the result: pd.DataFrame(X_train_scaled, columns=X_train.columns)
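A quick sanity check (a sketch, assuming the arrays above): MinMaxScaler maps the training data exactly onto [0, 1], but test values can fall outside that range because the scaler only learned the training min and max.

# Training data spans [0, 1] by construction
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))

# Test values outside the training min/max end up below 0 or above 1
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))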

One-hot encoding

Purpose: tackling categorical variables

  • In scikit-learn, most algorithms require numeric inputs
    • e.g. for \(k\)-NN, distances cannot be calculated on strings
    • sklearn's DecisionTree does not support categorical features
      • although theoretically it should work
      • see below for an example (FIXME: ValueError: Cannot use median strategy with non-numeric data…)
  • Possible solutions
    • Drop the column(s) (not recommended)
      • those columns might be relevant to the target
    • Transform categorical features to numerics. Two ways:
      • Ordinal encoding
      • One-hot encoding (OHE) \(\leftarrow\) recommended in most cases
        • Create binary columns for each category in the feature

Ordinal encoding

How?

# FIXME: load data
from sklearn.preprocessing import OrdinalEncoder

encode = OrdinalEncoder()
encode.fit(X_train)
X_train_ord = encode.transform(X_train)  # in this example X_train contains a single categorical feature
X_test_ord = encode.transform(X_test)
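The printed scores below presumably come from a loop like this sketch (the DecisionTreeClassifier, the validation split X_valid_ord / y_valid, and the test split are assumptions, not shown in these notes):

from sklearn.tree import DecisionTreeClassifier

for max_depth in range(1, 7):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=123)
    tree.fit(X_train_ord, y_train)
    print(f"Validation accuracy (max_depth={max_depth}):",
          round(tree.score(X_valid_ord, y_valid), 3))

# Refit with the best depth and evaluate once on the test set
best_tree = DecisionTreeClassifier(max_depth=5, random_state=123)
best_tree.fit(X_train_ord, y_train)
print("Test accuracy:", round(best_tree.score(X_test_ord, y_test), 3))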
Validation accuracy (max_depth=1): 0.81
Validation accuracy (max_depth=2): 0.81
Validation accuracy (max_depth=3): 0.833
Validation accuracy (max_depth=4): 0.833
Validation accuracy (max_depth=5): 0.905
Validation accuracy (max_depth=6): 0.881
Test accuracy: 0.762

We typically expect \(E_{train} < E_{validation} < E_{test} < E_{deployment}\).

Cons

  • Might impose unrealistic ordinality on the data
    • i.e. the implied distances between categories do not necessarily make sense
    • In the example below, French and Hindi end up closer together than French and Spanish
df = pd.DataFrame(X_train_ord, …)
pd.concat([X_train, df], axis=1)  # compare the original categories with their encoded values side by side
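A minimal sketch of the issue (toy column; by default OrdinalEncoder orders the categories alphabetically):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"language": ["French", "Hindi", "Spanish"]})
OrdinalEncoder().fit_transform(toy)
# array([[0.], [1.], [2.]])
# French (0) is encoded as closer to Hindi (1) than to Spanish (2),
# an artefact of alphabetical ordering rather than a real relationship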

One-hot encoding (OHE)

How?

from sklearn.preprocessing import OneHotEncoder

encode = OneHotEncoder(handle_unknown="ignore", sparse=False, dtype="int")
encode.fit(X_train)

X_train_ohe = encode.transform(X_train)  # in this example X_train contains a single categorical feature
encode.categories_                       # the categories learned from the training data
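To see the binary columns this creates (a sketch following the visualization pattern above; get_feature_names() is the method in the sklearn version these notes use, newer versions call it get_feature_names_out()):

X_train_ohe = encode.transform(X_train)    # dense array because sparse=False
pd.DataFrame(
    X_train_ohe,
    columns=encode.get_feature_names(),    # one binary column per category
    index=X_train.index,
)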

Combining ALL using Pipeline

Purpose

  • To combine preprocessing with cross-validation correctly
  • To avoid information from the validation folds leaking into the preprocessing (which happens if we build X_train_scaled from all of X_train before cross-validation)

How?

FIXME: Picture for visualization

# option 1
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),  # ← the earlier steps must be transformers
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor()),           # ← the last step has to be a model
    ]
)
# option 2: shorthand
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor(),
)

pipe
# The step names are derived automatically from the lower-cased class names
# e.g. SimpleImputer → "simpleimputer"
# Training only
pipe.fit(X_train, y_train)
pipe.predict(X_train)

# Cross-validation: the imputer and scaler are re-fit on each training fold
from sklearn.model_selection import cross_validate

cross_validate(pipe, X_train, y_train, return_train_score=True)
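A small usage note (a sketch, assuming the objects above): cross_validate returns a dict of arrays, which is convenient to summarize as a DataFrame.

import pandas as pd

scores = cross_validate(pipe, X_train, y_train, return_train_score=True)
pd.DataFrame(scores).mean()   # mean fit time, score time, train and validation scores across folds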

Cons

  • All features are forced to go through the same transformations
    • We want to apply OHE on categorical features, but NOT numeric features
    • We want to apply scaling on numeric features, but NOT categorical features

Combining ALL using ColumnTransformer + Pipeline (this is what we will mostly use)

Purpose

  • In general, we want to apply different preprocessing/transformations on different features
    • For numeric features: Imputation + Scaling
    • For categorical features: Imputation + One-hot encoding

How?

FIXME: Picture for visualization

1) Identify the feature types in the dataset, for example:

numeric_feats = [<colname>, …]
categorical_feats = [<colname>, …]
passthrough_feats = [<colname>, …]
drop_feats  = [<colname>, …] # dropped for simplicity and demonstration

2) Apply the appropriate transformations to each group of columns:

# option 1
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), numeric_feats),
        ("onehot", OneHotEncoder(sparse=False), categorical_feats),
    ]
)
# option 2: shorthand
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(sparse=False), categorical_feats),
    ("passthrough", passthrough_feats),
    ("drop", drop_feats),   # ← these columns would be dropped even without this line (remainder="drop" is the default)
)
ct
X_train_tran_array = ct.fit_transform(X_train)  # returns a np.ndarray, not a DataFrame

column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names().tolist()
    + passthrough_feats
)
print(column_names)
X_train_tran = pd.DataFrame(X_train_tran_array, columns=column_names)
X_train_tran
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pipe = make_pipeline(ct, SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)
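As with the plain Pipeline above, the combined object can be cross-validated directly (a sketch, assuming the objects defined above); the ColumnTransformer is re-fit on each training fold, so nothing leaks from the validation folds.

from sklearn.model_selection import cross_validate

cross_validate(pipe, X_train, y_train, return_train_score=True)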

Pros

  • Builds all our transformations into a single object
    • e.g. we cannot forget to apply one of the transformations to the test data

Cons

  • Problem with cross_validate?