Data Preprocessing


DSCI 571 - Supervised Learning I

Imputation

Purpose: tackling missing values

  • Most models cannot handle missing values (NaNs)
    • See below for an example
  • Possible solutions
    • Delete the rows
      • Cons: not ideal if the dataset is small, since we throw away information
    • Imputation
      • for categorical features: fill with a constant such as “missing”, or with the mode of the training data
      • for numeric features: fill with the mean or median of the training data
# X_train.info()  # inspect the data: some columns contain NaN values


# Fitting a model on data containing NaNs raises an error:
# from sklearn.neighbors import KNeighborsRegressor
# knn = KNeighborsRegressor()
# knn.fit(X_train, y_train)
# ValueError: ...

How?

Prepare the data frame:
import pandas as pd

df = pd.read_csv("data/canada_usa_cities.csv")
df

     longitude  latitude country
0    -130.0437   55.9773     USA
1    -134.4197   58.3019     USA
2    -123.0780   48.9854     USA
3    -122.7436   48.9881     USA
4    -122.2691   48.9951     USA
..         ...       ...     ...
204   -72.7218   45.3990  Canada
205   -66.6458   45.9664  Canada
206   -79.2506   42.9931  Canada
207   -72.9406   45.6275  Canada
208   -79.4608   46.3092  Canada

209 rows × 3 columns

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")   # learn the median of each column from the training data
imputer.fit(X_train)

X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)       # reuse the training medians on the test data

# To inspect the result: pd.DataFrame(X_train_imp, columns=X_train.columns)
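For categorical features, a similar sketch works (the column set X_train_cat below is hypothetical, used only for illustration): fill missing entries with the mode of the training data or with an explicit constant label.

from sklearn.impute import SimpleImputer

# strategy="most_frequent" fills with the training mode;
# strategy="constant" with fill_value="missing" adds an explicit "missing" category
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
cat_imputer.fit(X_train_cat)                      # X_train_cat: hypothetical categorical columns
X_train_cat_imp = cat_imputer.transform(X_train_cat)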

Scaling

Purpose: putting numeric features on comparable scales

  • Features on very different scales are a huge problem for \(k\)-NN and SVM
    • Distances are dominated by the features with larger values
    • Features with smaller values are effectively ignored, even though they can be highly informative!
    • e.g. a salary feature in the tens of thousands will swamp a years-of-experience feature in single digits (see the sketch after this list)
    • Though not a problem for DecisionTree and Dummy
      • DecisionTree looks at features one at a time
      • Dummy only looks at the target y
  • Our models should not be sensitive to feature scales
  • Possible solutions
    • Normalization: rescale each feature to the range [0, 1]
      • (value - min) / (max - min)
    • Standardization: standardize the values so that each feature has sample (mean, sd) = (0, 1)
      • (value - sample_mean) / sd
    • FIXME: Other two
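A minimal sketch of this scale-dominance problem (toy numbers and made-up feature names, not the course dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales, e.g. [salary, years_of_experience]
X = np.array([[20_000.0, 1.0],
              [21_000.0, 9.0],
              [40_000.0, 1.5]])

# Raw Euclidean distances from the first row: dominated entirely by the salary column
print(np.linalg.norm(X - X[0], axis=1))                  # roughly [0, 1000, 20000]

# After standardization, both features contribute to the distances
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled - X_scaled[0], axis=1))    # roughly [0, 2.2, 2.2]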

How?

# FIXME: load data

Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                          # learn the mean and sd of each feature from the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuse the training statistics on the test data

# To inspect the result: pd.DataFrame(X_train_scaled, columns=X_train.columns)

Normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                          # learn the min and max of each feature from the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# To inspect the result: pd.DataFrame(X_train_scaled, columns=X_train.columns)
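A quick sanity check (a sketch, assuming the arrays above): MinMaxScaler maps the training data exactly onto [0, 1], but test values can fall outside that range because the scaler only learned the training min and max.

# Training data spans [0, 1] by construction
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))

# Test values outside the training min/max end up below 0 or above 1
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))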

One-hot encoding

Purpose: tackling categorical variables

  • In scikit-learn, most algorithms require numeric inputs
    • e.g. for \(k\)-NN, distances cannot be calculated on strings
    • sklearn's DecisionTree does not support categorical features
      • although theoretically it should work
      • see below for an example (FIXME: ValueError: Cannot use median strategy with non-numeric data…)
  • Possible solutions
    • Drop the column(s) (not recommended)
      • those columns might be relevant to the target
    • Transform categorical features to numerics. Two ways:
      • Ordinal encoding
      • One-hot encoding (OHE) \(\leftarrow\) recommended in most cases
        • Create binary columns for each category in the feature

Ordinal encoding

How?

# FIXME: load data
from sklearn.preprocessing import OrdinalEncoder

encode = OrdinalEncoder()
encode.fit(X_train)
X_train_ord = encode.transform(X_train)  # in this example X_train contains a single categorical feature
X_test_ord = encode.transform(X_test)
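The printed scores below presumably come from a loop like this sketch (the DecisionTreeClassifier, the validation split X_valid_ord / y_valid, and the test split are assumptions, not shown in these notes):

from sklearn.tree import DecisionTreeClassifier

for max_depth in range(1, 7):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=123)
    tree.fit(X_train_ord, y_train)
    print(f"Validation accuracy (max_depth={max_depth}):",
          round(tree.score(X_valid_ord, y_valid), 3))

# Refit with the best depth and evaluate once on the test set
best_tree = DecisionTreeClassifier(max_depth=5, random_state=123)
best_tree.fit(X_train_ord, y_train)
print("Test accuracy:", round(best_tree.score(X_test_ord, y_test), 3))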
Validation accuracy (max_depth=1): 0.81
Validation accuracy (max_depth=2): 0.81
Validation accuracy (max_depth=3): 0.833
Validation accuracy (max_depth=4): 0.833
Validation accuracy (max_depth=5): 0.905
Validation accuracy (max_depth=6): 0.881
Test accuracy: 0.762

We typically expect \(E_{train} < E_{validation} < E_{test} < E_{deployment}\).

Cons

  • Might impose unrealistic ordinality on the data
    • i.e. the implied distances between categories do not necessarily make sense
    • In the example below, French and Hindi end up closer together than French and Spanish
df = pd.DataFrame(X_train_ord, …)
pd.concat([X_train, df], axis=1)  # compare the original categories with their encoded values side by side
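A minimal sketch of the issue (toy column; by default OrdinalEncoder orders the categories alphabetically):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"language": ["French", "Hindi", "Spanish"]})
OrdinalEncoder().fit_transform(toy)
# array([[0.], [1.], [2.]])
# French (0) is encoded as closer to Hindi (1) than to Spanish (2),
# an artefact of alphabetical ordering rather than a real relationship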

One-hot encoding (OHE)

How?

from sklearn.preprocessing import OneHotEncoder

encode = OneHotEncoder(handle_unknown="ignore", sparse=False, dtype="int")
encode.fit(X_train)

X_train_ohe = encode.transform(X_train)  # in this example X_train contains a single categorical feature
encode.categories_                       # the categories learned from the training data
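To see the binary columns this creates (a sketch following the visualization pattern above; get_feature_names() is the method in the sklearn version these notes use, newer versions call it get_feature_names_out()):

X_train_ohe = encode.transform(X_train)    # dense array because sparse=False
pd.DataFrame(
    X_train_ohe,
    columns=encode.get_feature_names(),    # one binary column per category
    index=X_train.index,
)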

Combining ALL using Pipeline

Purpose

  • To combine preprocessing with cross-validation correctly
  • To avoid information from the validation folds leaking into the preprocessing (which happens if we build X_train_scaled from all of X_train before cross-validation)

How?

FIXME: Picture for visualization

# option 1
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),  # ← the earlier steps must be transformers
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor()),           # ← the last step has to be a model
    ]
)
# option 2: shorthand
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor(),
)

pipe
# The step names are derived automatically from the lower-cased class names
# e.g. SimpleImputer → "simpleimputer"
# Training only
pipe.fit(X_train, y_train)
pipe.predict(X_train)

# Cross-validation: the imputer and scaler are re-fit on each training fold
from sklearn.model_selection import cross_validate

cross_validate(pipe, X_train, y_train, return_train_score=True)
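A small usage note (a sketch, assuming the objects above): cross_validate returns a dict of arrays, which is convenient to summarize as a DataFrame.

import pandas as pd

scores = cross_validate(pipe, X_train, y_train, return_train_score=True)
pd.DataFrame(scores).mean()   # mean fit time, score time, train and validation scores across folds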

Cons

  • All features are forced to go through the same transformations
    • We want to apply OHE on categorical features, but NOT numeric features
    • We want to apply scaling on numeric features, but NOT categorical features

Combining ALL using ColumnTransformer + Pipeline (this is what we will mostly use)

Purpose

  • In general, we want to apply different preprocessing/transformations on different features
    • For numeric features: Imputation + Scaling
    • For categorical features: Imputation + One-hot encoding

How?

FIXME: Picture for visualization

1) Identify the feature types in the dataset, for example:

numeric_feats = [<colname>, …]
categorical_feats = [<colname>, …]
passthrough_feats = [<colname>, …]
drop_feats  = [<colname>, …] # dropped for simplicity and demonstration

2) Apply the appropriate transformations to each group of columns:

# option 1
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), numeric_feats),
        ("onehot", OneHotEncoder(sparse=False), categorical_feats),
    ]
)
# option 2: shorthand
from sklearn.compose import make_column_transformer

ct = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(sparse=False), categorical_feats),
    ("passthrough", passthrough_feats),
    ("drop", drop_feats),   # ← these columns would be dropped even without this line (remainder="drop" is the default)
)
ct
X_train_tran_array = ct.fit_transform(X_train)  # returns a np.ndarray, not a DataFrame

column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names().tolist()
    + passthrough_feats
)
print(column_names)
X_train_tran = pd.DataFrame(X_train_tran_array, columns=column_names)
X_train_tran
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pipe = make_pipeline(ct, SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)
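As with the plain Pipeline above, the combined object can be cross-validated directly (a sketch, assuming the objects defined above); the ColumnTransformer is re-fit on each training fold, so nothing leaks from the validation folds.

from sklearn.model_selection import cross_validate

cross_validate(pipe, X_train, y_train, return_train_score=True)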

Pros

  • Builds all our transformations into a single object
    • e.g. we cannot forget to apply one of the transformations to the test data

Cons

  • Problem with cross_validate?