# FIXME
# X_train.info() to visualize the problem
# knn = KNeighborsRegressor()
# knn.fit(X_train, y_train)
# ValueError:...
Data Preprocessing
Imputation
Purpose: tackling missing values
- Models are not able to deal with missing values (NaNs)
  - See below for example
- Possible solutions
  - Delete the rows
    - Cons: not good if dataset is small
  - Imputation
    - for Categories: fill by "missing" or mode of training data
    - for Numerics: fill by mean or median of training data
How?
prepare df
import pandas as pd
df = pd.read_csv("data/canada_usa_cities.csv")
df
# FIXME
 | longitude | latitude | country |
---|---|---|---|
0 | -130.0437 | 55.9773 | USA |
1 | -134.4197 | 58.3019 | USA |
2 | -123.0780 | 48.9854 | USA |
3 | -122.7436 | 48.9881 | USA |
4 | -122.2691 | 48.9951 | USA |
... | ... | ... | ... |
204 | -72.7218 | 45.3990 | Canada |
205 | -66.6458 | 45.9664 | Canada |
206 | -79.2506 | 42.9931 | Canada |
207 | -72.9406 | 45.6275 | Canada |
208 | -79.4608 | 46.3092 | Canada |
209 rows × 3 columns
from sklearn.impute import SimpleImputer  # SimpleImputer lives in sklearn.impute
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)  # learn the medians from the training data only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
# FIXME: visualize
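As a rough illustration of the imputation step above, here is a minimal sketch on a tiny made-up dataset. The column names `a` and `b` and all values are hypothetical and are not taken from the cities data. It shows that the median is learned from the training split and then reused on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data with one missing value in each split
X_train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = pd.DataFrame({"a": [np.nan], "b": [25.0]})

# Fitting k-NN directly on data containing NaN is expected to raise a ValueError
try:
    KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
except ValueError as err:
    print("ValueError:", str(err)[:40], "...")

# Learn the per-column median on the training data, then apply it to both splits
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
X_train_imp = imputer.transform(X_train)  # NumPy array, no NaN left
X_test_imp = imputer.transform(X_test)    # filled with the *training* median

print(pd.DataFrame(X_train_imp, columns=X_train.columns))
```

The missing test value is filled with the training median of `a` (the median of 1, 2 and 4), so nothing about the test split is used when fitting.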
Scaling
Purpose: for numeric features
- Features on different scales are a huge problem for \(k\)-NN and SVM
  - Distance is dominated by the features with larger values
  - Features with smaller values are effectively ignored, but they can be highly informative!
  - FIXME: Example?
- Though not a problem for DecisionTree and Dummy
  - DecisionTree looks at features one-by-one
  - Dummy only looks at the target Y
- Our models should not be sensitive to scales
- Possible solutions
  - Normalization: set range to [0, 1]
    - (value - min) / (max - min)
  - Standardization: standardize the values s.t. sample (mean, sd) = (0, 1)
    - (value - sample_mean) / sd
  - FIXME: Other two
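To make the "distance is dominated by the features with larger values" point concrete, here is a small sketch with made-up numbers. The two columns (a salary-like feature and a rating-like feature) and the helper `nearest_to_first` are hypothetical. The nearest neighbour of the first row changes once the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is on a scale of tens of thousands, column 1 is in [0, 5]
X = np.array([
    [50_000.0, 1.0],
    [50_500.0, 5.0],   # close in column 0, far in column 1
    [52_000.0, 1.2],   # slightly further in column 0, close in column 1
    [90_000.0, 3.0],
])

def nearest_to_first(X):
    """Index of the row closest (Euclidean distance) to row 0, excluding row 0."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_nn = nearest_to_first(X)                                     # column 0 decides alone
scaled_nn = nearest_to_first(StandardScaler().fit_transform(X))  # both columns count

print(raw_nn, scaled_nn)
```

On the raw data the second column barely affects the distance. After standardization both columns contribute, and a different row becomes the nearest neighbour.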
How?
# FIXME: load data
Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# FIXME: (to visualize the result) pd.DataFrame(X_train_scaled, columns=X_train.columns)
Normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# FIXME: (to visualize the result) pd.DataFrame(X_train_scaled, columns=X_train.columns)
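The two scalers can be checked against the formulas by hand. The sketch below uses a tiny made-up array (all values are hypothetical). Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also what `np.std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[4.0, 200.0]])

# Standardization: (value - mean) / sd, with statistics taken from the training data
mean, sd = X_train.mean(axis=0), X_train.std(axis=0)
manual_std = (X_test - mean) / sd
sk_std = StandardScaler().fit(X_train).transform(X_test)

# Normalization: (value - min) / (max - min), again using training statistics
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
manual_minmax = (X_test - lo) / (hi - lo)
sk_minmax = MinMaxScaler().fit(X_train).transform(X_test)

print(manual_std, sk_std)
print(manual_minmax, sk_minmax)
```

Because the minimum and maximum come from the training data, a test value can land outside [0, 1], as the first test feature does here.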
One-hot encoding
Purpose: tackling categorical variables
- In scikit-learn, most algorithms require numeric inputs
  - e.g. for \(k\)-NN, unable to calculate distances
  - scikit-learn's DecisionTree does not support categorical features
    - although theoretically it should work
  - see below for example (FIXME: ValueError: Cannot use median strategy with non-numeric data…)
- Possible solutions
  - Drop the column(s) (not recommended)
    - those columns might be relevant to the target
  - Transform categorical features to numerics. Two ways:
    - Ordinal encoding
    - One-hot encoding (OHE) \(\leftarrow\) recommended in most cases
      - Create binary columns for each category in the feature
Ordinal encoding
How?
# FIXME: load data
from sklearn.preprocessing import OrdinalEncoder
encode = OrdinalEncoder()
encode.fit(X_train)
X_train_ord = encode.transform(X_train)  # use one feature X_train as example
X_test_ord = encode.transform(X_test)
Validation accuracy (max_depth=1): 0.81
Validation accuracy (max_depth=2): 0.81
Validation accuracy (max_depth=3): 0.833
Validation accuracy (max_depth=4): 0.833
Validation accuracy (max_depth=5): 0.905
Validation accuracy (max_depth=6): 0.881
Test accuracy: 0.762
We typically expect \(E_{train} < E_{validation} < E_{test} < E_{deployment}\).
Cons
- Might impose an unrealistic ordering on the data
  - i.e. the distances between encoded values do not necessarily make sense
  - In the example below, French and Hindi are closer than French and Spanish
df = pd.DataFrame(X_train_ord, …)
pd.concat([X_train, df], axis=1)
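A minimal sketch of the ordinality problem, using a hypothetical single-column `language` feature (the values are made up for illustration). `OrdinalEncoder` sorts the categories and assigns consecutive integers, so the gaps between codes carry no real meaning.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical feature
X_train = pd.DataFrame({"language": ["Spanish", "French", "Hindi", "French"]})

encode = OrdinalEncoder()
X_train_ord = encode.fit_transform(X_train)

# Map each category to the integer code it received
categories = encode.categories_[0]
codes = dict(zip(categories, range(len(categories))))
print(codes)

# Distances between the codes are an artifact of alphabetical ordering
print(abs(codes["French"] - codes["Hindi"]), abs(codes["French"] - codes["Spanish"]))
```

A distance-based model would treat French as closer to Hindi than to Spanish purely because of the sort order.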
One-hot encoding (OHE)
How?
from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the sparse argument in scikit-learn >= 1.2
encode = OneHotEncoder(handle_unknown="ignore", sparse_output=False, dtype="int")
encode.fit(X_train)
X_train_ohe = encode.transform(X_train)  # use one feature X_train as example
encode.categories_
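Here is a small sketch of one-hot encoding on the same kind of hypothetical `language` feature, including a category in the test split that was never seen during fitting. To stay independent of the scikit-learn version, it keeps the default sparse output and converts it with `.toarray()` instead of passing a sparse-related argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data; "German" appears only in the test split
X_train = pd.DataFrame({"language": ["Spanish", "French", "Hindi", "French"]})
X_test = pd.DataFrame({"language": ["Hindi", "German"]})

encode = OneHotEncoder(handle_unknown="ignore", dtype=int)
encode.fit(X_train)

# One binary column per category learned from the training data
X_train_ohe = encode.transform(X_train).toarray()
X_test_ohe = encode.transform(X_test).toarray()

print(encode.categories_)
print(X_train_ohe)
print(X_test_ohe)  # the unseen category becomes a row of zeros
```

With `handle_unknown="ignore"`, an unseen category does not raise an error at transform time; it is encoded as all zeros.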
Combining ALL using Pipeline
Purpose
- To allow preprocessing + cross-validation
- To avoid information from the validation folds leaking into training (via X_train_scaled, whose scaler was fit on all of X_train)
How?
FIXME: Picture for visualization
optional
# option 1
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),  # ← the earlier steps should be transformers
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor()),  # ← the last step has to be a model
    ]
)
# option 2: Shorthand
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor()
)
pipe
# The step names are generated automatically from the lowercased class names
# e.g. SimpleImputer → "simpleimputer"
# Training only
pipe.fit(X_train, y_train)
pipe.predict(X_train)
# Cross-validation
from sklearn.model_selection import cross_validate
cross_validate(pipe, X_train, y_train, return_train_score=True)
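The pipeline steps above can be sketched end to end on synthetic data. Everything about the data here (shape, scales, fraction of missing values, random seed) is an assumption made only so the example runs on its own. Because the whole pipeline is passed to `cross_validate`, the imputer and scaler are re-fit on the training part of each fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with features on very different scales (hypothetical)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3)) * [1.0, 100.0, 1000.0]
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=100)
X_train[rng.random(X_train.shape) < 0.05] = np.nan  # knock out about 5% of the entries

pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor(),
)

# Each fold fits imputer + scaler + model on that fold's training part only
scores = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)

step_names = list(pipe.named_steps)  # auto-generated step names
print(step_names)
print(scores["test_score"].mean(), scores["train_score"].mean())
```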
Cons
- All features are forced to go through the same transformations
- We want to apply OHE on categorical features, but NOT numeric features
- We want to apply scaling on numeric features, but NOT categorical features
Combining ALL using Column Transformer + Pipeline (will mostly be used)
Purpose
- In general, we want to apply different preprocessing/transformations on different features
- For numeric features: Imputation + Scaling
- For categorical features: Imputation + One-hot encoding
How?
FIXME: Picture for visualization
1) identify the feature types in the dataset, for example,
numeric_feats = [<colname>, …]
categorical_feats = [<colname>, …]
passthrough_feats = [<colname>, …]
drop_feats = [<colname>, …]  # for simplicity and demonstration
2) apply the transformations to the appropriate columns
optional
# option 1
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), numeric_feats),
        # sparse_output replaces the sparse argument in scikit-learn >= 1.2
        ("onehot", OneHotEncoder(sparse_output=False), categorical_feats),
    ]
)
# option 2: Shorthand
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (StandardScaler(), numeric_feats),
    # sparse_output replaces the sparse argument in scikit-learn >= 1.2
    (OneHotEncoder(sparse_output=False), categorical_feats),
    ("passthrough", passthrough_feats),
    ("drop", drop_feats),  # ← the columns will be dropped even if we don't have this line
)
ct
X_train_tran_array = ct.fit_transform(X_train)  # returns a np.ndarray
column_names = (
    numeric_feats
    # get_feature_names_out replaces get_feature_names in recent scikit-learn versions
    + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
    + passthrough_feats
)
print(column_names)
X_train_tran = pd.DataFrame(X_train_tran_array, columns=column_names)
X_train_tran
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
pipe = make_pipeline(ct, SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)
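The purpose stated earlier (imputation + scaling for numeric features, imputation + one-hot encoding for categorical features) can be sketched by nesting one small pipeline per feature type inside the column transformer. The data below is made up: the coordinates only loosely resemble the cities data, and the `language` column is a hypothetical addition so that there is a categorical feature to encode.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical data with missing values in numeric and categorical columns
X_train = pd.DataFrame({
    "longitude": [-130.0, -134.4, np.nan, -122.7, -72.7, -66.6, -79.3, -72.9],
    "latitude": [55.9, 58.3, 48.9, 48.9, 45.4, np.nan, 43.0, 45.6],
    "language": ["English", "English", np.nan, "English",
                 "French", "English", "English", "French"],
}).astype({"language": object})
y_train = np.array(["USA"] * 4 + ["Canada"] * 4)

numeric_feats = ["longitude", "latitude"]
categorical_feats = ["language"]

ct = make_column_transformer(
    # numeric features: imputation + scaling
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric_feats),
    # categorical features: imputation + one-hot encoding
    (make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                   OneHotEncoder(handle_unknown="ignore")), categorical_feats),
)

pipe = make_pipeline(ct, SVC())

# The full object can be cross-validated and fit like any single estimator
scores = cross_validate(pipe, X_train, y_train, cv=2, return_train_score=True)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
print(scores["test_score"], preds)
```

`handle_unknown="ignore"` matters here because a cross-validation fold may contain a category that was absent from that fold's training part.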
Pros
- Build all our transformations together into one object
  - e.g. we cannot forget to apply a certain transformation to the test data
Cons:
- Problem with cross_validate?