Data Preprocessing

Motivating example: models cannot be fit on data that contains missing values.

# X_train.info() to visualize the problem
# knn = KNeighborsRegressor()
# knn.fit(X_train, y_train)
# ValueError: ...
Imputation
Purpose: tackling missing values
- Models are not able to deal with missing values (NaNs)
  - See the example at the top of this section
- Possible solutions
  - Delete the rows
    - Cons: not good if the dataset is small
  - Imputation
    - for Categories: fill with "missing" or the mode of the training data (see the sketch after this list)
    - for Numerics: fill with the mean or median of the training data
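The How? section below covers the numeric case; for categorical features, a minimal sketch (toy data, not from the notes) of the two fill strategies:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_toy = pd.DataFrame({"language": ["French", "Spanish", np.nan, "Hindi", "French"]})

imp_const = SimpleImputer(strategy="constant", fill_value="missing")  # fill with "missing"
imp_mode = SimpleImputer(strategy="most_frequent")                    # fill with the mode
print(imp_const.fit_transform(X_toy))  # the NaN becomes "missing"
print(imp_mode.fit_transform(X_toy))   # the NaN becomes "French" (the mode of the column)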
How?
prepare df
import pandas as pd
df = pd.read_csv("data/canada_usa_cities.csv")
df
|   | longitude | latitude | country |
|---|---|---|---|
| 0 | -130.0437 | 55.9773 | USA |
| 1 | -134.4197 | 58.3019 | USA |
| 2 | -123.0780 | 48.9854 | USA |
| 3 | -122.7436 | 48.9881 | USA |
| 4 | -122.2691 | 48.9951 | USA |
| ... | ... | ... | ... |
| 204 | -72.7218 | 45.3990 | Canada |
| 205 | -66.6458 | 45.9664 | Canada |
| 206 | -79.2506 | 42.9931 | Canada |
| 207 | -72.9406 | 45.6275 | Canada |
| 208 | -79.4608 | 46.3092 | Canada |
209 rows × 3 columns
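The notes reuse X_train, X_test, y_train, y_test below without showing the split. A minimal sketch, assuming country is the target column for this dataset (other parts of the notes may use a different dataset and target):

from sklearn.model_selection import train_test_split

X = df.drop(columns=["country"])
y = df["country"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)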
from sklearn.preprocessing import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
pd.DataFrame(X_train_imp, columns=X_train.columns)  # to visualize the result

Scaling
Purpose: putting numeric features on comparable scales
- Features on different scales are a huge problem for \(k\)-NN and SVM
  - Distance is dominated by the features with larger values
  - Features with smaller values are effectively ignored, even though they can be highly informative! (see the sketch after this list)
- Not a problem for DecisionTree and Dummy, though
  - DecisionTree looks at features one by one
  - Dummy only looks at the target y
- Our models should not be sensitive to feature scales
- Possible solutions
  - Normalization: rescale the range to [0, 1]
    (value - min) / (max - min)
  - Standardization: standardize the values s.t. the sample (mean, sd) = (0, 1)
    (value - sample_mean) / sample_sd
  - FIXME: Other two
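A toy sketch with made-up numbers (not from the notes) showing why unscaled features break distance-based models: the large-scale feature dominates the Euclidean distance, and rescaling restores the influence of the small-scale one.

import numpy as np

# two made-up examples: [salary, years of experience]
a = np.array([60000, 2])
b = np.array([62000, 20])

# unscaled: the distance is almost entirely the salary difference
print(np.linalg.norm(a - b))   # ≈ 2000.08; the 18-year experience gap barely matters

# crude manual rescaling just for illustration (use StandardScaler/MinMaxScaler in practice)
a_scaled = np.array([60000 / 10000, 2 / 10])
b_scaled = np.array([62000 / 10000, 20 / 10])
print(np.linalg.norm(a_scaled - b_scaled))   # ≈ 1.81; both features now contribute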
How?
# FIXME: load data

Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
pd.DataFrame(X_train_scaled, columns=X_train.columns)  # to visualize the result

Normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
pd.DataFrame(X_train_scaled, columns=X_train.columns)  # to visualize the result

One-hot encoding
Purpose: tackling categorical variables
- In scikit-learn, most algorithms require numeric inputs
  - e.g. for \(k\)-NN, unable to calculate distances
  - sklearn's DecisionTree does not support categorical features, although theoretically it should work
  - see below for an example (ValueError: Cannot use median strategy with non-numeric data…)
- Possible solutions
  - Drop the column(s) (not recommended)
    - those columns might be relevant to the target
  - Transform categorical features to numerics. Two ways:
    - Ordinal encoding
    - One-hot encoding (OHE) \(\leftarrow\) recommended in most cases
      - Creates binary columns for each category of the feature
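A minimal sketch (toy data, not from the notes) reproducing the error quoted above: numeric imputation strategies such as median cannot be fit on a non-numeric column.

import pandas as pd
from sklearn.impute import SimpleImputer

X_toy = pd.DataFrame({"language": ["French", "Spanish", None, "Hindi"]})
imputer = SimpleImputer(strategy="median")
imputer.fit(X_toy)
# ValueError: Cannot use median strategy with non-numeric data: ...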
Ordinal encoding
How?
# FIXME: load data
from sklearn.preprocessing import OrdinalEncoder
encode = OrdinalEncoder()
encode.fit(X_train)
X_train_ord = encode.transform(X_train)  # using a single categorical feature as an example
X_test_ord = encode.transform(X_test)

Validation accuracy (max_depth=1): 0.81
Validation accuracy (max_depth=2): 0.81
Validation accuracy (max_depth=3): 0.833
Validation accuracy (max_depth=4): 0.833
Validation accuracy (max_depth=5): 0.905
Validation accuracy (max_depth=6): 0.881
Test accuracy: 0.762
We typically expect \(E_{train} < E_{validation} < E_{test} < E_{deployment}\).
Cons
- Might impose unrealistic ordinality on the data
  - i.e. the implied distances between categories do not necessarily make sense
  - In the examples below, French and Hindi end up closer than French and Spanish
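A toy sketch (made-up data, not from the notes) of this unrealistic ordinality: OrdinalEncoder assigns integer codes in alphabetical order, so the encoded distance between French and Hindi is smaller than between French and Spanish.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_toy = pd.DataFrame({"language": ["French", "Hindi", "Spanish"]})
enc = OrdinalEncoder()
print(enc.fit_transform(X_toy).ravel())   # [0. 1. 2.] (alphabetical order)
# implied distances: French↔Hindi = 1, French↔Spanish = 2,
# which a distance-based model would treat as meaningful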
df = pd.DataFrame(X_train_ord, …)
pd.concat([X_train, df], axis=1)

One-hot encoding (OHE)
How?
from sklearn.preprocessing import OneHotEncoder
encode = OneHotEncoder(handle_unknown="ignore", sparse=False, dtype="int")
encode.fit(X_train)
X_train_ohe = encode.transform(X_train)  # using a single categorical feature as an example
encode.categories_
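A toy sketch (made-up data, not from the notes, following the older scikit-learn API used throughout these notes) showing the binary columns OHE produces, one per category:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_toy = pd.DataFrame({"language": ["French", "Hindi", "Spanish", "French"]})
enc = OneHotEncoder(sparse=False, dtype="int")
ohe = enc.fit_transform(X_toy)
pd.DataFrame(ohe, columns=enc.get_feature_names(["language"]))
#    language_French  language_Hindi  language_Spanish
# 0                1               0                 0
# 1                0               1                 0
# 2                0               0                 1
# 3                1               0                 0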
Combining ALL using Pipeline
Purpose
- To allow preprocessing + cross-validation
- To avoid training info leaking into the validation folds (e.g. via X_train_scaled being computed on the full training set; see the sketch below)
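A minimal sketch contrasting the leaky approach with the pipeline approach (assuming X_train/y_train with a numeric target, as in the notes' k-NN regression example):

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# leaky: the scaler is fit on ALL of X_train, so the validation folds
# influence the transformation applied to the training folds
X_train_scaled = StandardScaler().fit_transform(X_train)
cross_validate(KNeighborsRegressor(), X_train_scaled, y_train)

# leak-free: inside each split, the pipeline refits the scaler on the training folds only
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
cross_validate(pipe, X_train, y_train)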
How?
FIXME: Picture for visualization
Two options:
# option 1
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),  # ← the earlier steps should be transformers
        ("scaler", StandardScaler()),
        ("regressor", KNeighborsRegressor()),  # ← the last step has to be a model
    ]
)
# option 2: Shorthand
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KNeighborsRegressor()
)
pipe
# The step names are automatically derived from the lowercased class names
# e.g. SimpleImputer → "simpleimputer"

# Training only
pipe.fit(X_train, y_train)
pipe.predict(X_train)
# Cross-validation
from sklearn.model_selection import cross_validate
cross_validate(pipe, X_train, y_train, return_train_score=True)

Cons
- All features are forced to go through the same transformations
  - We want to apply OHE to categorical features, but NOT to numeric features
  - We want to apply scaling to numeric features, but NOT to categorical features
Combining ALL using ColumnTransformer + Pipeline (what you will mostly use)
Purpose
- In general, we want to apply different preprocessing/transformations to different features
  - For numeric features: Imputation + Scaling
  - For categorical features: Imputation + One-hot encoding
How?
FIXME: Picture for visualization
1) Identify the feature types in the dataset, for example:
numeric_feats = [<colname>, …]
categorical_feats = [<colname>, …]
passthrough_feats = [<colname>, …]
drop_feats = [<colname>, …]  # for simplicity and demonstration
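For illustration only, a set of hypothetical column names (not from the notes); substitute the columns of your own dataset:

numeric_feats = ["age", "hours_per_week"]        # impute + scale
categorical_feats = ["occupation", "education"]  # impute + one-hot encode
passthrough_feats = ["is_member"]                # hypothetical binary column kept as-is
drop_feats = ["free_text_comments"]              # hypothetical column dropped for simplicity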
2) Apply the transformations to the appropriate columns

Two options:
# option 1
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), numeric_feats),
        ("onehot", OneHotEncoder(sparse=False), categorical_feats),
    ]
)

# option 2: Shorthand
from sklearn.compose import make_column_transformer
ct = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OneHotEncoder(sparse=False), categorical_feats),
    ("passthrough", passthrough_feats),
    ("drop", drop_feats),  # ← these columns would be dropped even without this line
)
ct

X_train_tran_array = ct.fit_transform(X_train)  # returns a np.ndarray
column_names = (
    numeric_feats
    + ct.named_transformers_["onehotencoder"].get_feature_names().tolist()
    + passthrough_feats
)
print(column_names)

X_train_tran = pd.DataFrame(X_train_tran_array, columns=column_names)
X_train_tran

from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
pipe = make_pipeline(ct, SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)

Pros
- Builds all our transformations together into one object
  - e.g. we won't forget to apply a transformation to the test data
Cons:
- Problem with cross_validate?