k-Nearest Neighbours

Author

DSCI 571 - Supervised Learning I

Use case

Facial recognition
- e.g. represent human faces as feature vectors
- e.g. identify whether a face is on a watch list
Recommendation systems

What is it?

Analogy-Based Model
- i.e. nearby points are assigned the same label
Uses the targets \(y_\text{train}\) of the \(k\) training points in \(X_\text{train}\) nearest to \(X_\text{new}\) to predict \(y_\text{new}\)
1. gather the \(k\) nearest neighbours of \(X_\text{new}\) in \(X_\text{train}\) based on Euclidean distance (# of features = # of dimensions)
2. predict by majority vote (for classification) or by the average/median of the neighbours' \(y_\text{train}\) (for regression)
Non-parametric model
- i.e. no fixed set of learned parameters; the stored training data itself is the model
- stores O(n) worth of data to make predictions
- (in contrast to parametric models, which store only a fixed number of parameters plus the formula)
Lazy algorithm: fitting takes almost no time, since it only stores (and possibly indexes) the training data
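
The two prediction steps described in this section can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_predict`, its `task` argument, and the toy data are all made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict the target of one new point from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances (stable sort keeps training order on ties).
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbour_targets = y_train[nearest]

    # Step 2: majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(neighbour_targets.tolist()).most_common(1)[0][0]
    return float(neighbour_targets.mean())


# Hypothetical toy data: three points near the origin, two far away.
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["A", "A", "A", "B", "B"]
values = [1.0, 2.0, 3.0, 10.0, 12.0]

print(knn_predict(X_toy, labels, [0.5, 0.5], k=3))                     # A
print(knn_predict(X_toy, values, [0.5, 0.5], k=3, task="regression"))  # 2.0
```

Note that there is no training step at all: the "model" is just `X_train` and `y_train`, which is why every prediction has to scan all n stored points.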

How?

Classification

prepare X_train, X_test, y_train, y_test

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/canada_usa_cities.csv")

y, X = df.pop("country"), df
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

pd.concat([X_train, y_train], axis=1).head()

	longitude	latitude	country
160	-76.4813	44.2307	Canada
127	-81.2496	42.9837	Canada
169	-66.0580	45.2788	Canada
188	-73.2533	45.3057	Canada
187	-67.9245	47.1652	Canada

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
knn.predict(X_test)

array(['Canada', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada',
       'Canada', 'USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada',
       'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada', 'Canada',
       'Canada', 'Canada', 'USA', 'Canada', 'Canada', 'USA', 'Canada',
       'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada', 'Canada',
       'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada', 'Canada'],
      dtype=object)

knn.score(X_test, y_test) # accuracy

0.7142857142857143

## FIXME: visualization

Regression

prepare X_train, X_test, y_train, y_test

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/quiz2-grade-toy-regression.csv")
df = df[['lab1', 'lab2', 'lab3', 'lab4', 'quiz1', 'quiz2']]

y, X = df.pop("quiz2"), df
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

pd.concat([X_train, y_train], axis=1).head()

	lab1	lab2	lab3	lab4	quiz1	quiz2
4	77	83	90	92	85	90
0	92	93	84	91	92	90
2	78	85	83	80	80	82
5	70	73	68	74	71	75
6	80	88	89	88	91	91

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)
knn.predict(X_test)

array([90., 90.])

knn.score(X_test, y_test) # R^2 (it can be -ve, i.e. worse than always predicting the mean, as DummyRegressor does)

-0.25
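
To see how \(R^2\) can go negative, it helps to compute it by hand as \(1 - SS_\text{res} / SS_\text{tot}\). The sketch below uses two made-up targets and predictions (not the quiz data above) and checks the result against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, chosen so the predictions are worse than the mean.
y_true = np.array([80.0, 90.0])
y_pred = np.array([90.0, 80.0])

ss_res = ((y_true - y_pred) ** 2).sum()         # squared error of the model: 200
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # squared error of the mean:   50
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                  # -3.0
print(r2_score(y_true, y_pred))   # -3.0

# Predicting the mean of the evaluated targets gives exactly 0.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
print(r2_mean)                    # 0.0
```

A `DummyRegressor` predicts the mean of the training targets rather than of the test targets, so its test \(R^2\) is usually close to 0 but not necessarily exactly 0.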

Hyperparameters

n_neighbors
- larger \(\rightarrow\) under-fitting
- smaller (e.g. 1) \(\rightarrow\) over-fitting
- Default: 5
weights
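- weighting of each neighbour's contribution to the prediction (not a weighting of features).
- Default: 'uniform', i.e. all \(k\) neighbours count equally; 'distance' gives closer neighbours more influence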
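
A small experiment can make both hyperparameters concrete. The sketch below is illustrative only: it uses synthetic data from `make_moons` and three hand-picked 1-D points rather than the datasets above, and the particular values of `k` are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# --- n_neighbors: synthetic two-moons data, used only for illustration ---
X, y = make_moons(n_samples=400, noise=0.3, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

scores = {}
for k in [1, 5, 25, 100, 320]:  # 320 = every training point is a neighbour
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:>3}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")

# --- weights: one close 'a' neighbour versus two distant 'b' neighbours ---
X_toy = [[0.0], [1.0], [1.2]]
y_toy = ["a", "b", "b"]
query = [[0.1]]

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
by_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

pred_uniform = uniform.predict(query)[0]       # 'b': two votes against one
pred_distance = by_distance.predict(query)[0]  # 'a': 1/0.1 outweighs 1/0.9 + 1/1.1
print(pred_uniform, pred_distance)
```

With `k=1` the training accuracy is perfect because each training point is its own nearest neighbour (over-fitting), while with `k` equal to the training set size the model predicts the majority class everywhere (under-fitting).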

Pros

Easy to understand and interpret
Simple hyperparameters for controlling bias-variance tradeoff
Can learn very complex functions given a sufficient amount of data
Lazy learning: takes almost no time to fit

Cons

Takes a long time to make predictions, so it is poorly suited to real-time applications
Often less accurate than modern approaches
Does not work well in the following scenarios:
- Datasets with many features; or,
- Sparse datasets: values in most features are mostly 0

Remarks

Curse of dimensionality

If there are too many irrelevant features, \(k\)-NN models can get confused.
- accidental similarity swamps out meaningful similarity
- \(k\)-NN might degrade to random guessing \(\rightarrow\) like a dummy classifier
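
The effect of irrelevant features can be reproduced on synthetic data. In the sketch below, two features carry the class signal and 1000 pure-noise features are appended; the sizes, the class shift of 3.0, and the helper name `holdout_accuracy` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(123)
n = 400
y = np.repeat([0, 1], n // 2)

# Two informative features: class 1 is shifted by 3 in both dimensions.
X_informative = 3.0 * y[:, None] + rng.normal(size=(n, 2))
# 1000 irrelevant features: pure noise, unrelated to the class.
X_noise = rng.normal(size=(n, 1000))
X_with_noise = np.hstack([X_informative, X_noise])


def holdout_accuracy(X):
    """Fit 5-NN on a fixed stratified split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=123, stratify=y
    )
    return KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)


acc_clean = holdout_accuracy(X_informative)
acc_noisy = holdout_accuracy(X_with_noise)
print(f"informative features only: {acc_clean:.3f}")
print(f"with 1000 noise features:  {acc_noisy:.3f}")
```

The Euclidean distance sums over all features, so the 1000 noise dimensions dominate the two informative ones and the nearest neighbours are chosen largely by chance.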

?KNeighborsClassifier

?KNeighborsClassifier

Init signature:
KNeighborsClassifier(
    n_neighbors=5,
    *,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None,
)
Docstring:     
Classifier implementing the k-nearest neighbors vote.
Read more in the :ref:`User Guide <classification>`.
Parameters
----------
n_neighbors : int, default=5
    Number of neighbors to use by default for :meth:`kneighbors` queries.
weights : {'uniform', 'distance'}, callable or None, default='uniform'
    Weight function used in prediction.  Possible values:
    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.
    Refer to the example entitled
    :ref:`sphx_glr_auto_examples_neighbors_plot_classification.py`
    showing the impact of the `weights` parameter on the decision
    boundary.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
    Algorithm used to compute the nearest neighbors:
    - 'ball_tree' will use :class:`BallTree`
    - 'kd_tree' will use :class:`KDTree`
    - 'brute' will use a brute-force search.
    - 'auto' will attempt to decide the most appropriate algorithm
      based on the values passed to :meth:`fit` method.
    Note: fitting on sparse input will override the setting of
    this parameter, using brute force.
leaf_size : int, default=30
    Leaf size passed to BallTree or KDTree.  This can affect the
    speed of the construction and query, as well as the memory
    required to store the tree.  The optimal value depends on the
    nature of the problem.
p : float, default=2
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric : str or callable, default='minkowski'
    Metric to use for distance computation. Default is "minkowski", which
    results in the standard Euclidean distance when p = 2. See the
    documentation of `scipy.spatial.distance
    <https://docs.scipy.org/doc/scipy/reference/spatial.distance.html>`_ and
    the metrics listed in
    :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
    values.
    If metric is "precomputed", X is assumed to be a distance matrix and
    must be square during fit. X may be a :term:`sparse graph`, in which
    case only "nonzero" elements may be considered neighbors.
    If metric is a callable function, it takes two arrays representing 1D
    vectors as inputs and must return one value indicating the distance
    between those vectors. This works for Scipy's metrics, but is less
    efficient than passing the metric name as a string.
metric_params : dict, default=None
    Additional keyword arguments for the metric function.
n_jobs : int, default=None
    The number of parallel jobs to run for neighbors search.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
    Doesn't affect :meth:`fit` method.
Attributes
----------
classes_ : array of shape (n_classes,)
    Class labels known to the classifier
effective_metric_ : str or callble
    The distance metric used. It will be same as the `metric` parameter
    or a synonym of it, e.g. 'euclidean' if the `metric` parameter set to
    'minkowski' and `p` parameter set to 2.
effective_metric_params_ : dict
    Additional keyword arguments for the metric function. For most metrics
    will be same with `metric_params` parameter, but may also contain the
    `p` parameter value if the `effective_metric_` attribute is set to
    'minkowski'.
n_features_in_ : int
    Number of features seen during :term:`fit`.
    .. versionadded:: 0.24
feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.
    .. versionadded:: 1.0
n_samples_fit_ : int
    Number of samples in the fitted data.
outputs_2d_ : bool
    False when `y`'s shape is (n_samples, ) or (n_samples, 1) during fit
    otherwise True.
See Also
--------
RadiusNeighborsClassifier: Classifier based on neighbors within a fixed radius.
KNeighborsRegressor: Regression based on k-nearest neighbors.
RadiusNeighborsRegressor: Regression based on neighbors within a fixed radius.
NearestNeighbors: Unsupervised learner for implementing neighbor searches.
Notes
-----
See :ref:`Nearest Neighbors <neighbors>` in the online documentation
for a discussion of the choice of ``algorithm`` and ``leaf_size``.
.. warning::
   Regarding the Nearest Neighbors algorithms, if it is found that two
   neighbors, neighbor `k+1` and `k`, have identical distances
   but different labels, the results will depend on the ordering of the
   training data.
https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
Examples
--------
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[0.666... 0.333...]]
File:           ~/miniconda3/envs/571/lib/python3.11/site-packages/sklearn/neighbors/_classification.py
Type:           ABCMeta
Subclasses:
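
The `p` parameter in the help text above can change which training point counts as nearest. The sketch below computes the Minkowski distance by hand and then checks the effect with `KNeighborsClassifier`; the helper `minkowski`, the two training points, and their labels are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return (np.abs(a - b) ** p).sum() ** (1 / p)


origin = [0, 0]
print(minkowski(origin, [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski(origin, [3, 4], p=2))  # 5.0 (Euclidean)

# Two hypothetical training points: one diagonal, one on an axis.
X_toy = [[3, 4], [0, 6]]
y_toy = ["diagonal", "axis"]

knn_l2 = KNeighborsClassifier(n_neighbors=1, p=2).fit(X_toy, y_toy)
knn_l1 = KNeighborsClassifier(n_neighbors=1, p=1).fit(X_toy, y_toy)

pred_l2 = knn_l2.predict([origin])[0]  # 'diagonal': distance 5 beats 6
pred_l1 = knn_l1.predict([origin])[0]  # 'axis':     distance 6 beats 7
print(pred_l2, pred_l1)
```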

?KNeighborsRegressor

?KNeighborsRegressor

Init signature:
KNeighborsRegressor(
    n_neighbors=5,
    *,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None,
)
Docstring:     
Regression based on k-nearest neighbors.
The target is predicted by local interpolation of the targets
associated of the nearest neighbors in the training set.
Read more in the :ref:`User Guide <regression>`.
.. versionadded:: 0.9
Parameters
----------
n_neighbors : int, default=5
    Number of neighbors to use by default for :meth:`kneighbors` queries.
weights : {'uniform', 'distance'}, callable or None, default='uniform'
    Weight function used in prediction.  Possible values:
    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.
    Uniform weights are used by default.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
    Algorithm used to compute the nearest neighbors:
    - 'ball_tree' will use :class:`BallTree`
    - 'kd_tree' will use :class:`KDTree`
    - 'brute' will use a brute-force search.
    - 'auto' will attempt to decide the most appropriate algorithm
      based on the values passed to :meth:`fit` method.
    Note: fitting on sparse input will override the setting of
    this parameter, using brute force.
leaf_size : int, default=30
    Leaf size passed to BallTree or KDTree.  This can affect the
    speed of the construction and query, as well as the memory
    required to store the tree.  The optimal value depends on the
    nature of the problem.
p : float, default=2
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric : str or callable, default='minkowski'
    Metric to use for distance computation. Default is "minkowski", which
    results in the standard Euclidean distance when p = 2. See the
    documentation of `scipy.spatial.distance
    <https://docs.scipy.org/doc/scipy/reference/spatial.distance.html>`_ and
    the metrics listed in
    :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
    values.
    If metric is "precomputed", X is assumed to be a distance matrix and
    must be square during fit. X may be a :term:`sparse graph`, in which
    case only "nonzero" elements may be considered neighbors.
    If metric is a callable function, it takes two arrays representing 1D
    vectors as inputs and must return one value indicating the distance
    between those vectors. This works for Scipy's metrics, but is less
    efficient than passing the metric name as a string.
metric_params : dict, default=None
    Additional keyword arguments for the metric function.
n_jobs : int, default=None
    The number of parallel jobs to run for neighbors search.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
    Doesn't affect :meth:`fit` method.
Attributes
----------
effective_metric_ : str or callable
    The distance metric to use. It will be same as the `metric` parameter
    or a synonym of it, e.g. 'euclidean' if the `metric` parameter set to
    'minkowski' and `p` parameter set to 2.
effective_metric_params_ : dict
    Additional keyword arguments for the metric function. For most metrics
    will be same with `metric_params` parameter, but may also contain the
    `p` parameter value if the `effective_metric_` attribute is set to
    'minkowski'.
n_features_in_ : int
    Number of features seen during :term:`fit`.
    .. versionadded:: 0.24
feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.
    .. versionadded:: 1.0
n_samples_fit_ : int
    Number of samples in the fitted data.
See Also
--------
NearestNeighbors : Unsupervised learner for implementing neighbor searches.
RadiusNeighborsRegressor : Regression based on neighbors within a fixed radius.
KNeighborsClassifier : Classifier implementing the k-nearest neighbors vote.
RadiusNeighborsClassifier : Classifier implementing
    a vote among neighbors within a given radius.
Notes
-----
See :ref:`Nearest Neighbors <neighbors>` in the online documentation
for a discussion of the choice of ``algorithm`` and ``leaf_size``.
.. warning::
   Regarding the Nearest Neighbors algorithms, if it is found that two
   neighbors, neighbor `k+1` and `k`, have identical distances but
   different labels, the results will depend on the ordering of the
   training data.
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
Examples
--------
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsRegressor
>>> neigh = KNeighborsRegressor(n_neighbors=2)
>>> neigh.fit(X, y)
KNeighborsRegressor(...)
>>> print(neigh.predict([[1.5]]))
[0.5]
File:           ~/miniconda3/envs/571/lib/python3.11/site-packages/sklearn/neighbors/_regression.py
Type:           ABCMeta
Subclasses: