SVM RBF (Support Vector Machine - Radial Basis Function kernel)

Author

DSCI 571 - Supervised Learning I

Use case

  • Classification and regression problems similar to those \(k\)-NN handles, i.e., tasks where nearby examples should receive similar targets

What is it?

  • Analogy-based model
    • i.e., assigns nearby points the same label
  • Similar to weighted \(k\)-NN, except the decision boundary depends only on the support vectors (the key examples in the dataset)
  • The decision boundary is defined by a subset of positive and negative examples, their weights, and a similarity measure
    • A test example is classified +ve if it looks more like the +ve examples than the -ve ones
    • The similarity metric is called a kernel
      • Popular kernel: Radial Basis Functions (RBFs); see the formula after this list
    • Decision boundary \(\sim\) a smooth version of \(k\)-NN's decision boundary
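
For reference, the standard RBF kernel and SVM decision function (textbook definitions, stated here for concreteness): the kernel scores similarity as a decaying function of squared distance, and the classifier takes a weighted vote of the support vectors.

\[ k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right) \]

\[ f(x) = \operatorname{sign}\Big(\sum_{i \in \text{SV}} \alpha_i\, y_i\, k(x_i, x) + b\Big) \]

Here \(\gamma\) is the gamma hyperparameter discussed below: larger \(\gamma\) makes similarity decay faster, so each support vector influences a smaller neighbourhood.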

How?

Classification

prepare X_train, X_test, y_train, y_test
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/canada_usa_cities.csv")

y, X = df.pop("country"), df
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

pd.concat([X_train, y_train], axis=1).head()
     longitude  latitude country
160   -76.4813   44.2307  Canada
127   -81.2496   42.9837  Canada
169   -66.0580   45.2788  Canada
188   -73.2533   45.3057  Canada
187   -67.9245   47.1652  Canada
from sklearn.svm import SVC

svm = SVC(kernel='rbf', gamma=0.01)  # RBF kernel; small gamma -> smoother, simpler boundary
svm.fit(X_train, y_train)
svm.predict(X_test)
array(['Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada',
       'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada', 'Canada',
       'Canada', 'Canada', 'USA', 'Canada', 'USA', 'Canada', 'Canada',
       'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada',
       'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'Canada',
       'Canada', 'Canada', 'Canada', 'Canada', 'USA', 'USA', 'Canada',
       'Canada', 'Canada'], dtype=object)
svm.score(X_test, y_test) # accuracy
0.8333333333333334
visualization: plot the support vectors
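
A minimal sketch, assuming matplotlib is available: scatter the training points, then circle the fitted model's support vectors (svm.support_vectors_).

import matplotlib.pyplot as plt

# training points, coloured by class
plt.scatter(X_train["longitude"], X_train["latitude"],
            c=(y_train == "Canada"), cmap="coolwarm", s=20)
# circle the support vectors; their coordinates are rows of X_train
plt.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
            facecolors="none", edgecolors="k", s=80, label="support vectors")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.legend()
plt.show()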

Regression

prepare X_train, X_test, y_train, y_test
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/quiz2-grade-toy-regression.csv")
df = df[['lab1', 'lab2', 'lab3', 'lab4', 'quiz1', 'quiz2']]

y, X = df.pop("quiz2"), df
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

pd.concat([X_train, y_train], axis=1).head()
   lab1  lab2  lab3  lab4  quiz1  quiz2
4    77    83    90    92     85     90
0    92    93    84    91     92     90
2    78    85    83    80     80     82
5    70    73    68    74     71     75
6    80    88    89    88     91     91
from sklearn.svm import SVR

svm = SVR(kernel='rbf', gamma=0.05)  # epsilon-SVR with an RBF kernel
svm.fit(X_train, y_train)
svm.predict(X_test)
array([89.39814152, 89.40557499])
svm.score(X_test, y_test)  # R^2; can be negative, i.e., worse than the DummyRegressor baseline
-0.12096790646548117
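
For context, a quick baseline comparison (a sketch): DummyRegressor predicts the mean of y_train, which scores an \(R^2\) of roughly 0 on held-out data, so the negative score above is worse than this trivial baseline.

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor()  # default strategy="mean"
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)  # R^2 near 0 on this split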

Hyperparameters

  • gamma
    • Controls model complexity: how quickly the RBF similarity decays with distance (see the sweep sketched after this list)
    • larger \(\rightarrow\) more complex
    • smaller \(\rightarrow\) less complex
  • C
    • Controls model complexity via the strength of regularization, which is inversely proportional to C
    • larger \(\rightarrow\) more complex
    • smaller \(\rightarrow\) less complex
  • Default: all features are treated as equally important
    • Which hyperparameter controls the weighting of features?
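
A minimal sketch of how gamma drives complexity, assuming the classification split from the cities example is in scope (the gamma values are illustrative): training accuracy keeps climbing as gamma grows, while cross-validation accuracy eventually drops, i.e., the usual underfit-to-overfit pattern.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for gamma in [0.001, 0.01, 0.1, 1.0, 10.0]:  # illustrative values
    model = SVC(kernel='rbf', gamma=gamma)
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    train_acc = model.fit(X_train, y_train).score(X_train, y_train)
    print(f"gamma={gamma}: train={train_acc:.2f}, cv={cv_acc:.2f}")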

Pros

  • Prediction-time and memory costs are lower than \(k\)-NN's, since only the support vectors (not the whole training set) are needed to make predictions
    • No significant difference for small datasets
    • But a huge speed and memory difference for large datasets
  • Usually more accurate than \(k\)-NN

Cons

  • Fit time scales at least quadratically with the number of training examples (see the ?SVC docstring below), so fitting is impractical beyond tens of thousands of samples
  • The two hyperparameters gamma and C need tuning

Remarks

  • svm.support_ gives the indices of support vectors
  • To optimize the two hyperparameters gamma and C (a sketch follows this list):
    • sklearn.model_selection.GridSearchCV
    • sklearn.model_selection.RandomizedSearchCV
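
A minimal GridSearchCV sketch over gamma and C, again assuming the cities classification split is in scope (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "gamma": [0.001, 0.01, 0.1, 1.0],  # illustrative values
    "C": [0.1, 1.0, 10.0, 100.0],
}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_, grid.best_score_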

Curse of dimensionality

  • If there are too many irrelevant features, the model can get confused:
    • accidental similarity swamps out meaningful similarity
    • performance can degrade toward random guessing, i.e., toward a dummy classifier (a sketch of this effect follows)
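
A rough sketch of the effect, assuming the cities split is in scope: padding the two real features with 100 random noise features typically drags the RBF SVM's test accuracy down toward the dummy baseline.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(123)
X_train_noisy = np.hstack([X_train, rng.normal(size=(len(X_train), 100))])
X_test_noisy = np.hstack([X_test, rng.normal(size=(len(X_test), 100))])

svm_noisy = SVC(kernel='rbf', gamma=0.01).fit(X_train_noisy, y_train)
dummy = DummyClassifier().fit(X_train, y_train)  # predicts the most frequent class
print(svm_noisy.score(X_test_noisy, y_test), dummy.score(X_test, y_test))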

?SVC

Init signature:
SVC(
    *,
    C=1.0,
    kernel='rbf',
    degree=3,
    gamma='scale',
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight=None,
    verbose=False,
    max_iter=-1,
    decision_function_shape='ovr',
    break_ties=False,
    random_state=None,
)
Docstring:     
C-Support Vector Classification.
The implementation is based on libsvm. The fit time scales at least
quadratically with the number of samples and may be impractical
beyond tens of thousands of samples. For large datasets
consider using :class:`~sklearn.svm.LinearSVC` or
:class:`~sklearn.linear_model.SGDClassifier` instead, possibly after a
:class:`~sklearn.kernel_approximation.Nystroem` transformer or
other :ref:`kernel_approximation`.
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided
kernel functions and how `gamma`, `coef0` and `degree` affect each
other, see the corresponding section in the narrative documentation:
:ref:`svm_kernels`.
Read more in the :ref:`User Guide <svm_classification>`.
Parameters
----------
C : float, default=1.0
    Regularization parameter. The strength of the regularization is
    inversely proportional to C. Must be strictly positive. The penalty
    is a squared l2 penalty.
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
    Specifies the kernel type to be used in the algorithm. If
    none is given, 'rbf' will be used. If a callable is given it is used to
    pre-compute the kernel matrix from data matrices; that matrix should be
    an array of shape ``(n_samples, n_samples)``. For an intuitive
    visualization of different kernel types see
    :ref:`sphx_glr_auto_examples_svm_plot_svm_kernels.py`.
degree : int, default=3
    Degree of the polynomial kernel function ('poly').
    Must be non-negative. Ignored by all other kernels.
gamma : {'scale', 'auto'} or float, default='scale'
    Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
    - if ``gamma='scale'`` (default) is passed then it uses
      1 / (n_features * X.var()) as value of gamma,
    - if 'auto', uses 1 / n_features
    - if float, must be non-negative.
    .. versionchanged:: 0.22
       The default value of ``gamma`` changed from 'auto' to 'scale'.
coef0 : float, default=0.0
    Independent term in kernel function.
    It is only significant in 'poly' and 'sigmoid'.
shrinking : bool, default=True
    Whether to use the shrinking heuristic.
    See the :ref:`User Guide <shrinking_svm>`.
probability : bool, default=False
    Whether to enable probability estimates. This must be enabled prior
    to calling `fit`, will slow down that method as it internally uses
    5-fold cross-validation, and `predict_proba` may be inconsistent with
    `predict`. Read more in the :ref:`User Guide <scores_probabilities>`.
tol : float, default=1e-3
    Tolerance for stopping criterion.
cache_size : float, default=200
    Specify the size of the kernel cache (in MB).
class_weight : dict or 'balanced', default=None
    Set the parameter C of class i to class_weight[i]*C for
    SVC. If not given, all classes are supposed to have
    weight one.
    The "balanced" mode uses the values of y to automatically adjust
    weights inversely proportional to class frequencies in the input data
    as ``n_samples / (n_classes * np.bincount(y))``.
verbose : bool, default=False
    Enable verbose output. Note that this setting takes advantage of a
    per-process runtime setting in libsvm that, if enabled, may not work
    properly in a multithreaded context.
max_iter : int, default=-1
    Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape : {'ovo', 'ovr'}, default='ovr'
    Whether to return a one-vs-rest ('ovr') decision function of shape
    (n_samples, n_classes) as all other classifiers, or the original
    one-vs-one ('ovo') decision function of libsvm which has shape
    (n_samples, n_classes * (n_classes - 1) / 2). However, note that
    internally, one-vs-one ('ovo') is always used as a multi-class strategy
    to train models; an ovr matrix is only constructed from the ovo matrix.
    The parameter is ignored for binary classification.
    .. versionchanged:: 0.19
        decision_function_shape is 'ovr' by default.
    .. versionadded:: 0.17
       *decision_function_shape='ovr'* is recommended.
    .. versionchanged:: 0.17
       Deprecated *decision_function_shape='ovo' and None*.
break_ties : bool, default=False
    If true, ``decision_function_shape='ovr'``, and number of classes > 2,
    :term:`predict` will break ties according to the confidence values of
    :term:`decision_function`; otherwise the first class among the tied
    classes is returned. Please note that breaking ties comes at a
    relatively high computational cost compared to a simple predict.
    .. versionadded:: 0.22
random_state : int, RandomState instance or None, default=None
    Controls the pseudo random number generation for shuffling the data for
    probability estimates. Ignored when `probability` is False.
    Pass an int for reproducible output across multiple function calls.
    See :term:`Glossary <random_state>`.
Attributes
----------
class_weight_ : ndarray of shape (n_classes,)
    Multipliers of parameter C for each class.
    Computed based on the ``class_weight`` parameter.
classes_ : ndarray of shape (n_classes,)
    The classes labels.
coef_ : ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)
    Weights assigned to the features (coefficients in the primal
    problem). This is only available in the case of a linear kernel.
    `coef_` is a readonly property derived from `dual_coef_` and
    `support_vectors_`.
dual_coef_ : ndarray of shape (n_classes -1, n_SV)
    Dual coefficients of the support vector in the decision
    function (see :ref:`sgd_mathematical_formulation`), multiplied by
    their targets.
    For multiclass, coefficient for all 1-vs-1 classifiers.
    The layout of the coefficients in the multiclass case is somewhat
    non-trivial. See the :ref:`multi-class section of the User Guide
    <svm_multi_class>` for details.
fit_status_ : int
    0 if correctly fitted, 1 otherwise (will raise warning)
intercept_ : ndarray of shape (n_classes * (n_classes - 1) / 2,)
    Constants in decision function.
n_features_in_ : int
    Number of features seen during :term:`fit`.
    .. versionadded:: 0.24
feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.
    .. versionadded:: 1.0
n_iter_ : ndarray of shape (n_classes * (n_classes - 1) // 2,)
    Number of iterations run by the optimization routine to fit the model.
    The shape of this attribute depends on the number of models optimized
    which in turn depends on the number of classes.
    .. versionadded:: 1.1
support_ : ndarray of shape (n_SV)
    Indices of support vectors.
support_vectors_ : ndarray of shape (n_SV, n_features)
    Support vectors. An empty array if kernel is precomputed.
n_support_ : ndarray of shape (n_classes,), dtype=int32
    Number of support vectors for each class.
probA_ : ndarray of shape (n_classes * (n_classes - 1) / 2)
probB_ : ndarray of shape (n_classes * (n_classes - 1) / 2)
    If `probability=True`, it corresponds to the parameters learned in
    Platt scaling to produce probability estimates from decision values.
    If `probability=False`, it's an empty array. Platt scaling uses the
    logistic function
    ``1 / (1 + exp(decision_value * probA_ + probB_))``
    where ``probA_`` and ``probB_`` are learned from the dataset [2]_. For
    more information on the multiclass case and training procedure see
    section 8 of [1]_.
shape_fit_ : tuple of int of shape (n_dimensions_of_X,)
    Array dimensions of training vector ``X``.
See Also
--------
SVR : Support Vector Machine for Regression implemented using libsvm.
LinearSVC : Scalable Linear Support Vector Machine for classification
    implemented using liblinear. Check the See Also section of
    LinearSVC for more comparison element.
References
----------
.. [1] `LIBSVM: A Library for Support Vector Machines
    <http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf>`_
.. [2] `Platt, John (1999). "Probabilistic Outputs for Support Vector
    Machines and Comparisons to Regularized Likelihood Methods"
    <https://citeseerx.ist.psu.edu/doc_view/pid/42e5ed832d4310ce4378c44d05570439df28a393>`_
Examples
--------
>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
>>> clf.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])
>>> print(clf.predict([[-0.8, -1]]))
[1]

?SVR

Init signature:
SVR(
    *,
    kernel='rbf',
    degree=3,
    gamma='scale',
    coef0=0.0,
    tol=0.001,
    C=1.0,
    epsilon=0.1,
    shrinking=True,
    cache_size=200,
    verbose=False,
    max_iter=-1,
)
Docstring:     
Epsilon-Support Vector Regression.
The free parameters in the model are C and epsilon.
The implementation is based on libsvm. The fit time complexity
is more than quadratic with the number of samples which makes it hard
to scale to datasets with more than a couple of 10000 samples. For large
datasets consider using :class:`~sklearn.svm.LinearSVR` or
:class:`~sklearn.linear_model.SGDRegressor` instead, possibly after a
:class:`~sklearn.kernel_approximation.Nystroem` transformer or
other :ref:`kernel_approximation`.
Read more in the :ref:`User Guide <svm_regression>`.
Parameters
----------
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
     Specifies the kernel type to be used in the algorithm.
     If none is given, 'rbf' will be used. If a callable is given it is
     used to precompute the kernel matrix.
degree : int, default=3
    Degree of the polynomial kernel function ('poly').
    Must be non-negative. Ignored by all other kernels.
gamma : {'scale', 'auto'} or float, default='scale'
    Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
    - if ``gamma='scale'`` (default) is passed then it uses
      1 / (n_features * X.var()) as value of gamma,
    - if 'auto', uses 1 / n_features
    - if float, must be non-negative.
    .. versionchanged:: 0.22
       The default value of ``gamma`` changed from 'auto' to 'scale'.
coef0 : float, default=0.0
    Independent term in kernel function.
    It is only significant in 'poly' and 'sigmoid'.
tol : float, default=1e-3
    Tolerance for stopping criterion.
C : float, default=1.0
    Regularization parameter. The strength of the regularization is
    inversely proportional to C. Must be strictly positive.
    The penalty is a squared l2 penalty.
epsilon : float, default=0.1
     Epsilon in the epsilon-SVR model. It specifies the epsilon-tube
     within which no penalty is associated in the training loss function
     with points predicted within a distance epsilon from the actual
     value. Must be non-negative.
shrinking : bool, default=True
    Whether to use the shrinking heuristic.
    See the :ref:`User Guide <shrinking_svm>`.
cache_size : float, default=200
    Specify the size of the kernel cache (in MB).
verbose : bool, default=False
    Enable verbose output. Note that this setting takes advantage of a
    per-process runtime setting in libsvm that, if enabled, may not work
    properly in a multithreaded context.
max_iter : int, default=-1
    Hard limit on iterations within solver, or -1 for no limit.
Attributes
----------
class_weight_ : ndarray of shape (n_classes,)
    Multipliers of parameter C for each class.
    Computed based on the ``class_weight`` parameter.
    .. deprecated:: 1.2
        `class_weight_` was deprecated in version 1.2 and will be removed in 1.4.
coef_ : ndarray of shape (1, n_features)
    Weights assigned to the features (coefficients in the primal
    problem). This is only available in the case of a linear kernel.
    `coef_` is readonly property derived from `dual_coef_` and
    `support_vectors_`.
dual_coef_ : ndarray of shape (1, n_SV)
    Coefficients of the support vector in the decision function.
fit_status_ : int
    0 if correctly fitted, 1 otherwise (will raise warning)
intercept_ : ndarray of shape (1,)
    Constants in decision function.
n_features_in_ : int
    Number of features seen during :term:`fit`.
    .. versionadded:: 0.24
feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.
    .. versionadded:: 1.0
n_iter_ : int
    Number of iterations run by the optimization routine to fit the model.
    .. versionadded:: 1.1
n_support_ : ndarray of shape (1,), dtype=int32
    Number of support vectors.
shape_fit_ : tuple of int of shape (n_dimensions_of_X,)
    Array dimensions of training vector ``X``.
support_ : ndarray of shape (n_SV,)
    Indices of support vectors.
support_vectors_ : ndarray of shape (n_SV, n_features)
    Support vectors.
See Also
--------
NuSVR : Support Vector Machine for regression implemented using libsvm
    using a parameter to control the number of support vectors.
LinearSVR : Scalable Linear Support Vector Machine for regression
    implemented using liblinear.
References
----------
.. [1] `LIBSVM: A Library for Support Vector Machines
    <http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf>`_
.. [2] `Platt, John (1999). "Probabilistic Outputs for Support Vector
    Machines and Comparisons to Regularized Likelihood Methods"
    <https://citeseerx.ist.psu.edu/doc_view/pid/42e5ed832d4310ce4378c44d05570439df28a393>`_
Examples
--------
>>> from sklearn.svm import SVR
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> rng = np.random.RandomState(0)
>>> y = rng.randn(n_samples)
>>> X = rng.randn(n_samples, n_features)
>>> regr = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
>>> regr.fit(X, y)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svr', SVR(epsilon=0.2))])