
# Support Vector Machines and the Kernel Trick

Today we will be discussing a machine learning algorithm that is probably the most mathematically sophisticated of all the algorithms we have looked at so far.

Don’t let that daunt you though! Not all the details are essential for understanding the general idea, and it can certainly be used without understanding the maths, although if you are content with this then there’s not much point reading a blog like this!

This post is based on what I have learned from reading the relevant chapter of the excellent book *The Elements of Statistical Learning*.

## The Setup

The premise is exactly the same as for the perceptron algorithm: we have input data $x_1, \dots , x_N \in \mathbb{R}^k$ with class labels $y_1, \dots, y_N \in \{-1,1\}$.

In order to classify the inputs, we seek a separating hyperplane $H$. Recall that a hyperplane is just a plane that may not pass through the origin, and can be described as

$H= \{ x \in \mathbb{R}^k : \langle x, \beta \rangle =b \}$ for some $\beta \in \mathbb{R}^k$ and $b \in \mathbb{R}$.

It is handy to assume that $\beta$ is a unit vector; geometrically it is the normal vector to the hyperplane. I will refer to $b$ as the bias. If the bias is zero, then $H$ is just a regular plane, and if it is nonzero, $H$ is parallel to this plane; the bias controls how far ‘above’ or ‘below’ the origin $H$ sits. In fact if $H_0 =\{x: \langle x,\beta \rangle =0 \}$, then $H=H_0 + b\beta$.

If we define

$f: \mathbb{R}^k \to \mathbb{R}$

by

$f(x) = \langle x,\beta \rangle - b$,

then $H = f^{-1}(0)$.

Moreover, $f(x)$ is the signed distance of $x$ from $H$. Note that if $\beta$ were not assumed to be a unit vector, $f(x)$ would instead be proportional to the signed distance. If $f(x) >0$ we say that $x$ is above $H$, and if $f(x) < 0$ we say that $x$ is below $H$.

To classify a point $x$, we say it has label $1$ if above $H$, and $-1$ if below.
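As a quick sketch of this classification rule (the hyperplane here is made up purely for illustration):

```python
import numpy as np

def signed_distance(x, beta, b):
    # Signed distance of x from H = {x : <x, beta> = b},
    # valid because beta is assumed to be a unit vector.
    return np.dot(x, beta) - b

def classify(x, beta, b):
    # Label 1 if x lies above H, -1 if below.
    return 1 if signed_distance(x, beta, b) > 0 else -1

# Example hyperplane: x_1 + x_2 = 1 in R^2, with unit normal beta
beta = np.array([1.0, 1.0]) / np.sqrt(2)
b = 1.0 / np.sqrt(2)

print(classify(np.array([2.0, 2.0]), beta, b))  # 1 (above H)
print(classify(np.array([0.0, 0.0]), beta, b))  # -1 (below H)
```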

## The Problem(s)

If a separating hyperplane exists, then the perceptron algorithm will converge to one that correctly classifies all of the points.

One problem is that there are infinitely many such hyperplanes; intuitively, we would like to choose one that maximizes the distances of all the points from it. If there are points close to the decision boundary, then it is likely that unseen points from the same distribution might end up on the wrong side of the hyperplane and be misclassified. The minimum distance of all the points to $H$ is called the margin of the decision boundary, so we would like to maximize the margin.

Another problem is that the data might be too mixed up for there to be a separating hyperplane: no matter which one you choose, some points get misclassified. What we need is a way of choosing the hyperplane that does the best job of trying to separate the points.

Finally, it may be that a hyperplane is a bad approximation to the true decision boundary; we need something wigglier.

Let’s solve these problems in turn.

## Maximizing the Margin

The first step in solving a machine learning problem is to figure out how to phrase it as an optimization problem, so let’s try that.

Letting $M$ denote the margin, we would like to

maximize $M$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge M$ for $i=1, \dots, N$, $M>0$ and $\Vert \beta \Vert =1$.

Note that $y_i f(x_i) >0$ implies that $x_i$ is correctly classified since their signs must be the same for the product to be positive.

Now we just do a little finessing: notice that

$y_if(x_i) \ge M$, ie $y_i(\langle x_i, \beta \rangle -b) \ge M$

is the same as

$y_i(\langle x_i, \frac{\beta}{M} \rangle -\frac{b}{M}) \ge 1$,

and so, replacing $\beta$ and $b$ by $\beta/M$ and $b/M$, the optimization problem is equivalent to:

maximize $M$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$, $M>0$ and $\Vert \beta \Vert =1/M$,

and so maximizing $M$ is the same as minimizing the norm of $\beta$ in this formulation, giving:

minimize $\Vert \beta \Vert$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$.

Finally, minimizing the norm is the same as minimizing half the square of the norm, and the latter is much more convenient since it doesn’t have any of those pesky square roots:

minimize $\frac{1}{2}\Vert \beta \Vert^2$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$.

Now that we have the optimization problem set up, we use the Karush-Kuhn-Tucker (KKT) conditions: normally when trying to minimize a function you look for places where the derivative is zero, and the KKT conditions are the extension of that idea to the case where you have constraints.

The conditions firstly say that

$\beta= \sum \limits _{i=1}^N \mu_i y_i x_i$,

and

$\sum \limits _{i=1}^N y_i \mu_i =0$,

where

$\mu_i \ge 0$ for $i=1, \dots, N$

and

$\mu_i \left( y_i(\langle x_i, \beta \rangle - b) - 1 \right)=0$ for $i=1, \dots, N$.

The last condition shows that $\mu_i > 0$ only for those $x_i$ such that $y_i(\langle x_i, \beta \rangle - b)=1$; these are the $x_i$ that are ‘on the margin’, those input vectors closest to the optimal hyperplane.

This means that $\beta = \sum \limits _{i=1}^N \mu_i y_i x_i$ is determined only by these $x_i$ on the margin; we call these the support vectors.
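To see these conditions in action, here is a tiny hand-worked sketch: two symmetric points, one per class, for which the optimal hyperplane and multipliers can be written down by hand (no solver involved), and the KKT conditions checked numerically.

```python
import numpy as np

# Two points, one per class, symmetric about the origin
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

# For this data the optimal (rescaled) solution is beta = (1, 0), b = 0,
# and both points are support vectors with multipliers mu_1 = mu_2 = 1/2.
mu = np.array([0.5, 0.5])
beta = (mu * y) @ X   # beta = sum_i mu_i y_i x_i
b = 0.0

print(beta)                 # [1. 0.]
print(np.sum(mu * y))       # 0.0, as required
print(y * (X @ beta - b))   # [1. 1.]: both points sit exactly on the margin
```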

The optimization problem can be solved using quadratic programming techniques. It is also sometimes useful to take the equations given by the KKT conditions and convert the problem into the ‘dual’ optimization problem in terms of the multipliers $\mu_i$.

## Compromising

Life isn’t always easy, and sometimes the data is not linearly separable. It is desirable that our algorithm be able to tolerate some errors and try to minimize them; otherwise, even if the data are linearly separable, the algorithm may fall over itself trying to fit a few bad apples in the data, rather than going for a more natural fit that will hopefully generalize better.

We do this by introducing slack variables $\varepsilon_i \ge 0$ for $i=1, \dots, N$ and changing our constraint from

$y_if(x_i) \ge M$

to

$y_if(x_i) \ge M(1-\varepsilon_i)$,

and as before this changes to

$y_i f(x_i) \ge 1 -\varepsilon_i$.

Of course we want to minimize the errors, and so we can recast our optimization problem as

minimize $\frac{1}{2} \Vert \beta \Vert^2+ C\sum \limits_{i=1}^N\varepsilon_i$,

subject to $y_if(x_i) \ge 1 - \varepsilon_i$, $\varepsilon_i \ge 0$ for $i=1, \dots, N$,

where $C$ is a cost parameter that we need to choose, controlling our tolerance for mistakes.
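One useful observation: for fixed $\beta$ and $b$, the best choice of slack is $\varepsilon_i = \max(0, 1 - y_i f(x_i))$, since the objective increases with each $\varepsilon_i$. This gives a direct way to evaluate the objective; a minimal sketch (the data and hyperplane here are made up for illustration):

```python
import numpy as np

def objective(beta, b, X, y, C):
    # Soft-margin objective with the optimal slacks substituted in:
    # eps_i = max(0, 1 - y_i * f(x_i)), where f(x) = <x, beta> - b.
    f = X @ beta - b
    eps = np.maximum(0.0, 1.0 - y * f)
    return 0.5 * np.dot(beta, beta) + C * np.sum(eps)

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
beta, b = np.array([1.0, 0.0]), 0.0

# Only the third point violates the margin, with slack 1 - 0.5 = 0.5
print(objective(beta, b, X, y, C=1.0))  # 0.5 * 1 + 1.0 * 0.5 = 1.0
```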

The KKT conditions for this new problem are

$\beta = \sum \limits _{i=1}^N \alpha_i y_i x_i$,

$\sum \limits _{i=1}^N y_i \alpha_i =0$,

$\alpha_i = C - \mu_i$,

where $\mu_i \ge 0$ for $i=1, \dots, N$.

Instead of trying to minimize an objective function subject to several complex constraints, it is often easier to collect the constraints into the Lagrangian and optimize that, subject to simpler constraints; this is the method of Lagrange multipliers.

In our case the Lagrangian is

$L_P(\beta, b, \varepsilon, \alpha, \mu) = \frac{1}{2}\Vert \beta \Vert^2 + C \sum \limits _{i=1}^N \varepsilon_i - \sum \limits _{i=1}^N \mu_i \varepsilon_i - \sum \limits _{i=1}^N \alpha_i\left(y_i(\langle x_i, \beta \rangle - b)-(1-\varepsilon_i)\right)$,

and minimizing our objective function is equivalent to minimizing $L_P$ with respect to $\beta, b, \varepsilon$ whilst at the same time maximizing $L_P$ with respect to $\mu, \alpha$. This is why the problem is called a saddle-point problem.

The constraints for this equivalent problem are $\alpha, \mu \ge 0$ and $\sum \limits_{i=1}^N \alpha_i y_i=0$, the others are taken care of automagically.

Now we can substitute the KKT condition equations above to simplify $L_P$, giving the dual Lagrangian

$L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$.
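As a numerical sanity check, $L_D$ only needs the labels, the multipliers, and the Gram matrix of inner products. On a hand-worked two-point example (multipliers chosen by hand, not by a solver), the dual value agrees with the primal value $\frac{1}{2}\Vert \beta \Vert^2$, as it should at the optimum:

```python
import numpy as np

def dual_objective(alpha, y, K):
    # L_D(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K[i, j]
    ay = alpha * y
    return np.sum(alpha) - 0.5 * ay @ K @ ay

# Two symmetric points, one per class; the optimal beta is (1, 0)
# with alpha_1 = alpha_2 = 1/2 (worked out by hand).
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
K = X @ X.T  # Gram matrix of inner products

print(dual_objective(alpha, y, K))  # 0.5, matching (1/2)||beta||^2 = 0.5
```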

So our new problem is

maximize $L_D$

subject to $\mu \ge 0, \alpha \ge 0$ and $\sum \limits _{i=1}^N \alpha_i y_i=0$,

but $\mu$ doesn’t appear in $L_D$! Then we notice that the KKT condition

$\alpha_i = C -\mu_i$

together with $\mu_i \ge 0$ and $\alpha_i \ge 0$, is equivalent to constraining $\alpha_i \in [0, C]$.

So our final formulation is

maximize $L_D$

subject to $0 \le \alpha \le C$ and $\sum \limits _{i=1}^N \alpha_i y_i=0$.

A much simpler formulation! In fact this is a convex quadratic programming problem, so it can be solved using out-of-the-box methods from a library.

## The Kernel Trick

We’ve now seen how to cope with errors by using a cost parameter $C$; this is a form of regularization that can control overfitting. But what if the true decision boundary is genuinely something a bit more complex than a hyperplane? The solution to this problem is to change coordinates from $\mathbb{R}^k$ to $\mathbb{R}^m$ using some mapping $T$, and fit a hyperplane $H$ in this new space; the preimage $T^{-1}H$ can then be a nonlinear surface in $\mathbb{R}^k$.

This is of course standard practice for any machine learning algorithm: feature engineering. What is special about support vector machines is that the function to be maximized is

$L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$,

notice that you can compute this using only inner products in whatever coordinate system you use. This means that if you have a function

$k: \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$

such that

$k(x,y) = \langle T(x), T(y) \rangle$,

then there is no need to work directly in the transformed space at all! In fact you don’t even need to know what $T$ is, as long as you have $k$.

This is a big deal because this can be much more computationally efficient if you want to transform to a very high dimensional space. This is what is called the kernel trick, because we get to have the benefits of a richer feature space on the cheap.
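A small sketch makes this concrete: for the 2nd-degree polynomial kernel $k(x,y) = (1 + \langle x, y \rangle)^2$ on $\mathbb{R}^2$, there is an explicit feature map $T$ into $\mathbb{R}^6$ (written out below only for illustration — the whole point is that in practice you never need it):

```python
import numpy as np

def poly_kernel(x, y):
    # 2nd-degree polynomial kernel
    return (1.0 + np.dot(x, y)) ** 2

def T(x):
    # Explicit feature map into R^6 satisfying k(x, y) = <T(x), T(y)>
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y))  # 4.0
print(np.isclose(poly_kernel(x, y), np.dot(T(x), T(y))))  # True
```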

So in order to do this, you just have to dream up a kernel $k$. To make sure that your $k$ actually defines an inner product in some (Hilbert) space, by Mercer’s Theorem what you need to check is that, for any choice of vectors $x_1, \dots, x_n$, the matrix $[k(x_i, x_j)]_{i,j}$ is positive semi-definite.

Some common choices are:

dth-degree polynomial kernel: $k(x,y)=(1+\langle x,y \rangle)^d$,

radial basis kernel: $k(x,y) = \exp(-\gamma \Vert x-y \Vert^2)$.
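Mercer’s condition can at least be sanity-checked numerically (this is not a proof — the condition must hold for every choice of points, and this only samples one random set):

```python
import numpy as np

def gram_matrix(k, xs):
    # K[i, j] = k(x_i, x_j)
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

def looks_positive_semidefinite(K, tol=1e-10):
    # A symmetric matrix is PSD iff all its eigenvalues are >= 0 (up to tolerance)
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Radial basis kernel with gamma = 1
rbf = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
print(looks_positive_semidefinite(gram_matrix(rbf, xs)))  # True
```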

I have decided not to do a Python implementation this time, because I would have to get into quadratic programming to do it properly; perhaps I will write a post on this topic in the future.