Today we will be discussing a machine learning algorithm that is probably the most mathematically sophisticated of all the algorithms we have looked at so far.
Don’t let that daunt you though! Not all the details are essential for understanding the general idea, and it can certainly be used without understanding the maths, although if you are content with this then there’s not much point reading a blog like this!
This post is based on what I have learned from reading the relevant chapter in the excellent book *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman.
The premise is exactly the same as for the perceptron algorithm: we have input data $x_1, \dots, x_N \in \mathbb{R}^p$ with class labels $y_1, \dots, y_N \in \{-1, +1\}$.
In order to classify the inputs, we seek a separating hyperplane $H$. Recall that a hyperplane is just a plane that may not pass through the origin, and can be described as

$$H = \{x \in \mathbb{R}^p : \beta^T x + \beta_0 = 0\}$$

for some $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$.

It is handy to assume that $\beta$ is a unit vector, and geometrically it is the normal vector to the hyperplane. I will refer to $\beta_0$ as the bias. If the bias is zero, then $H$ is just a regular plane through the origin, and if it is nonzero $H$ is parallel to this plane; the bias controls how far ‘above’ or ‘below’ the origin $H$ is. In fact if $\|\beta\| = 1$, then the distance from the origin to $H$ is $|\beta_0|$.
If we define

$$f(x) = \beta^T x + \beta_0,$$

then $H = \{x : f(x) = 0\}$. Moreover, $f(x)$ is the signed distance of $x$ from $H$. Note that if $\beta$ was not assumed to be a unit vector, $f(x)$ would instead be proportional to the signed distance. If $f(x) > 0$ we say that $x$ is above $H$, and if $f(x) < 0$ we say that $x$ is below $H$.
To classify a point $x$, we say it has label $y = +1$ if $x$ is above $H$, and $y = -1$ if it is below.
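To make the geometry concrete, here is a minimal sketch in Python (the names `signed_distance` and `classify` are mine, not from any library):

```python
def signed_distance(x, beta, beta0):
    # f(x) = beta^T x + beta0: the signed distance of x from the
    # hyperplane H, assuming beta is a unit vector.
    return sum(b * xi for b, xi in zip(beta, x)) + beta0

def classify(x, beta, beta0):
    # Label +1 if x is above H, -1 if below.
    return 1 if signed_distance(x, beta, beta0) > 0 else -1

# The hyperplane x1 = 1 in the plane: beta = (1, 0), beta0 = -1.
print(signed_distance((3.0, 5.0), (1.0, 0.0), -1.0))  # 2.0
print(classify((0.0, 0.0), (1.0, 0.0), -1.0))         # -1
```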
If separating the classes this way is possible, then the perceptron algorithm will converge to a hyperplane that correctly classifies all of the points.
One problem is that there are infinitely many such hyperplanes; intuitively we would like to choose one that maximizes the distances of all the points from it. If there are points close to the decision boundary, then it is likely that unseen points from the same distribution might end up on the wrong side of the hyperplane and be misclassified. The minimum distance of all the points to $H$ is called the margin of the decision boundary, so we would like to maximize the margin.
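The margin of a given hyperplane is easy to compute directly; a small sketch, assuming the convention above that $y_i f(x_i)$ is positive exactly when $x_i$ is on the correct side:

```python
import math

def margin(points, labels, beta, beta0):
    # Smallest signed distance y_i * f(x_i) / ||beta|| over the data;
    # positive iff every point is on its correct side of the hyperplane.
    norm = math.sqrt(sum(b * b for b in beta))
    f = lambda x: sum(b * xi for b, xi in zip(beta, x)) + beta0
    return min(y * f(x) / norm for x, y in zip(points, labels))

pts = [(0.0, 2.0), (0.0, -2.0)]
ys = [1, -1]
print(margin(pts, ys, (0.0, 1.0), 0.0))  # 2.0
```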
Another problem is that the data might be too mixed up for there to be a separating hyperplane: no matter which one you choose, some points get misclassified. What we need is a way of choosing the hyperplane that does the best job of trying to separate the points.

Finally, it may be that a hyperplane is a bad approximation to the true decision boundary; we need something wigglier.
Let’s solve these problems in turn.
Maximizing the Margin
The first step in solving a machine learning problem is to figure out how to phrase it as an optimization problem, so let’s try that.
Letting $M$ denote the margin, we would like to

maximize $M$ over $\beta, \beta_0$,

subject to $y_i(\beta^T x_i + \beta_0) \geq M$ for $i = 1, \dots, N$, and $\|\beta\| = 1$.

Note that $y_i(\beta^T x_i + \beta_0) > 0$ implies that $x_i$ is correctly classified, since the signs of $y_i$ and $f(x_i)$ must be the same for the product to be positive.
Now we just do a little finessing. Notice that, if we drop the requirement that $\beta$ be a unit vector,

$$\frac{1}{\|\beta\|} y_i(\beta^T x_i + \beta_0) \geq M$$

is the same as

$$y_i(\beta^T x_i + \beta_0) \geq M\|\beta\|,$$

and so the optimization problem is equivalent to:

maximize $M$ over $\beta, \beta_0$,

subject to $y_i(\beta^T x_i + \beta_0) \geq M\|\beta\|$ for $i = 1, \dots, N$.

These constraints are unaffected by positively rescaling $(\beta, \beta_0)$, so we can fix $\|\beta\| = 1/M$, and so maximizing $M$ is the same as minimizing the norm of $\beta$ in this formulation, giving:

minimize $\|\beta\|$ over $\beta, \beta_0$,

subject to $y_i(\beta^T x_i + \beta_0) \geq 1$ for $i = 1, \dots, N$.

Finally, minimizing the norm is the same as minimizing half the square of the norm, and the latter is much more convenient since it doesn’t have any of those pesky square roots:

minimize $\frac{1}{2}\|\beta\|^2$ over $\beta, \beta_0$,

subject to $y_i(\beta^T x_i + \beta_0) \geq 1$ for $i = 1, \dots, N$.
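As a sanity check on this reformulation: when the rescaled constraints $y_i(\beta^T x_i + \beta_0) \geq 1$ hold with equality for the closest points, the margin is exactly $1/\|\beta\|$. A small hypothetical example:

```python
import math

def margin(points, labels, beta, beta0):
    # Smallest signed distance y_i * f(x_i) / ||beta|| over the data.
    norm = math.sqrt(sum(b * b for b in beta))
    f = lambda x: sum(b * xi for b, xi in zip(beta, x)) + beta0
    return min(y * f(x) / norm for x, y in zip(points, labels))

# Two points straddling the hyperplane x2 = 0, with beta scaled so that
# the constraints y_i * f(x_i) >= 1 are tight: beta = (0, 0.5), beta0 = 0.
pts, ys = [(0.0, 2.0), (0.0, -2.0)], [1, -1]
beta, beta0 = (0.0, 0.5), 0.0
print(margin(pts, ys, beta, beta0))             # 2.0
print(1 / math.sqrt(sum(b * b for b in beta)))  # 1 / ||beta|| = 2.0
```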
Now that we have the optimization problem set up, we use the Karush–Kuhn–Tucker (KKT) conditions: normally when trying to minimize a function you look for places where the derivative is zero, and this is the extension of that idea when you have constraints.

The conditions firstly say that, at the optimum, there exist multipliers $\alpha_i \geq 0$ with

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \qquad \sum_{i=1}^N \alpha_i y_i = 0,$$

together with the complementary slackness condition

$$\alpha_i \left[ y_i(\beta^T x_i + \beta_0) - 1 \right] = 0 \quad \text{for } i = 1, \dots, N.$$

The last condition shows that $\alpha_i > 0$ only for those $i$ such that $y_i(\beta^T x_i + \beta_0) = 1$; these are the $x_i$ that are ‘on the margin’, those input vectors closest to the optimal hyperplane.

This means that $\beta = \sum_i \alpha_i y_i x_i$ is determined only by these $x_i$ on the margin; we call these the support vectors.
The optimization problem can be solved using quadratic programming techniques. It is also sometimes useful to look at the equations given by the KKT conditions and convert the problem into the ‘dual’ optimization problem in terms of the $\alpha_i$.
Life isn’t always easy, and sometimes the data is not linearly separable. It is desirable that our algorithm be able to tolerate some errors and try to minimize them; otherwise, even if the data are linearly separable, the algorithm may fall over itself trying to fit a few bad apples in the data, rather than going for a more natural fit that will hopefully generalize better.
We do this by introducing slack variables $\xi_i \geq 0$ for $i = 1, \dots, N$ and changing our constraint from

$$y_i(\beta^T x_i + \beta_0) \geq M \quad \text{to} \quad y_i(\beta^T x_i + \beta_0) \geq M(1 - \xi_i),$$

and as before this changes to

$$y_i(\beta^T x_i + \beta_0) \geq 1 - \xi_i.$$

Of course we want to minimize the errors, and so we can recast our optimization problem as

minimize $\frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^N \xi_i$ over $\beta, \beta_0$,

subject to $\xi_i \geq 0$, $y_i(\beta^T x_i + \beta_0) \geq 1 - \xi_i$ for $i = 1, \dots, N$,

where $C$ is a cost parameter that we need to choose, controlling our tolerance for mistakes.
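For a fixed hyperplane, the smallest feasible slack for each point is $\xi_i = \max(0,\; 1 - y_i(\beta^T x_i + \beta_0))$, so the objective can be evaluated directly. A sketch (the function name is mine):

```python
def soft_margin_objective(points, labels, beta, beta0, C):
    # 0.5 * ||beta||^2 + C * sum(xi), with each xi set to its smallest
    # feasible value max(0, 1 - y_i * f(x_i)).
    f = lambda x: sum(b * xi for b, xi in zip(beta, x)) + beta0
    slacks = [max(0.0, 1.0 - y * f(x)) for x, y in zip(points, labels)]
    return 0.5 * sum(b * b for b in beta) + C * sum(slacks), slacks

# The second point is on the wrong side of the margin and needs slack.
pts, ys = [(2.0, 0.0), (-0.5, 0.0)], [1, 1]
obj, slacks = soft_margin_objective(pts, ys, (1.0, 0.0), 0.0, C=1.0)
print(slacks)  # [0.0, 1.5]
print(obj)     # 0.5 + 1.0 * 1.5 = 2.0
```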
The KKT conditions for this new problem are

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \qquad \sum_{i=1}^N \alpha_i y_i = 0, \qquad \alpha_i = C - \mu_i,$$

together with the complementary slackness conditions

$$\alpha_i \left[ y_i(\beta^T x_i + \beta_0) - (1 - \xi_i) \right] = 0, \qquad \mu_i \xi_i = 0,$$

where $\alpha_i, \mu_i \geq 0$ for $i = 1, \dots, N$.
Instead of trying to optimize an objective function subject to several complex constraints, it is often easier to collect the constraints into the Lagrangian and optimize that, subject to simpler constraints; this is the method of Lagrange multipliers.
In our case the Lagrangian is

$$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left[ y_i(\beta^T x_i + \beta_0) - (1 - \xi_i) \right] - \sum_{i=1}^N \mu_i \xi_i,$$

and minimizing our objective function is equivalent to minimizing $L_P$ with respect to $\beta, \beta_0, \xi_i$ whilst at the same time maximizing it with respect to $\alpha_i, \mu_i$. This is why this is called a saddle-point method.

The constraints for this equivalent problem are $\alpha_i \geq 0$ and $\mu_i \geq 0$; the others are taken care of automagically.
Now we can substitute the KKT condition equations above to simplify $L_P$, giving the dual Lagrangian

$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \, x_i^T x_j.$$

So our new problem is

maximize $L_D$ over $\alpha, \mu$,

subject to $\alpha_i \geq 0$, $\mu_i \geq 0$, and $\sum_{i=1}^N \alpha_i y_i = 0$,

but $\mu_i$ doesn’t appear in $L_D$! But then we notice that the KKT condition

$$\alpha_i = C - \mu_i,$$

together with $\mu_i \geq 0$, is equivalent to constraining $\alpha_i \leq C$.

So our final formulation is

maximize $L_D$ over $\alpha$,

subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^N \alpha_i y_i = 0$.
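To see the dual in action on a tiny example, here is a sketch that evaluates $L_D$ and recovers $\beta$ from the KKT condition $\beta = \sum_i \alpha_i y_i x_i$. The $\alpha$ values below were found by hand for this symmetric two-point problem, not by a solver:

```python
def dual_objective(alpha, points, labels):
    # L_D = sum(alpha_i) - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * labels[i] * labels[j] * dot(points[i], points[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def recover_beta(alpha, points, labels):
    # KKT condition: beta = sum_i alpha_i y_i x_i
    p = len(points[0])
    return [sum(a * y * x[k] for a, y, x in zip(alpha, labels, points))
            for k in range(p)]

pts, ys = [(1.0, 0.0), (-1.0, 0.0)], [1, -1]
alpha = [0.5, 0.5]  # maximizes L_D for this problem; satisfies sum(alpha_i y_i) = 0
print(dual_objective(alpha, pts, ys))  # 0.5
print(recover_beta(alpha, pts, ys))    # [1.0, 0.0]
```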
A much simpler formulation! In fact this is a convex quadratic programming problem, so it can be solved using out-of-the-box methods from a library.
The Kernel Trick
We’ve now seen how to cope with errors by using a cost parameter $C$; this is a form of regularization that can control overfitting. But what if the true decision boundary is genuinely something a bit more complex than a hyperplane? The solution to this problem is to change coordinates from $\mathbb{R}^p$ to some other (possibly much higher-dimensional) space using a mapping $h$, then fit a hyperplane in this new space; the pullback of this hyperplane under $h$ can be a nonlinear surface in $\mathbb{R}^p$.
This is of course a standard practice for any machine learning algorithm: feature engineering. What is special about support vector machines is that the objective function to be optimized is

$$L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j \langle h(x_i), h(x_j) \rangle;$$

notice that you can compute this using only inner products in whatever coordinate system you use. This means that if you have a function

$$K(x, x') = \langle h(x), h(x') \rangle,$$

then there is no need to work directly in the transformed space at all! In fact you don’t even need to know what $h$ is, as long as you have $K$.
This is a big deal because this can be much more computationally efficient if you want to transform to a very high dimensional space. This is what is called the kernel trick, because we get to have the benefits of a richer feature space on the cheap.
So in order to do this, you just have to dream up a kernel $K$. To make sure that your $K$ actually defines an inner product in some (Hilbert) space, by Mercer’s Theorem what you need to check is that, for any choice of vectors $x_1, \dots, x_n$, the matrix $[K(x_i, x_j)]_{i,j}$ is positive semi-definite.
Some common choices are:
$d$th-degree polynomial kernel: $K(x, x') = (1 + \langle x, x' \rangle)^d$,
radial basis kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$.
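Both kernels are one-liners; the check below is only a randomized sanity test of Mercer’s condition on a finite sample (a necessary condition, not a proof), and all the names here are my own:

```python
import math
import random

def poly_kernel(x, xp, d=2):
    # d-th degree polynomial kernel: (1 + <x, x'>)^d
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** d

def rbf_kernel(x, xp, gamma=1.0):
    # radial basis kernel: exp(-gamma * ||x - x'||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

def gram(K, xs):
    return [[K(xi, xj) for xj in xs] for xi in xs]

def looks_psd(G, trials=200, seed=0):
    # Check z^T G z >= 0 for random z: necessary (not sufficient) for PSD.
    rng = random.Random(seed)
    n = len(G)
    for _ in range(trials):
        z = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n)) < -1e-9:
            return False
    return True

xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(poly_kernel((1.0, 1.0), (1.0, 1.0), d=2))  # (1 + 2)^2 = 9.0
print(looks_psd(gram(rbf_kernel, xs)))           # True
```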
I have decided not to do a full python implementation this time, because I would have to get into quadratic programming to do it properly; perhaps I will write a post on this topic in the future.