
Enter the Perceptron

Introduction

Some time ago in my post on logistic regression, I talked about the problem where you have an n \times p matrix X representing n data points with p features, and a vector y \in \{0,1\}^n of classifications.

In linear regression you would produce a vector \beta \in \mathbb{R}^p so that X\beta \approx y, but the problem here is that the output X\beta can be any real vector, not one with entries in \{0,1\}. The solution we investigated was the logistic link function p(x) = \frac{1}{1+\exp(-x)}, which maps any real number to something we can interpret as a probability.
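As a quick reminder, here is that link function in code (a minimal sketch, not from the original post):

import numpy as np

def logistic(x):
    #Maps any real number into (0, 1), which we read as a probability.
    return 1.0/(1.0 + np.exp(-x))

print(logistic(0.0))  #0.5
print(logistic(4.0))  #~0.982
print(logistic(-4.0)) #~0.018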

That was one solution; another, even simpler one is the function defined by s(x) = \left\{\begin{array}{cc} 1 & x \ge 0 \\ -1 & x < 0 \end{array}\right.. It is convenient in this case to think of the response as taking values in \{-1,1\} rather than \{0,1\}.

Previously we added a column of ones to X in order to create an intercept term; here we will write it separately, so that X \mapsto X\beta + b, where the scalar b \in \mathbb{R} (added to every component) is called the bias.
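To make the setup concrete, here is a minimal sketch of the resulting classifier, with hypothetical parameter values chosen purely for illustration:

import numpy as np

def s(z):
    #The threshold function: 1 if z >= 0, -1 otherwise.
    return 1 if z >= 0 else -1

def classify(x, beta, b):
    #Compute the affine score f(x) = beta^T x + b, then threshold.
    return s(np.dot(beta, x) + b)

#Hypothetical parameters for a two-feature problem.
beta = np.array([1.0, -2.0])
b = 0.5
print(classify(np.array([3.0, 1.0]), beta, b)) #1, since f(x) = 1.5 >= 0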

This post will take the book ‘The Elements of Statistical Learning’ (the section starting on p. 130) as its guide; the book is freely available online, and there is also a great blog post about it here.

The Algorithm

So define f(x) := \beta^T x + b; we classify a point as being in the positive class if f(x) > 0 and the negative class if f(x) < 0. Geometrically we have an affine hyperplane L := \{x : f(x) = 0\}, and points are classified according to whether they lie ‘above’ or ‘below’ the hyperplane.

The unit normal u to L is given by \beta/\Vert \beta \Vert; points ‘above’ L are those with positive projection onto this vector.

The signed distance of a point x to L is found by projecting the vector x-x_0 onto u, where x_0 is any point on the hyperplane (since an affine hyperplane is a translation of a linear subspace, there is a point corresponding to the translate of the origin). Since f(x_0) = 0 means \beta^T x_0 = -b, we get

u^T(x-x_0) = \frac{\beta^T}{\Vert \beta \Vert}(x-x_0) = (\beta^Tx+b)/ \Vert \beta \Vert = f(x)/ \Vert \beta \Vert = f(x)/ \Vert f'(x) \Vert,

notice this means that f(x) is proportional to the signed distance of x from L.
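As a sanity check, here is a small sketch of this computation, using a hypothetical hyperplane chosen for illustration:

import numpy as np

def signed_distance(x, beta, b):
    #f(x)/||beta||: positive above L, negative below.
    return (np.dot(beta, x) + b)/np.linalg.norm(beta)

#Hypothetical hyperplane x1 + x2 - 1 = 0 in the plane.
beta = np.array([1.0, 1.0])
b = -1.0
print(signed_distance(np.array([1.0, 1.0]), beta, b)) #+1/sqrt(2): above L
print(signed_distance(np.array([0.0, 0.0]), beta, b)) #-1/sqrt(2): below L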

If M is the set of misclassified points, then we can measure the error as the sum of the distances (up to a constant of proportionality) of the misclassified points from L, that is

D(\beta, b):=-\sum \limits _{x \in M}y_x f(x),

where y_x is the response of x. Since each x \in M is misclassified, y_x f(x) is negative (or zero, for points on L) for every summand, so the minus sign outside makes D non-negative.
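Translated directly into code, D looks something like this sketch (with M taken to be the points where y_x f(x) \le 0, matching the convention used in the code later on):

import numpy as np

def D(beta, b, X, y):
    #f(x) for every data point at once.
    scores = X.dot(beta) + b
    #M: the misclassified points, i.e. those with y_x * f(x) <= 0.
    M = scores*y <= 0
    #Each term y_x * f(x) is <= 0 on M, so the minus sign makes D >= 0.
    return -np.sum(y[M]*scores[M])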

If we fix M we can differentiate with respect to \beta and b, giving

\frac{\partial}{\partial \beta}D(\beta, b)= -\sum \limits _{x \in M}y_xx,

and

\frac{\partial}{\partial b}D(\beta, b)= -\sum \limits _{x \in M}y_x.
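These gradients vectorise neatly; here is a sketch under the same convention for M:

import numpy as np

def D_gradients(beta, b, X, y):
    #Returns (dD/dbeta, dD/db) for M held fixed.
    scores = X.dot(beta) + b
    M = scores*y <= 0
    dbeta = -np.sum(y[M][:, None]*X[M], axis=0) #-sum of y_x * x over M.
    db = -np.sum(y[M])                          #-sum of y_x over M.
    return dbeta, db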

Just like in logistic regression, we can use these partial derivatives to search for a minimum of D. The algorithm proposed by Frank Rosenblatt, the inventor of the perceptron, is to modify the hyperplane after looking at each individual misclassified item in M as it comes, that is (\beta, b) \mapsto (\beta,b) + \lambda(y_xx,y_x), where \lambda >0 is the learning rate. This has the advantage of being much faster than computing the global gradient informed by all of the data, and is an example of stochastic gradient descent.
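In code, a single stochastic update for a misclassified point x with label y_x is just (a minimal sketch):

def sgd_update(weight, bias, x, y_x, learning_rate):
    #Rosenblatt's update: (beta, b) <- (beta, b) + lambda*(y_x*x, y_x).
    #x is expected to be a numpy array of features.
    weight = weight + learning_rate*y_x*x
    bias = bias + learning_rate*y_x
    return weight, bias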

Of course, by changing the hyperplane as we go we may change M, so what we actually do is iterate over the data points one at a time, check whether the current point is misclassified under the latest hyperplane, update if it is, and move on to the next. This is then very easy to code.

The Code

First let’s recycle some code we used for logistic regression to get some test data.

import numpy as np
from scipy import linalg
import random as rd
import matplotlib.pyplot as plt

def test_set(a, b, no_samples, intercept, slope, variance, offset):
    #First we grab some random x-values in the range [a,b].
    X1 = [rd.uniform(a, b) for i in range(no_samples)]
    #Sorting them seems to make the plots work better later on.
    X1.sort()
    #Now we define the linear function that will underlie our synthetic dataset: the true decision boundary.
    def g(x):
        return intercept + slope*x
    #We create two classes of points R, B. R points are given by g(x) plus a Gaussian noise term with positive mean,
    #B points are given by g(x) minus a Gaussian noise term with positive mean.
    #(Note that rd.gauss takes a mean and a standard deviation, so 'variance' is really a standard deviation here.)
    R_turn = True
    X2 = [] #This will hold the y-coordinates of our data.
    Y = []  #This will hold the classification of our data.
    for x in X1:
        x2 = g(x)
        if R_turn:
            x2 += rd.gauss(offset, variance)
            Y.append(1)
        else:
            x2 -= rd.gauss(offset, variance)
            Y.append(-1)
        R_turn = not R_turn
        X2.append(x2)
    #Now we combine the input data into a single matrix.
    X1 = np.array([X1]).T
    X2 = np.array([X2]).T
    X = np.hstack((X1, X2))
    #Return the input, output, and true decision boundary. A list comprehension
    #rather than map, so the boundary is a plottable list in Python 3 as well.
    return [X, np.array(Y), [g(x) for x in X1[:, 0]]]

Fairly straightforward stuff I hope! If not then it should make sense when we plot it.

X,Y,g = test_set(a=0, b=5, no_samples=200, intercept=10, slope=0, variance=0.5, offset=1)
fig, ax = plt.subplots()
R =  X[::2] #Even terms
B =  X[1::2] #Odd terms

ax.scatter(R[:,0],R[:,1],marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:,0],B[:,1],marker='o', facecolor='blue', label="Outcome -1")
ax.plot(X[:,0], g, color='green', label="True Decision Boundary")
ax.legend(loc="upper left", prop={'size':8})
plt.show()

This will produce something like:

[Figure: perceptron_test_data — scatter plot of the two classes with the true decision boundary]

Now to implement the algorithm.

def perceptron(X, y, learning_rate=0.1, max_iter=1000):
    #Expecting X = n x m inputs, y = n x 1 outputs taking values in -1, 1.
    n, m = X.shape
    weight = np.zeros(m)
    bias = 0.0
    index = list(range(n))
    for n_iter in range(max_iter):
        rd.shuffle(index) #Randomise the visiting order on each pass.
        mistakes = 0
        for row in index:
            if (X[row, :].dot(weight) + bias)*y[row] <= 0: #Misclassified point.
                weight += learning_rate*X[row, :]*y[row]
                bias += learning_rate*y[row]
                mistakes += 1
        if mistakes == 0: #Every point correctly classified: we have converged.
            break
    return {'weight': weight, 'bias': bias}

We shuffle the order in which we visit the data points on each pass, to remove any peculiarities arising from the update order. Now let’s try it on our test data.

result = perceptron(X, Y, 0.1)

def decision_boundary(x, weight, bias):
    #Points on the boundary satisfy weight[0]*x + weight[1]*y + bias = 0; solve for y.
    return (-bias - x*weight[0])/weight[1]

fig, ax = plt.subplots()
R =  X[::2] #Even terms
B =  X[1::2] #Odd terms

ax.scatter(R[:,0],R[:,1],marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:,0],B[:,1],marker='o', facecolor='blue', label="Outcome -1")
ax.plot(X[:,0], g, color='green', label="True Decision Boundary")
ax.plot(X[:,0], [decision_boundary(x, result['weight'], result['bias']) for x in X[:,0]],
        label='Predicted Decision Boundary')
ax.legend(loc="upper left", prop={'size':8})
plt.show()

Producing:

[Figure: perceptron_test_results — test data with the true and predicted decision boundaries]

Not bad! Get the code here.
