**Introduction**

Some time ago in my post on logistic regression, I talked about the problem where you have a matrix $X$ representing data points with features, and a vector $y$ of classifications.

In linear regression you would produce a vector $\beta$ so that $X\beta \approx y$, but the problem here is that our output could be any real number, not $\{0, 1\}$. The solution we investigated was using the logistic link function $\sigma(x) = \frac{1}{1 + e^{-x}}$, which turns any real number into something we can interpret as a probability.
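As a quick reminder of how the logistic link behaves, here is a minimal sketch (the function name is my own choice):

```python
import numpy as np

def logistic(x):
    # Squashes any real number into the interval (0, 1),
    # which we can interpret as a probability.
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(0.0))     # 0.5: the midpoint
print(logistic(100.0))   # very close to 1
print(logistic(-100.0))  # very close to 0
```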

That was one solution; another, even simpler one is the sign function, defined $\operatorname{sgn}(x) = 1$ if $x > 0$ and $-1$ if $x < 0$. It is convenient in this case to think of the response as taking values in $\{-1, 1\}$, not $\{0, 1\}$.
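As a one-liner this link function looks like (the convention for $\operatorname{sgn}(0)$ varies; here I arbitrarily send 0 to $-1$):

```python
def sgn(x):
    # Maps any real number to a class label in {-1, +1}.
    # (The convention for sgn(0) varies; here we send 0 to -1.)
    return 1 if x > 0 else -1

print(sgn(2.7))   # 1
print(sgn(-0.3))  # -1
```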

Previously we added a column of ones to $X$ in order to create an intercept term; here we will write it separately, so that $f(x) = x^T\beta + \beta_0$, where $\beta_0$ is called the *bias*.

This post will take the book ‘The Elements of Statistical Learning’ as its guide (the section starting on p. 130); the book is freely available online, and there is also a great blog post about it here.

**The Algorithm**

So define $f(x) = x^T\beta + \beta_0$; we classify a point $x$ as being in the positive class if $f(x) > 0$ and the negative class if $f(x) < 0$. Geometrically we have an affine hyperplane $L = \{x : f(x) = 0\}$, and points are classified depending on whether they are ‘above’ or ‘below’ the hyperplane.

The unit normal to $L$ is given by $\beta / \lVert\beta\rVert$; points ‘above’ then means points with positive projection onto this vector.

The signed distance of a point $x$ to $L$ is found by projecting the vector $x - x_0$ onto $\beta / \lVert\beta\rVert$, where $x_0 \in L$ is the ‘origin’ of the hyperplane (since an affine hyperplane is a translation of a linear subspace, there is a point corresponding to the translation of the origin). So we get

$$\frac{\beta^T(x - x_0)}{\lVert\beta\rVert} = \frac{x^T\beta + \beta_0}{\lVert\beta\rVert} = \frac{f(x)}{\lVert\beta\rVert},$$

using that $x_0 \in L$ means $x_0^T\beta = -\beta_0$. Notice this means that $f(x)$ is proportional to the signed distance of $x$ from $L$.
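To make this concrete, here is a quick numerical check of the signed-distance formula on a made-up two-dimensional hyperplane (the values are for illustration only):

```python
import numpy as np

# The hyperplane x1 + x2 - 2 = 0, i.e. beta = (1, 1), beta_0 = -2.
beta = np.array([1.0, 1.0])
beta_0 = -2.0

def f(x):
    return x.dot(beta) + beta_0

def signed_distance(x):
    # f(x) divided by the norm of beta is the signed distance to the hyperplane.
    return f(x) / np.linalg.norm(beta)

print(signed_distance(np.array([2.0, 2.0])))  # sqrt(2): 'above' the line
print(signed_distance(np.array([0.0, 0.0])))  # -sqrt(2): 'below' the line
```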

If $M$ is the set of misclassified points, then we can measure the error as the sum of the distances (up to a constant of proportionality) of the misclassified points to $L$, that is

$$D(\beta, \beta_0) = -\sum_{i \in M} y_i(x_i^T\beta + \beta_0),$$

where $y_i$ is the response of $x_i$; notice that misclassification makes each summand negative, which is reversed by the minus sign outside.
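As a sanity check of the sign conventions, here is the contribution of a single made-up misclassified point to the error (the hyperplane and point are illustrative, not from the post's dataset):

```python
import numpy as np

# Hyperplane x1 + x2 - 2 = 0: beta = (1, 1), beta_0 = -2.
beta = np.array([1.0, 1.0])
beta_0 = -2.0

# A point with true label +1 that the hyperplane puts on the negative side.
x_i, y_i = np.array([0.5, 0.5]), 1       # f(x_i) = -1 < 0, so misclassified

D = -(y_i * (x_i.dot(beta) + beta_0))    # this point's contribution to D
print(D)  # 1.0: the summand was negative, the minus sign makes it positive
```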

If we fix $M$ we can differentiate $D$ with respect to $\beta$ and $\beta_0$, giving

$$\frac{\partial D}{\partial \beta} = -\sum_{i \in M} y_i x_i,$$

and

$$\frac{\partial D}{\partial \beta_0} = -\sum_{i \in M} y_i.$$

Just like in logistic regression, we can use these partial derivatives to search for a minimum of $D$. The algorithm proposed by Frank Rosenblatt, the inventor of the perceptron, is to modify the hyperplane after looking at each individual misclassified item in $M$ as it comes, that is

$$\beta \leftarrow \beta + \rho y_i x_i, \qquad \beta_0 \leftarrow \beta_0 + \rho y_i,$$

where $\rho$ is the learning rate. This has the advantage of being much faster than computing the global gradient informed by all of the data, and is an example of stochastic gradient descent.
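A single stochastic update then looks something like the following sketch (the function name and the example values are my own choices, not from the book):

```python
import numpy as np

def perceptron_update(weight, bias, x_i, y_i, rho=0.1):
    # One stochastic gradient step, applied only when the point is
    # misclassified, i.e. when y_i * f(x_i) <= 0.
    weight = weight + rho * y_i * x_i
    bias = bias + rho * y_i
    return weight, bias

# A misclassified positive point nudges the hyperplane towards classifying
# it correctly.
w, b = np.zeros(2), 0.0
x_i, y_i = np.array([1.0, 2.0]), 1
w, b = perceptron_update(w, b, x_i, y_i)
print(w, b)  # [0.1 0.2] 0.1
```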

Of course, by changing the hyperplane as we go we may change $M$, so what we actually do is iterate over each datapoint in a fixed order, check whether it is misclassified under the latest hyperplane, update if so, and move on to the next point. This is then very easy to code.

**The Code**

First let’s recycle some code we used for logistic regression to get some test data.

```python
import numpy as np
import random as rd
import matplotlib.pyplot as plt

def test_set(a, b, no_samples, intercept, slope, variance, offset):
    # First we grab some random x-values in the range [a,b].
    X1 = [rd.uniform(a, b) for i in range(no_samples)]
    # Sorting them seems to make the plots work better later on.
    X1.sort()

    # The linear function underlying our synthetic dataset:
    # the true decision boundary.
    def g(x):
        return intercept + slope * x

    # We create two classes of points R, B. R points are given by g(x) plus a
    # Gaussian noise term with positive mean, B points are given by g(x) minus
    # a Gaussian noise term with positive mean.
    R_turn = True
    X2 = []  # This will hold the y-coordinates of our data.
    Y = []   # This will hold the classification of our data.
    for x in X1:
        x2 = g(x)
        if R_turn:
            x2 += rd.gauss(offset, variance)
            Y.append(1)
        else:
            x2 -= rd.gauss(offset, variance)
            Y.append(-1)
        R_turn = not R_turn
        X2.append(x2)

    # Now we combine the input data into a single matrix.
    X1 = np.array([X1]).T
    X2 = np.array([X2]).T
    X = np.hstack((X1, X2))
    # Return the input, the output, and the true decision boundary
    # (as a list rather than a map object, so it can be plotted directly).
    return [X, np.array(Y).T, [g(x) for x in X1]]
```

Fairly straightforward stuff I hope! If not then it should make sense when we plot it.

```python
X, Y, g = test_set(a=0, b=5, no_samples=200, intercept=10, slope=0,
                   variance=0.5, offset=1)
fig, ax = plt.subplots()
R = X[::2]   # Even rows: the positive class.
B = X[1::2]  # Odd rows: the negative class.
ax.scatter(R[:, 0], R[:, 1], marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:, 0], B[:, 1], marker='o', facecolor='blue', label="Outcome -1")
ax.plot(X[:, 0], g, color='green', label="True Decision Boundary")
ax.legend(loc="upper left", prop={'size': 8})
plt.show()
```

This will produce something like:

Now to implement the algorithm.

```python
def perceptron(X, y, learning_rate=0.1, max_iter=1000):
    # Expecting X an n x m input matrix, y an n-vector of labels in {-1, 1}.
    n, m = X.shape
    weight = np.zeros(m)
    bias = 0.0
    n_iter = 0
    index = list(range(n))
    while n_iter <= max_iter:
        rd.shuffle(index)
        for row in index:
            if (X[row, :].dot(weight) + bias) * y[row] <= 0:  # Misclassified point.
                weight += learning_rate * X[row, :] * y[row]
                bias += learning_rate * y[row]
        n_iter += 1
    return {'weight': weight, 'bias': bias}
```

We shuffle the order we visit the data points in on each pass, to get rid of any peculiarities caused by the update order. Now let’s try it on our test data.

```python
result = perceptron(X, Y, 0.1)

def decision_boundary(x, weight, bias):
    # Solve weight[0]*x + weight[1]*y + bias = 0 for y.
    return (-bias - x * weight[0]) / weight[1]

fig, ax = plt.subplots()
R = X[::2]   # Even rows
B = X[1::2]  # Odd rows
ax.scatter(R[:, 0], R[:, 1], marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:, 0], B[:, 1], marker='o', facecolor='blue', label="Outcome -1")
ax.plot(X[:, 0], g, color='green', label="True Decision Boundary")
ax.plot(X[:, 0],
        [decision_boundary(x, result['weight'], result['bias']) for x in X[:, 0]],
        label="Predicted Decision Boundary")
ax.legend(loc="upper left", prop={'size': 8})
plt.show()
```

Producing:

Not bad! Get the code here.
