Home » machine learning

Category Archives: machine learning


Support Vector Machines and the Kernel Trick

Today we will be discussing a machine learning algorithm that is probably the most mathematically sophisticated of all the algorithms we have looked at so far.

Don’t let that daunt you though! Not all the details are essential for understanding the general idea, and it can certainly be used without understanding the maths, although if you are content with this then there’s not much point reading a blog like this!

This post is based on what I have learned from reading the relevant chapter in the excellent book: Elements of Statistical Learning.

The Setup

The premise is exactly the same as the perceptron algorithm, we have input data x_1, \dots , x_N \in \mathbb{R}^k with class labels y_1, \dots, y_N \in \{-1,1\}.

In order to classify the inputs, we seek a seperating hyperplane H. Recall that a hyperplane is just a plane that may not pass through the origin, and can be described as

H= \{ x \in \mathbb{R}^k : \langle x, \beta \rangle =b \} for some \beta \in \mathbb{R}^k and b \in \mathbb{R}.

It is handy to assume that \beta is a unit vector, and geometrically it is the normal vector to the hyperplane. I will refer to b as the bias. If the bias is zero, then H is just a regular plane, and if it is nonzero H is parallel to this plane, the bias controls how far ‘above’ or ‘below’ the origin H is. In fact if H_0 =\{x: \langle x,\beta \rangle =0 \}, then H=H_0 + b\beta.

If we define

f: \mathbb{R}^k \to \mathbb{R}


f(x) = \langle x,\beta \rangle - b,

then H = f^{-1}(0).

Moreover, f(x) is the signed distance of x from H. Note that if \beta was not assumed to be a unit vector, it would instead be proportional to the signed distance. If f(x) >0 we say that x is above H, and if f(x) < 0 we say that x is below H.

To classify a point x, we say it has label 1 if above H, and -1 if below.

The Problem(s)

If this is possible, then the perceptron algorithm will converge to a hyperplane that correctly classifies all of the points.

One problem is that there are infinitely many such hyperplanes, intuitively we would like to choose one to maximize the distances of all the points from it. If there are points close to the decision boundary, then it is likely that unseen points from the same distribution might up on the wrong side of the hyperplane and be misclassified. The minimum distance of all the points to H is called the margin of the decision boundary, so we would like to maximise the margin.

Another problem is that the data might be too mixed up for there to be a seperating hyperplane, no matter which one you choose some points get misclassified. What we need is a way of choosing the hyperplane that does the best job of trying to seperate the points.

Finally it may be that a hyperplane is a bad approximation to the true decision boundary, we need something wiggilier.

Let’s solve these problems in turn.

Maximizing the Margin

The first step in solving a machine learning problem is to figure out how to phrase it as an optimization problem, so let’s try that.

Leting M denote the margin, we would like to

maximize M over \{\beta, b\},
subject to y_if(x_i) \ge M for i=1, \dots, N, M>0 and \Vert \beta \Vert =1.

Note that y_i f(x_i) >0 implies that x_i is correctly classified since their signs must be the same for the product to be positive.

Now we just do a litle finessing, notice that

y_if(x_i) \ge M, ie y_i(\langle x_i, \beta \rangle -b) \ge M

is the same as

y_i(\langle x_i, \frac{\beta}{M} \rangle -\frac{b}{M}) \ge 1,

and so the optimization problem is equivalent to:

maximize M over \{\beta, b\},
subject to y_if(x_i) \ge 1 for i=1, \dots, N, M>0 and \Vert \beta \Vert =1/M,

and so maximizing M is the same as minimizing the norm of \beta in this formulation, giving:

minimize \Vert \beta \Vert over \{\beta, b\},
subject to y_if(x_i) \ge 1 for i=1, \dots, N.

Finally minimizing the norm is the same as minimizing half the square of the norm, and the latter is much more convienient since it doesn’t have any of those pesky square roots:

minimize \frac{1}{2}\Vert \beta \Vert^2 over \{\beta, b\},
subject to y_if(x_i) \ge 1 for i=1, \dots, N.

Now that we have the optimization problem setup, we use the Karush-Kuhn-Tucker conditions, normally when trying to minimize a function you look for places where the derivative is zero, and this the extension of that idea when you have constraints.

The conditions firstly say that

\beta= \sum \limits _{i=1}^N \mu_i y_i x_i,


\sum \limits _{i=1}^n y_i \mu_i =0,


\mu_i \ge 0 for i=1, \dots, N


\mu_i y_i(1-\langle x_i, \beta \rangle)=0 for i=1, \dots, N.

The last condition shows that \mu_i > 0 only for those x_i such that \langle x_i, \beta \rangle=1, these are the x_i that are ‘on the margin’, those input vectors closest to the optimal hyperplane.

This means that \beta = \sum \limits _{i=1}^N \mu_i y_i x_i is determined only by these x_i on the margin, we call these the support vectors.

The optimization problem can be solved using quadratic programming techniques, it is also sometimes useful to look at the equations given by the KKT conditions and convert it into the ‘dual’ optimization problem in terms of the \alpha_i.


Life isn’t always easy, and sometimes the data is not linearly seperable. It is desirable that our algorithm be able to tolerate some errors and try to minimize them, otherwise, even if the data are linearly seperable, the algorithm may fall over itself trying to fit a few bad apples in the data, rather than going for a more natural fit that will hopefully generalize better.

We do this by introducing slack variables \varepsilon_i \ge 0 for i=1, \dots, N and changing our constraint from

y_if(x_i) \ge M


y_if(x_i) \ge M(1-\varepsilon_i),

and as before this changes to

y_i f(x_i) \ge 1 -\varepsilon_i.

Of course we want to minimize the errors, and so we can recast our optimization problem as

minimize \frac{1}{2} \Vert \beta \Vert^2+ C\sum \limits_{i=1}^N\varepsilon_i,

subject to y_if(x_i) \ge 1 - \varepsilon_i, \varepsilon_i \ge 0 for i=1, \dots, N,

where C is a cost parameter that we need to choose, controlling our tolerance for mistakes.

The KKT conditions for this new problem are

\beta = \sum \limits _{i=1}^N \alpha_i y_i x_i,

\sum \limits _{i=1}^n y_i \alpha_i =0,

\alpha_i = C - \mu_i,

where \mu_i \ge 0 for i=1, \dots, N.

Instead of trying to minimize an objective function subject to several complex constraints, it is often easier to collect the constraints into the Lagrangian and minimize that, subject to simpler constraints, this is called the penalty method.

In our case the Lagrangian is

L_P(\beta, b, \varepsilon, \alpha, \mu) = \frac{1}{2}\Vert \beta \Vert^2 + C \sum \limits _{i=1}^N \varepsilon_i - \sum \limits _{i=1}^n \mu_i \varepsilon_i - \sum \limits _{i=1}^N \alpha_i(y_i(x_i^T\beta+b)-(1-\varepsilon_i)),

and minimizing our objective function is equivalent to minimizing L_P with respect to \beta, b, \varepsilon whilst at the same time maxmizing L_P with respect to \mu, \alpha. This is why this is called a saddle-point method.

The constraints for this equivalent problem are \alpha, \mu \ge 0 and \sum \limits_{i=1}^N \alpha_i y_i=0, the others are taken care of automagically.

Now we can substitute the KKT condition equations above to simplify L_P, giving the dual Lagrangian

L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle.

So our new problem is

maximize L_D

subject to \mu \ge 0, \alpha \ge 0 and \sum \limits _{i=1}^N \alpha_i y_i=0,

but \mu doesn’t appear in L_D! But then we notice that the KKT condition

\alpha_i = C -\mu_i

is equivalent to constraining \alpha_i \in [0, C].

So our final formulation is

maximize L_D

subject to 0 \le \mu \le C and \sum \limits _{i=1}^N \alpha_i y_i=0.

A much simpler formulation! In fact this is a convex quadratic programming problem, so it can be solved using out of the box methods from a library.

The Kernel Trick

We’ve now seen how to cope with errors by using a cost parameter C, this is a form of regularization that can control overfitting. But what if the true decision boundary is geninuely something a bit more complex than a hyperplane? The solution to this problem is to change coordinates from \mathbb{R}^k \to \mathbb{R}^m using some mapping T, then fit a hyperplane H in this new space, then the pullback T^{-1}H can be a nonlinear surface in \mathbb{R}^k.

This is of course a standard practice for any machine learning algorithm, feature engineering. What is special about support vector machines is that the function to be minimized is

L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,

notice that you can compute this using only inner products in whatever coordinate system you use. This means that if you have a function

k: \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}

such that

k(x,y) = \langle T(x), T(y) \rangle,

then there is no need to work directly in the transformed space at all! In fact you don’t even need to know what T is, as long as you have k.

This is a big deal because this can be much more computationally efficient if you want to transform to a very high dimensional space. This is what is called the kernel trick, because we get to have the benefits of a richer feature space on the cheap.

So in order to do this, you just have to dream up a kernel k, to make sure that your k actually defines an inner product in some (Hilbert) space, by Mercer’s Theorem what you need to check is that, for any choice of vectors x_1, \dots, x_n, the matrix [k(x_i, x_j)]_{i,j} is positive semi-definite.

Some common choices are:

dth-degree polynomial kernel: k(x,y)=(1+\langle x,y \rangle)^d,

radial basis kernel: k(x,y) = \exp(\gamma \Vert x-y \Vert^2).

I have decided not to do a python implementation this time, because I would have to get in to quadratic programming to do it properly, perhaps I will write a post on this topic in the future.


Theano, Autoencoders and MNIST

Recently I have finally upgraded my ancient laptop. Other than the ability for me to play the occasional video game, this means that I now have a dedicated Nvidia graphics card, in particular one that supports something called CUDA.

Now that I have the hardware, I thought that it would be fun to check out this python library theano. As I understand it, it has two very interesting aspects.

One is that it does symbolic computation, that is it represents variables as in mathematics, and their manipulations.

A variable in computer science is something like a integer or a string or  a list etc…, a piece of information stored somewhere in memory with a name made up by the programmer, and the compiler takes care of associating the human understandable name to the address of the information in memory.

A variable in mathematics is like x in x^2 + x + 1, an indeterminate quantity that can be manipulated using arithmetic operations, or using calculus and so on. So what in mathematics we call a variable corresponds roughly to a function in computer science.

The difference between symbolic computation and its variables, and regular functions, is that the programming language is aware of the mathematical nature of a variable, such that when you apply operations to variables to create new variables, it does this in an efficient way. The best way to understand this is a simple example. Let’s suppose we have two functions f(x) =x, g(x) = -x. Now we can combine these functions by adding them to make a new function h(x) = f(x) + g(x).

Now in python say, when you call h with an argument, it will pass this argument to both f and g and then add the result and return it.

If we are doing symbolic computation, and f,g were variables with f=-g, then the variable h=f+g=0 constantly. And so the computer does no work at all when using h, because it knows that it must always be zero.

So one important aspect is that symbolic computation can simplify and hence often optimize functions automatically.

Another is that theano can automagically differentiate variables with respect to variables! This might not sound like a big deal, but it is. In machine learning, we often create a complicated function, like in logistic regression or neural networks, and then derive a helpful form for the derivative to do gradient descent to optimize it. In theano, all you have to do is specify the objective function, and it will let you do gradient descent automatically.

The second amazing thing about theano is that it compiles its theano functions into C, and so it often runs much much faster than native python or numpy. It can also make it easier to run computations on a gpu which can mean a 100x speedup for the right cases!

Hopefully I’ve made you curious, so  click here for an installation guide. On this linux machine it was very simple. I am dual booting windows 8.1 and Ubuntu at the moment, it is very tricky to get the gpu aspect to work on windows, but here is a guide for the brave. To check that the GPU aspect is working, run the test code here.

Introduction to Theano

I recommend that you read the tutorials here, and also this excellent ipython worksheet here.

Let’s little play with the basic functionality.

import theano as th
from theano import tensor as T

#To make a variable, specify the type of variable from the tensor

#We can have scalars.

#Note that the name is just a string that theano will use
#to communicate to us about the variable.
print x
print x.type
#We can also have vectors.
v = T.dvector(name='vector_v')
print v
print v.type
#And matrices.
A = T.dmatrix(name='matrix_A')
print A
print A.type

#We can also make new variables using standard mathematical operations.
x_2 = x*x
print x_2
print x_2.type

#We can also make python functions which return new variables.
def power(variable, n):
    return variable**n
x_10 = power(x,10)
print x_10
print x_10.type

#We can of course do standard linear algebra operations also.
Av = T.dot(A,v)
print Av
print Av.type

#Of course when a variable is actually evaluated, you must 
#ensure that the dimensions are correct, or you will
#get an error.

#To see the value of a variable for a particular value of the variables
#comprising it, we make a theano function.
f = th.function([A,v], [Av])
#The syntax is a list of input variables, and then a list of output variables.

#The python code takes a little longer to run initially, because thenao
#compiles the function into C, but thereafter it will run extremely fast.

#Let's try using the function.
import numpy as np
print m
vec = np.asarray([1,2])
print vec
print f(m,vec)

#Now we can try computing gradients.
#First let's make a scalar variable by taking an inner product.
vTw = T.dot(v,w)

#Now we take the gradient with respect to w.
vTw_grad = T.grad(vTw,w)
print vTw_grad
print vTw_grad.type

#Now let's test it.
vec1 = np.asarray([1,2])
vec2 = np.asarray([0,0])

#To evaulate a variable given inputs, there is another syntax
#in additon to creating a thenano function.
print vTw_grad.eval({w:vec1,v:vec2})

#Next we discuss theano's shared variables,
#these differ from regular theano variables in that
#theano variables only have a value within a theano function
#whereas shared theano variables have a value independent of 
#being called in a function.
w=T.shared(name='shared_matrix', value=np.ones((2,3)))
print w
print w.type
print w.get_value()

#You can also set a shared variable's value.
print w.get_value()

#You can also have theano functions update shared variables,
#we illustrate this with a silly example that updates a matrix
#like in gradient descent.
cost = wx.mean()
cost_grad = T.grad(cost, w)

f=th.function(inputs=[x], outputs=[wx, cost], updates=[(w, w-0.1*cost_grad)])
#Notice the syntax of updates argument, should be a list
#of two tuples of the form: (variable_to_be_updated, updated_variable).
print f([1,1,1])
print w.get_value()

If you want to tinker with this, get the sourcecode here.


I don’t want to go into too much detail here, but an autoencoder is a feedforward neural network with a single hidden layer with the twist being that its target output is exactly the same as its input. If the number of hidden neurons is less than the number of input/output neurons, then the network will be forced to learn some sort of compression.

Even when the number of hidden neurons is large, there are various tricks in cost functions you can use to ensure that the network does something more interesting than learning the identity function.

The hope of an autoencoder is that it learns some structure in the data, so that the hidden layer can be thought of as a feature space that one can use for a supervised learning algorithm.

Basically then we have x \mapsto f_1(W_1x+b_1) =h the hidden layer, then h \mapsto f_2(W_2h+b_2), where b_1, b_2 are bias vectors, W_1, W_2 are matrices, and f_1, f_2 are element-wise non-linearities like the logistic function.

A simplification that is often made is to enforce W_1 = W_2^T. This gives the model less free parameters to learn, and so is a form of regularization. It also uses less memory. This is referred to as an autoencoder with tied weights.

Autoencoders with Theano

Thanks to automatic differentiation, backprogation is effortless with theano, and thanks to gpu computation, much faster also! I recommend having a look at my python implementation of backpropagation just so you can see the effort saved. This implementation benefited greatly from this tutorial.

import numpy as np
import theano as th
from theano import tensor as T
from numpy import random as rng

class AutoEncoder(object):
    def __init__(self, X, hidden_size, activation_function,
        #X is the data, an m x n numpy matrix
        #where rows correspond to datapoints
        #and columns correspond to features.
        assert type(X) is np.ndarray
        assert len(X.shape)==2
        self.X=th.shared(name='X', value=np.asarray(self.X, 
        #The config.floatX and borrow=True stuff is to get this to run
        #fast on the gpu. I recommend just doing this without thinking about
        #it until you understand the code as a whole, then learning more
        #about gpus and theano.
        self.n = X.shape[1]
        self.m = X.shape[0]
        #Hidden_size is the number of neurons in the hidden layer, an int.
        assert type(hidden_size) is int
        assert hidden_size > 0
        initial_W = np.asarray(rng.uniform(
                 low=-4 * np.sqrt(6. / (self.hidden_size + self.n)),
                 high=4 * np.sqrt(6. / (self.hidden_size + self.n)),
                 size=(self.n, self.hidden_size)), dtype=th.config.floatX)
        self.W = th.shared(value=initial_W, name='W', borrow=True)
        self.b1 = th.shared(name='b1', value=np.zeros(shape=(self.hidden_size,),
        self.b2 = th.shared(name='b2', value=np.zeros(shape=(self.n,),
    def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.1):
        index = T.lscalar()
        params = [self.W, self.b1, self.b2]
        hidden = self.activation_function(T.dot(x, self.W)+self.b1)
        output = T.dot(hidden,T.transpose(self.W))+self.b2
        output = self.output_function(output)
        #Use cross-entropy loss.
        L = -T.sum(x*T.log(output) + (1-x)*T.log(1-output), axis=1)
        #Return gradient with respect to W, b1, b2.
        gparams = T.grad(cost,params)
        #Create a list of 2 tuples for updates.
        for param, gparam in zip(params, gparams):
            updates.append((param, param-learning_rate*gparam))
        #Train given a mini-batch of the data.
        train = th.function(inputs=[index], outputs=[cost], updates=updates,

        import time
        start_time = time.clock()
        for epoch in xrange(n_epochs):
            print "Epoch:",epoch
            for row in xrange(0,self.m, mini_batch_size):
        end_time = time.clock()
        print "Average time per epoch=", (end_time-start_time)/n_epochs
    def get_hidden(self,data):
        hidden = self.activation_function(T.dot(x,self.W)+self.b1)
        transformed_data = th.function(inputs=[x], outputs=[hidden])
        return transformed_data(data)
    def get_weights(self):
        return [self.W.get_value(), self.b1.get_value(), self.b2.get_value()]

And that is it! Now we should test this puppy out, so let’s find some data.


The MNIST data is famous in machine learning circles, it consists of single handwritten digits. Nearly every paper written on neural networks and so on tests their contribution on the data. The task is to classify the digits, but we will just test our autoencoder. First we will steal a very nice helper function to load the data set, this will download it for you if you don’t have it already. You can find it here.

import cPickle
import gzip
import os

def load_data(dataset):
    ''' Loads the dataset

    :type dataset: string
    :param dataset: the path to the dataset (here MNIST)

    # LOAD DATA #

    # Download the MNIST dataset if it is not present
    data_dir, data_file = os.path.split(dataset)
    if data_dir == "" and not os.path.isfile(dataset):
        # Check if dataset is in the data directory.
        new_path = os.path.join(os.path.split(__file__)[0], "..", "data", dataset)
        if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz':
            dataset = new_path

    if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
        import urllib
        origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
        print 'Downloading data from %s' % origin
        urllib.urlretrieve(origin, dataset)

    print '... loading data'

    # Load the dataset
    f = gzip.open(dataset, 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    #train_set, valid_set, test_set format: tuple(input, target)
    #input is an numpy.ndarray of 2 dimensions (a matrix)
    #which row's correspond to an example. target is a
    #numpy.ndarray of 1 dimensions (vector)) that have the same length as
    #the number of rows in the input. It should give the target
    #target to the example with the same index in the input.

    return (train_set, valid_set, test_set)



We also want a plotting function to visualize the features learned, I won’t explain this here, but if you google “visualizing features of neural network” you can learn plenty about it.

def plot_first_k_numbers(X,k):
    from matplotlib import mpl,pyplot
    j = int(round(k / 10.0))
    fig, ax = pyplot.subplots(j,10)
    for i in range(k):


        ax[i/10, i%10].imshow(w,cmap=pyplot.cm.gist_yarg,
                      interpolation='nearest', aspect='equal')
        ax[i/10, i%10].axis('off')

        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom='off',      # ticks along the bottom edge are off
        top='off',         # ticks along the top edge are off
        axis='y',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        right='off',    # ticks along the top edge are off

Now we put it together.

def m_test(data):
    activation_function = T.nnet.sigmoid
    A = AutoEncoder(X, 500, activation_function, output_function)
    plot_first_k_numbers(W, 100, False)
    plot_first_k_numbers(W,100, True)


Running this, with 500 hidden units for 20 epochs with minibatch size of 20, the following features were learned:

features_learned_500h20_20Nothing too interesting yet, in order to get more useful features we could change the cost function to encourage sparsity, or introduce noise into the inputs, or use a different activation function. Hopefully this has been helpful for those just starting out with theano. Here is the source code.

Neural Networks Part 2: Python Implementation

Ok so last time we introduced the feedforward neural network. We discussed how input gets fed forward to become output, and the backpropagation algorithm for learning the weights of the edges.

Today we will begin by showing how the model can be expressed using matrix notation, under the assumption that the neural network is fully connected, that is each neuron is connected to all the neurons in the next layer.

Once this is done we will give a Python implementation and test it out.

Matrix Notation For Neural Networks

Most of  this I learned from here.

In what follows, vectors are always thought of as columns, and so the transpose a row.

So first off we have x \in \mathbb{R}^k, our input vector, and y \in \mathbb{R}^m our output vector.

Our neural network N will have N layers L_1, \dots , L_n with L_1   the input layer, and L_n the output layer.

For each layer L_i:

  • write n_i:=|L_i| for the number of neurons in that layer (not including the bias neurons);
  • write f_i: \mathbb{R}^{n_i} \to \mathbb{R}^{n_i} for the layer’s activation function;
  • write f_i': \mathbb{R}^{n_i} \to \mathbb{R}^{n_i} for vector of derivatives of the activation functions;
  • write I_i \in \mathbb{R}^{n_i} for the vector of inputs to the ith layer;
  • write b_i \in \mathbb{R}^{n_{i}} for the vector of biases;
  • write O_i \in \mathbb{R}^{n_i}= f_i(I_i) for the vector of outputs from the ith layer;
  • write O_i' \in \mathbb{R}^{n_i}= f_i'(I_i);
  • write W_i for the n_{i+1} \times n_i real matrix containing the weights of the edges going from L_i \to L_{i+1} (not including bias neurons), so that W_iO_i + b_i = I_{i+1}.

Feeding forward in vector notation

So to feedforward the input we have

x \mapsto W_1x +b_1 \mapsto f_2(W_1x+b_1) \mapsto W_2f_2(W_1x)+b_2 \mapsto \dots \mapsto f_n(W_{n-1} f_{n-1}( \dots W_2f_2(W_1x) )).

And we call this output f_N(x).

Backpropagation in vector notation

Now write E_i \in \mathbb{R}^{n_i} for the error at the layer i defined E_i:=\frac{\partial e}{\partial I_i} .

Then E_n = O_n' \cdot (O_n-y), and E_{i-1} = \left (W_{i-1}^TE_i\right ) \cdot O_{i-1}', where the multiplication \cdot is component wise.

Write \triangle W_i for the matrix with entries (\frac{\partial e}{\partial w_{ij}})_{ij}, for non-bias weights.

Then we have \triangle W_i = \cdot E_{i+1} \cdot O_i^T, the outer product of E_{i+1} and O_i.

And for the biases write \triangle b_i \in \mathbb{R}^{n_i} for the vector with jth component \frac{\partial e}{\partial (b_i)_j}.

Then \triangle b_i =E_{i+1}.

Python Implementation

We will create a class NeuralNetwork and perform our calculations using matrices in numpy.

import numpy as np

class NeuralNetwork(object):
    def __init__(self, X, y, parameters):
        #Input data
        #Output data
        #Expect parameters to be a tuple of the form:
        #    ((n_input,0,0), (n_hidden_layer_1, f_1, f_1'), ...,
        #     (n_hidden_layer_k, f_k, f_k'), (n_output, f_o, f_o'))
        self.n_layers = len(parameters)
        #Counts number of neurons without bias neurons in each layer.
        self.sizes = [layer[0] for layer in parameters]
        #Activation functions for each layer.
        self.fs =[layer[1] for layer in parameters]
        #Derivatives of activation functions for each layer.
        self.fprimes = [layer[2] for layer in parameters]

    def build_network(self):
        #List of weight matrices taking the output of one layer to the input of the next.
        #Bias vector for each layer.
        #Input vector for each layer.
        #Output vector for each layer.
        #Vector of errors at each layer.
        #We initialise the weights randomly, and fill the other vectors with 1s.
        for layer in range(self.n_layers-1):
            n = self.sizes[layer]
            m = self.sizes[layer+1]
            self.weights.append(np.random.normal(0,1, (m,n)))
        #There are only n-1 weight matrices, so we do the last case separately.
        n = self.sizes[-1]

    def feedforward(self, x):
        #Propagates the input from the input layer to the output layer.
        for i in range(1,self.n_layers):
        return self.outputs[-1]

    def update_weights(self,x,y):
        #Update the weight matrices for each layer based on a single input x and target y.
        output = self.feedforward(x)

        for i in xrange(n,0,-1):
            self.errors[i] = self.fprimes[i](self.inputs[i])*self.weights[i].T.dot(self.errors[i+1])
            self.weights[i] = self.weights[i]-self.learning_rate*np.outer(self.errors[i+1],self.outputs[i])
            self.biases[i] = self.biases[i] - self.learning_rate*self.errors[i+1]
        self.weights[0] = self.weights[0]-self.learning_rate*np.outer(self.errors[1],self.outputs[0])
        self.biases[0] = self.biases[0] - self.learning_rate*self.errors[1] 
    def train(self,n_iter, learning_rate=1):
        #Updates the weights after comparing each input in X with y
        #repeats this process n_iter times.
        for repeat in range(n_iter):
            #We shuffle the order in which we go through the inputs on each iter.
            for row in index:

    def predict_x(self, x):
        return self.feedforward(x)

    def predict(self, X):
        n = len(X)
        m = self.sizes[-1]
        ret = np.ones((n,m))
        for i in range(len(X)):
            ret[i,:] = self.feedforward(X[i])
        return ret

And we’re done! Now to test it we generate some synthetic data and supply the network with some activation functions.

def logistic(x):
    return 1.0/(1+np.exp(-x))

def logistic_prime(x):
    return ex/(1+ex)**2

def identity(x):
    return x

def identity_prime(x):
    return 1

First we will try to get it to approximate a sine curve.

#expit is a fast way to compute logistic using precomputed exp.
from scipy.special import expit
def test_regression(plots=False):
    #First create the data.
    #We make a neural net with 2 hidden layers, 20 neurons in each, using logistic activation
    param=((1,0,0),(20, expit, logistic_prime),(20, expit, logistic_prime),(1,identity, identity_prime))
    #Set learning rate.
    for rate in rates:
        N.train(4000, learning_rate=rate)
    import matplotlib.pyplot as plt
    fig, ax=plt.subplots(1,1)
    if plots:
        ax.plot(X,y, label='Sine', linewidth=2, color='black')
        for data in predictions:
            ax.plot(X,data[1],label="Learning Rate: "+str(data[0]))

When I ran this it produced the following:

Next we will try a classification problem with a nonlinear decision boundary, how about being above and below the sine curve?

def test_classification(plots=False):
    #Number samples


    #Samples for true decision boundary plot
    l = np.sin(L)

    #Data inputs, training
    X = np.random.uniform(0, 3*np.pi, size=(n,2))
    X[:,1] *= 1/np.pi
    X[:,1]-= 1

    #Data inputs, testing
    T = np.random.uniform(0, 3*np.pi, size=(n,2))
    T[:,1] *= 1/np.pi
    T[:,1] -= 1

    #Data outputs
    y = np.sin(X[:,0]) <= X[:,1]

    param=((2,0,0),(30, expit, logistic_prime),(30, expit, logistic_prime),(1,expit, logistic_prime))
    N=NeuralNetwork(X,y, param)
    N.train(n_iter, learning_rate)
    predictions_training= predictions_training <0.5
    predictions_training= predictions_training[:,0]
    predictions_testing= predictions_testing <0.5
    predictions_testing= predictions_testing[:,0]

    import matplotlib.pyplot as plt
    fig, ax=plt.subplots(2,1)

    #Training plot
    #We plot the predictions of the neural net blue for class 0, red for 1.
    ax[0].scatter(X[predictions_training,0], X[predictions_training,1], color='blue')
    not_index = np.logical_not(predictions_training)
    ax[0].scatter(X[not_index,0], X[not_index,1], color='red')
    ax[0].set_xlim(0, 3*np.pi)
    #True decision boundary
    ax[0].plot(L,l, color='black')
    #Shade the areas according to how to they should be classified.
    ax[0].fill_between(L, l,y2=-1, alpha=0.5)
    ax[0].fill_between(L, l, y2=1, alpha=0.5, color='red')

    #Testing plot
    ax[1].scatter(T[predictions_testing,0], T[predictions_testing,1], color='blue')
    not_index = np.logical_not(predictions_testing)
    ax[1].scatter(T[not_index,0], T[not_index,1], color='red')
    ax[1].set_xlim(0, 3*np.pi)
    ax[1].plot(L,l, color='black')
    ax[1].fill_between(L, l,y2=-1, alpha=0.5)
    ax[1].fill_between(L, l, y2=1, alpha=0.5, color='red')    


When I ran this it looked like this.
The top plot shows the performance on the training data, and the bottom the performance of on the testing data. The points are colored according to the net’s predictions, whereas the areas are shaded according to the true classifications.

As you can see neural networks are capable of giving very good models, but the number of iterations and hidden nodes may be large. Less iterations are possible using good configurations of the network i terms of sizes and numbers of hidden layers, and choosing the learning rate well, but it is difficult to know how to choose these well.

It ended up being quicker to choose a larger number of hidden nodes, a large number of iterations, and a small learning rate, than to experiment finding a good choice. In the future I may write about how to use momentum to speed up training.

The code for this post is available here, please comment if anything is unclear or incorrect!

Neural Networks Part 1: Feedingforward and Backpropagation


Last time we looked at the perceptron model for classification, which is similar to logistic regression, in that it uses a nonlinear transformation of a linear combination of the input features to give an output class.

A neural network generalizes this idea by conceiving of a network of nodes which each take as an input a linear combination of the outputs of other nodes, then use some activation function to turn this into an output, such as:

A Feedforward Neural Network

A neural network is just a directed graph. We will be looking at feedforward neural networks, which are acyclic.

Each node is called a neuron. There are three main kinds of neurons:

  • Input neurons, one neuron for each feature, so if our input were vectors in \mathbb{R}^3, we would have 3 input neurons. These neurons have trivial activation functions and no incoming edges.
  • Output neurons, one neuron for each output feature, so if our so if our output were vectors in \mathbb{R}^3, we would have 3 output neurons. These will typically have a nonlinear activation function, chosen so that the output range matches our training data. These neurons have no outgoing edges.
  • Hidden neurons are in between input and output neurons and have both incoming and outgoing edges.

Since the graph is acyclic, we can think of it as being composed of layers, as in the following picture taken from wikipedia:

although note there can be many hidden layers. Each edge has a weight, a real number.

The neural network makes a prediction by feeding forward the input. To describe this we develop some notation. The discussion that follows is indebted to this very helpful page.


Let N be a neural network and G index the vertices of N so that each N_i is a vertex for i \in G.

For each i \in G let T_i denote the set of indices of the targets of N_i. That is the set of indices X \subset G such for each j \in X there exists a directed edge e_{ij} from N_i \to N_j. We may also conflate T_i with the set of indexed neurons if the context makes this unambiguous.

Similarly, for each i \in G let S_i denote the set of indices of the sources of N_i.

For each index i, let f_i: \mathbb{R} \to \mathbb{R} denote the activation function of N_i.

For each index i, let O_i(x) \in \mathbb{R}:=f_i(x) be the output of N_i. When the argument x is understood, we may simply write O_i.

For each directed edge e_{ij} define w_{ij} \in \mathbb{R} to be the weight of the edge.

Where the outputs are understood, for each index i, define the input I_i to be \sum \limits _{j \in S_i}w_{ji}O_j, the weighted sum of the outputs of the previous layer.

Feeding Forward

Feeding forward refers to how an input becomes an output or prediction.

As you might have guessed: feeding forward an input vector x, the input and output to each input neuron is just the corresponding component of the vector x.

From there the input to a neuron in the first hidden layer is simply the sum of the components of x, weighted by the weights of the incoming edges to that hidden node.

You repeat this for each hidden node. Then the inputs to the hidden nodes are transformed via their activation functions to their outputs. The weighted sums of these outputs are then passed on as inputs to the next layer, and so on.

This is probably best understood through a simple example.

nn1Here we have 4 neurons. The input xis two dimensional, so we have two input neurons shown in blue. We have a single hidden layer with just one hidden neuron, shown in green. Finally we have a single output neuron shown in orange. Hence the output y is one dimensional.

Choosing the Weights: Backpropagation

So we have seen how a neural network can make predictions, but how can we choose the weights so that the predictions are any good? The answer is the same as in logistic regression, we use gradient descent. In order to do this we need to use differentiable activation functions and choose an appropriate error function to minimize.

If x \in \mathbb{R}^k is the input vector, and y \in \mathbb{R}^n is the target vector, let f_N: \mathbb{R}^k \to \mathbb{R}^n be the prediction of the neural network. Then we define the error to be e(x)= \frac{1}{2}(f_N(x)-y)^2. This is the natural Euclidean norm of the difference between the target and the prediction. The factor of a half is there for aesthetical reasons, since we will be differentiating and gaining a factor of 2.

The trick to the algorithm is to figure out how the weights contribute to the error, and backpropagation is an elegant way of assigning credit/blame to each node and edge. We fix an input x, and write e for e(x), and for each each edge e_{ij} we are seeking \frac{\partial e}{\partial w_{ij}}.  Once we have found this, we set w_{ij} \mapsto w_{ij} - \lambda \frac{\partial e}{\partial w_{ij}}, where \lambda is the learning rate.  So we are doing stochastic gradient descent as in the perceptron.

It is clear that an edge e_{ij} and its weight w_{ij} contribute to the error through the target N_j‘s input. So it makes sense to divide and conquer using the chain rule, that is

\frac{\partial e}{\partial w_{ij}} = \frac{\partial e}{\partial I_j} \cdot \frac{\partial I_j}{\partial w_{ij}}.

The first part of the product is worth giving a name,  we denote it by E_j:=\frac{\partial e}{\partial I_j} and call it the error at the node.

The second part of the product is simple to calculate:

\frac{\partial E_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum \limits _{k \in S_j}w_{kj}O_k = O_i

Returning to E_j, we can calculate it recursively, assuming the errors for the neurons in T_j have already been calculated.

Observe that e can be thought of as a function of the variables O_k for k indexing any layer, and so using the chain rule we have

E_j = \frac{\partial e}{\partial I_j}

= \sum \limits_{k \in T_j} \frac{\partial e}{\partial I_k}\cdot \frac{\partial I_k}{\partial I_j}

= \sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial I_j}

=\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j} \cdot \frac{\partial O_j}{\partial I_j}

=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j}

=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \left ( \frac{\partial}{\partial O_j} \sum \limits _{l \in S_k}w_{lk}O_l \right )

=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot w_{jk}

notice that if we assume that each layer is fully-connected to the next, then we could write the above succinctly as an inner product, and inner products can be calculated very efficiently, for instance using a GPU, so it is often faster to have more connections!

Together this gives

\frac{\partial e}{\partial w_{ij}}=O_i \cdot f_j'(I_j)\cdot\sum \limits_{k \in T_j} E_k \cdot w_{jk}.

Since we have calculated the error at a node recursively, we had better deal with the base case –  the output neurons. But this follows simply from the definition of  e to be E_i=f'(I_i)(O_i-y_i), with the factor of 2 cancelling with the half as promised.

Bias Neurons

The final piece of the puzzle are called bias neurons, which I had left out before to simplify the discussion. Each layer, except the output layer, has a bias neuron. A bias neuron has no inputs, and its output is always 1. Effectively it just adds the weight of its outgoing edges to the inputs of the next layer.

These behave just like regular neurons, only when updating their weights, we note that since their input is effectively constant, they have \frac{\partial I_j}{\partial w_{ij}}=1, and so we need only calculate the error at a bias neuron to update its weights.

Next Time

In the next installment I will show how to compactly represent the process of feeding forward and backpropagation using vector and matrix notation assuming the layers are fully connected. We will then use this to give a Python implementation.

Enter the Perceptron


Some time ago in my post on logistic regression, I talked about the problem where you have a n \times p matrix X representing n data points with p features, and a vector y \in \{0,1 \}^n of classifications.

In linear regression you would produce a vector \beta \in \mathbb{R}^p so that X\beta \approx y, but the problem here is that our output could be any real number, not \{0,1\}. The solution we investigated was using the logistic link function p(x) = \frac{1}{1+\exp(-x)} which turns any real number into something we interpret as a probability.

That was one solution, another even simpler one is the function defined s(x) = \left \{\begin{array}{cc} 1 & x \ge 0 \\ -1 &x<0\end{array} \right..  It is convenient in this case to think of the response as taking values in \{-1,1 \} not \{0,1\}.

Previously we added a column of ones to X in order to create an intercept term, here we will write it separately so so that X \mapsto X\beta + b where b \in \mathbb{R}^n is called the bias.

This post will take the book ‘The Elements of Statistical Learning’ as its guide (the section starting on p. 130), which is freely available online, but there is also a great blog post about it here.

The Algorithm

So define f(x):=\beta^Tx+b, we classify a point as being in the positive class if f(x)>0 and negative if f(x) <0. Geometrically we have an affine hyperplane L:=\{x:f(x)=0 \}, and points are classified depending on whether they are ‘above’ or ‘below’ the hyperplane.

The unit normal u to L is given by \beta/ \Vert \beta \Vert, points above L then mean points with positive projection onto this vector.

The signed distance of a point x to L is found by projecting the vector x-x_0 onto u, where x_0 is the ‘origin’ of the hyperplane (since an affine hyperplane is a translation of a linear subspace, there is a point corresponding to the translation of the origin). So we get

u^T(x-x_0)= \beta^T/ \Vert \beta \Vert (x-x_0) = (\beta^Tx+b)/ \Vert \beta \Vert = f(x)/ \Vert \beta \Vert = f(x)/ \Vert f'(x) \Vert,

notice this means that f(x) is proportional to the signed distance of x from L.

If M is the set of misclassified points, then we can measure the error as the sum of the distances (up to a constant of proportionality), of the misclassified points to L, that is

D(\beta, b):=-\sum \limits _{x \in M}y_x f(x),

where y_x is the response of x, notice that this makes each summand negative, which is reversed by the minus sign outside.

If we fix M we can differentiate with respect to \beta and b, giving

\frac{\partial}{\partial \beta}D(\beta, b)= -\sum \limits _{x \in M}y_xx,


\frac{\partial}{\partial b}D(\beta, b)= -\sum \limits _{x \in M}y_x.

Just like in logistic regression, we can use these partial derivatives to search for a minimum of D. The algorithm purposed by Frank Rosenblatt, the inventor of the perceptron, is to modify the hyperplane after looking at each individual misclassified item in M as it comes, that is (\beta, b) \mapsto (\beta,b) + \lambda(y_xx,y_x), where \lambda >0 is the learning rate. This has the advantage of being much faster than computing the global gradient informed by all of the data, and is an example of stochastic gradient descent.

Of course by changing the hyperplane as we go we may change M, and so what we actually do is iterate over each datapoint in a fixed order, and check if it is misclassified under the latest hyperplane, then update and move onto the next point. This is then very easy to code.

The Code

First let’s recycle some code we used for logistic regression to get some test data.

import numpy as np
from scipy import linalg
import random as rd
import matplotlib.pyplot as plt
def test_set(a,b, no_samples, intercept, slope, variance, offset):
    #First we grab some random x-values in the range [a,b].
    X1 = [rd.uniform(a,b) for i in range(no_samples)]
    #Sorting them seems to make the plots work better later on.
    #Now we define our linear function that will underly our synthetic dataset: the true decision   boundary.
    def g(x):
        return intercept + slope*x
    #We create two classes of points R,B. R points are given by g(x) plus a Gaussian noise term with positive mean,
    #B points are given by g(x) minus a Gaussian norm term with positive mean.
    X2=[] #This will hold the y-coordinates of our data.
    Y=[]  #This will hold the classification of our data.
    for x in X1:
        if R_turn:
            x2 += rd.gauss(offset, variance)
            x2 -= rd.gauss(offset, variance)
        R_turn=not R_turn
    #Now we combine the input data into a single matrix.
    X1 = np.array([X1]).T
    X2 = np.array([X2]).T
    X = np.hstack((X1, X2))
    #Now we return the input, output, and true decision boundary.
    return [ X,np.array(Y).T, map(g, X1)]

Fairly straightforward stuff I hope! If not then it should make sense when we plot it.

X,Y,g = test_set(a=0, b=5, no_samples=200, intercept=10, slope=0, variance=0.5, offset=1)
fig, ax = plt.subplots()
R =  X[::2] #Even terms
B =  X[1::2] #Odd terms

ax.scatter(R[:,0],R[:,1],marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:,0],B[:,1],marker='o', facecolor='blue', label="Outcome 0")
ax.plot(X[:,0], g, color='green', label="True Decision Boundary")
ax.legend(loc="upper left", prop={'size':8})

This will produce something like:


Now to implement the algorithm.

def perceptron(X,y, learning_rate=0.1, max_iter=1000):
    #Expecting X = n x m inputs, y = n x 1 outputs taking values in -1,1
    weight = np.zeros((m))
    bias = 0
    index = list(range(n))
    while n_iter <= max_iter:
        for row in index:
            if (X[row,:].dot(weight)+bias)*y[row] <=0: #Misclassified point.
                weight += learning_rate*X[row,:]*y[row]
                bias += learning_rate*y[row]
    return {'weight':weight, 'bias':bias}

we shuffle the order we take the data points in at each step to get rid of any peculiarities in the order we update. Now let’s try it on our test data.

result= perceptron(X,Y,0.1)
def decision_boundary(x, weight, bias):
    return (-bias - x*weight[0])/weight[1]

fig, ax = plt.subplots()
R =  X[::2] #Even terms
B =  X[1::2] #Odd terms

ax.scatter(R[:,0],R[:,1],marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:,0],B[:,1],marker='o', facecolor='blue', label="Outcome -1")
ax.plot(X[:,0], g, color='green', label="True Decision Boundary")
ax.plot(X[:,0], [decision_boundary(x, result['weight'], result['bias']) for x in X[:,0]],
        label='Predicted Decision Boundary')
ax.legend(loc="upper left", prop={'size':8})


perceptron_test_resultsNot bad! Get the code here.

My First Genetic Algorithm

‘Genetic’ is one of those words that make things cool, some others are topological, chaos or chaotic, fractal, quantum and recursive. That was not intended to be an exhaustive list. So for a while I have wanted to have a go at making a genetic algorithm and I have finally gotten around to it.


How they work

First let me refer you to the wiki, then I’ll quickly talk about the things I think are important. The main idea is to use an search algorithm inspired by evolution to solve a problem.

There are four core ideas.

Parametrization of Solution

To begin with we assume that a solution takes the form of a binary vector of length m. Each component of a solution vector can be thought of as a gene, which can be on or off (1 or 0 resp.).

The vectors I will call individuals, or creatures, or candidate solutions, or vectors. The population is a set of individuals, to begin with we initialize the population randomly or however you like.


We also assume that we have a function f: 2^n: \to \mathbb{R}, this measures an individual’s fitness. Fitness is how well the solution, parametrized by the binary vector, solves your problem. So we will assume that bigger numbers mean better solutions.


Once we have a population, we assess their fitness.  Now we want to create new individuals, hopefully with better fitness. We do this by combing different parent solutions to create child solutions. One simple way to do this is called uniform crossover , a function taking solutions to solutions is called a genetic operator, a binary genetic operator is a crossover. Uniform crossover simply means that the child’s ith gene is equal to the father’s or mothers with equal probability (alert! hetronormative language in use! please evacuate the building! ). But not everybody get’s to be a parent, ideally you want the chance that a creature has a child to depend in some way on its fitness.

One way of selecting parents is called tournament selection. The idea is that you grab k creatures from your population, and the one with the highest fitness is selected. The parameter k then controls selection pressure.


‘Mutation’ also happens to be one of the cool words. In this context it is an unary genetic operator, that we think of as perturbing a solution vector. There are many ways to do this, perhaps to the simplest is to flip each gene from on to off or vice versa with probability p called the mutation rate.

The problem with this can be that it encourages genes to be on where there are few on and off when there are many on. If the number of on genes is part of what is interesting, say if your vector represented variable selection in a machine learning model, then you do not want the mutation to be biased towards a certain number of on genes.

This means that ideally you should use a kind of mutation suitable for your problem, but even if you don’t make the best choice, natural selection will combat biases in the mutation, more or less successfully.

Since I want the expected number of on genes to be the same after mutation, my approach is to, with probability p pick an on gene at random and turn it off and with probability p pick an off gene and turn it on, repeated n times.

Ok I have talked enough about the theory, let’s make one! First we need a problem to solve.

The Problem

I will randomly choose an integer k between 0 and 2^n. Your job is to guess what k is.

The information you are given is that after each guess \phi, I will select N random integers from the same range,  write it as a binary vector, and then I will then tell you how often either:

1. The random integer is both \le \phi and \le k coordinate-wise; or

2. The random integer is both not \le \phi and  not  \le k coordinate-wise.

So I will tell you an integer from 0, \dots, N that describes the size of the subset for which \phi, k are upper bounds and the size of the subset for which \phi, k are lower bounds.

The Code

You can view it statically here, or download the notebook and play with it yourself, or just get the source – happy evolutions!

Creating Dummy Variables in Pandas

Last time we implemented logistic regression, where the data is in the form of a numpy array. But what if your data is not of that form? What if it is a pandas dataframe like the Kaggle Titanic data?

Specifically the problem is variables like ‘Title’ where we have four strings ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’ as values. The solution is to create a column for each value, for example ‘Title_Mr’, with values 0,1 depending on whether the data point has that value. Reading Wes McKinney’s book again, he explains what to do, and so I have created a handy little function using his example.

It will accept as input a dataframe, and a dictionary telling which variables are nominal. It will then replace each nominal variable in your dataframe with a set of dummy columns, and also update your data type dictionary. Simple but effective.

def dummy_variables(data, data_type_dict):
    #Loop over nominal variables.
    for variable in filter(lambda x: data_type_dict[x]=='nominal',

        #First we create the columns with dummy variables.
        #Note that the argument 'prefix' means the column names will be
        #prefix_value for each unique value in the original column, so
        #we set the prefix to be the name of the original variable.
        dummy_df=pd.get_dummies(data[variable], prefix=variable)

        #Remove old variable from dictionary.

        #Add new dummy variables to dictionary.
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'

        #Add dummy variables to main df.
        data=data.drop(variable, axis=1)

    return [data, data_type_dict]

Easy. Now once everything is numeric, we can as use np.asarray to cast our dataframe to a numpy array.