Home » machine learning

# Category Archives: machine learning

## Support Vector Machines and the Kernel Trick

Today we will be discussing a machine learning algorithm that is probably the most mathematically sophisticated of all the algorithms we have looked at so far.

Don’t let that daunt you though! Not all the details are essential for understanding the general idea, and it can certainly be used without understanding the maths, although if you are content with this then there’s not much point reading a blog like this!

This post is based on what I have learned from reading the relevant chapter in the excellent book: Elements of Statistical Learning.

The Setup

The premise is exactly the same as the perceptron algorithm, we have input data $x_1, \dots , x_N \in \mathbb{R}^k$ with class labels $y_1, \dots, y_N \in \{-1,1\}$.

In order to classify the inputs, we seek a seperating hyperplane $H$. Recall that a hyperplane is just a plane that may not pass through the origin, and can be described as

$H= \{ x \in \mathbb{R}^k : \langle x, \beta \rangle =b \}$ for some $\beta \in \mathbb{R}^k$ and $b \in \mathbb{R}$.

It is handy to assume that $\beta$ is a unit vector, and geometrically it is the normal vector to the hyperplane. I will refer to $b$ as the bias. If the bias is zero, then $H$ is just a regular plane, and if it is nonzero $H$ is parallel to this plane, the bias controls how far ‘above’ or ‘below’ the origin $H$ is. In fact if $H_0 =\{x: \langle x,\beta \rangle =0 \}$, then $H=H_0 + b\beta$.

If we define

$f: \mathbb{R}^k \to \mathbb{R}$

by

$f(x) = \langle x,\beta \rangle - b$,

then $H = f^{-1}(0)$.

Moreover, $f(x)$ is the signed distance of $x$ from $H$. Note that if $\beta$ was not assumed to be a unit vector, it would instead be proportional to the signed distance. If $f(x) >0$ we say that $x$ is above $H$, and if $f(x) < 0$ we say that $x$ is below $H$.

To classify a point $x$, we say it has label $1$ if above $H$, and $-1$ if below.

The Problem(s)

If this is possible, then the perceptron algorithm will converge to a hyperplane that correctly classifies all of the points.

One problem is that there are infinitely many such hyperplanes, intuitively we would like to choose one to maximize the distances of all the points from it. If there are points close to the decision boundary, then it is likely that unseen points from the same distribution might up on the wrong side of the hyperplane and be misclassified. The minimum distance of all the points to $H$ is called the margin of the decision boundary, so we would like to maximise the margin.

Another problem is that the data might be too mixed up for there to be a seperating hyperplane, no matter which one you choose some points get misclassified. What we need is a way of choosing the hyperplane that does the best job of trying to seperate the points.

Finally it may be that a hyperplane is a bad approximation to the true decision boundary, we need something wiggilier.

Let’s solve these problems in turn.

Maximizing the Margin

The first step in solving a machine learning problem is to figure out how to phrase it as an optimization problem, so let’s try that.

Leting $M$ denote the margin, we would like to

maximize $M$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge M$ for $i=1, \dots, N$, $M>0$ and $\Vert \beta \Vert =1$.

Note that $y_i f(x_i) >0$ implies that $x_i$ is correctly classified since their signs must be the same for the product to be positive.

Now we just do a litle finessing, notice that

$y_if(x_i) \ge M$, ie $y_i(\langle x_i, \beta \rangle -b) \ge M$

is the same as

$y_i(\langle x_i, \frac{\beta}{M} \rangle -\frac{b}{M}) \ge 1$,

and so the optimization problem is equivalent to:

maximize $M$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$, $M>0$ and $\Vert \beta \Vert =1/M$,

and so maximizing $M$ is the same as minimizing the norm of $\beta$ in this formulation, giving:

minimize $\Vert \beta \Vert$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$.

Finally minimizing the norm is the same as minimizing half the square of the norm, and the latter is much more convienient since it doesn’t have any of those pesky square roots:

minimize $\frac{1}{2}\Vert \beta \Vert^2$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$.

Now that we have the optimization problem setup, we use the Karush-Kuhn-Tucker conditions, normally when trying to minimize a function you look for places where the derivative is zero, and this the extension of that idea when you have constraints.

The conditions firstly say that

$\beta= \sum \limits _{i=1}^N \mu_i y_i x_i$,

and

$\sum \limits _{i=1}^n y_i \mu_i =0$,

where

$\mu_i \ge 0$ for $i=1, \dots, N$

and

$\mu_i y_i(1-\langle x_i, \beta \rangle)=0$ for $i=1, \dots, N$.

The last condition shows that $\mu_i > 0$ only for those $x_i$ such that $\langle x_i, \beta \rangle=1$, these are the $x_i$ that are ‘on the margin’, those input vectors closest to the optimal hyperplane.

This means that $\beta = \sum \limits _{i=1}^N \mu_i y_i x_i$ is determined only by these $x_i$ on the margin, we call these the support vectors.

The optimization problem can be solved using quadratic programming techniques, it is also sometimes useful to look at the equations given by the KKT conditions and convert it into the ‘dual’ optimization problem in terms of the $\alpha_i$.

Compromising

Life isn’t always easy, and sometimes the data is not linearly seperable. It is desirable that our algorithm be able to tolerate some errors and try to minimize them, otherwise, even if the data are linearly seperable, the algorithm may fall over itself trying to fit a few bad apples in the data, rather than going for a more natural fit that will hopefully generalize better.

We do this by introducing slack variables $\varepsilon_i \ge 0$ for $i=1, \dots, N$ and changing our constraint from

$y_if(x_i) \ge M$

to

$y_if(x_i) \ge M(1-\varepsilon_i)$,

and as before this changes to

$y_i f(x_i) \ge 1 -\varepsilon_i$.

Of course we want to minimize the errors, and so we can recast our optimization problem as

minimize $\frac{1}{2} \Vert \beta \Vert^2+ C\sum \limits_{i=1}^N\varepsilon_i$,

subject to $y_if(x_i) \ge 1 - \varepsilon_i$, $\varepsilon_i \ge 0$ for $i=1, \dots, N$,

where $C$ is a cost parameter that we need to choose, controlling our tolerance for mistakes.

The KKT conditions for this new problem are

$\beta = \sum \limits _{i=1}^N \alpha_i y_i x_i$,

$\sum \limits _{i=1}^n y_i \alpha_i =0$,

$\alpha_i = C - \mu_i$,

where $\mu_i \ge 0$ for $i=1, \dots, N$.

Instead of trying to minimize an objective function subject to several complex constraints, it is often easier to collect the constraints into the Lagrangian and minimize that, subject to simpler constraints, this is called the penalty method.

In our case the Lagrangian is

$L_P(\beta, b, \varepsilon, \alpha, \mu) = \frac{1}{2}\Vert \beta \Vert^2 + C \sum \limits _{i=1}^N \varepsilon_i - \sum \limits _{i=1}^n \mu_i \varepsilon_i - \sum \limits _{i=1}^N \alpha_i(y_i(x_i^T\beta+b)-(1-\varepsilon_i))$,

and minimizing our objective function is equivalent to minimizing $L_P$ with respect to $\beta, b, \varepsilon$ whilst at the same time maxmizing $L_P$ with respect to $\mu, \alpha$. This is why this is called a saddle-point method.

The constraints for this equivalent problem are $\alpha, \mu \ge 0$ and $\sum \limits_{i=1}^N \alpha_i y_i=0$, the others are taken care of automagically.

Now we can substitute the KKT condition equations above to simplify $L_P$, giving the dual Lagrangian

$L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$.

So our new problem is

maximize $L_D$

subject to $\mu \ge 0, \alpha \ge 0$ and $\sum \limits _{i=1}^N \alpha_i y_i=0$,

but $\mu$ doesn’t appear in $L_D$! But then we notice that the KKT condition

$\alpha_i = C -\mu_i$

is equivalent to constraining $\alpha_i \in [0, C]$.

So our final formulation is

maximize $L_D$

subject to $0 \le \mu \le C$ and $\sum \limits _{i=1}^N \alpha_i y_i=0$.

A much simpler formulation! In fact this is a convex quadratic programming problem, so it can be solved using out of the box methods from a library.

The Kernel Trick

We’ve now seen how to cope with errors by using a cost parameter $C$, this is a form of regularization that can control overfitting. But what if the true decision boundary is geninuely something a bit more complex than a hyperplane? The solution to this problem is to change coordinates from $\mathbb{R}^k \to \mathbb{R}^m$ using some mapping $T$, then fit a hyperplane $H$ in this new space, then the pullback $T^{-1}H$ can be a nonlinear surface in $\mathbb{R}^k$.

This is of course a standard practice for any machine learning algorithm, feature engineering. What is special about support vector machines is that the function to be minimized is

$L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$,

notice that you can compute this using only inner products in whatever coordinate system you use. This means that if you have a function

$k: \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$

such that

$k(x,y) = \langle T(x), T(y) \rangle$,

then there is no need to work directly in the transformed space at all! In fact you don’t even need to know what $T$ is, as long as you have $k$.

This is a big deal because this can be much more computationally efficient if you want to transform to a very high dimensional space. This is what is called the kernel trick, because we get to have the benefits of a richer feature space on the cheap.

So in order to do this, you just have to dream up a kernel $k$, to make sure that your $k$ actually defines an inner product in some (Hilbert) space, by Mercer’s Theorem what you need to check is that, for any choice of vectors $x_1, \dots, x_n$, the matrix $[k(x_i, x_j)]_{i,j}$ is positive semi-definite.

Some common choices are:

dth-degree polynomial kernel: $k(x,y)=(1+\langle x,y \rangle)^d$,

radial basis kernel: $k(x,y) = \exp(\gamma \Vert x-y \Vert^2)$.

I have decided not to do a python implementation this time, because I would have to get in to quadratic programming to do it properly, perhaps I will write a post on this topic in the future.

## Theano, Autoencoders and MNIST

Recently I have finally upgraded my ancient laptop. Other than the ability for me to play the occasional video game, this means that I now have a dedicated Nvidia graphics card, in particular one that supports something called CUDA.

Now that I have the hardware, I thought that it would be fun to check out this python library theano. As I understand it, it has two very interesting aspects.

One is that it does symbolic computation, that is it represents variables as in mathematics, and their manipulations.

A variable in computer science is something like a integer or a string or  a list etc…, a piece of information stored somewhere in memory with a name made up by the programmer, and the compiler takes care of associating the human understandable name to the address of the information in memory.

A variable in mathematics is like $x$ in $x^2 + x + 1$, an indeterminate quantity that can be manipulated using arithmetic operations, or using calculus and so on. So what in mathematics we call a variable corresponds roughly to a function in computer science.

The difference between symbolic computation and its variables, and regular functions, is that the programming language is aware of the mathematical nature of a variable, such that when you apply operations to variables to create new variables, it does this in an efficient way. The best way to understand this is a simple example. Let’s suppose we have two functions $f(x) =x, g(x) = -x$. Now we can combine these functions by adding them to make a new function $h(x) = f(x) + g(x)$.

Now in python say, when you call $h$ with an argument, it will pass this argument to both $f$ and $g$ and then add the result and return it.

If we are doing symbolic computation, and $f,g$ were variables with $f=-g$, then the variable $h=f+g=0$ constantly. And so the computer does no work at all when using $h$, because it knows that it must always be zero.

So one important aspect is that symbolic computation can simplify and hence often optimize functions automatically.

Another is that theano can automagically differentiate variables with respect to variables! This might not sound like a big deal, but it is. In machine learning, we often create a complicated function, like in logistic regression or neural networks, and then derive a helpful form for the derivative to do gradient descent to optimize it. In theano, all you have to do is specify the objective function, and it will let you do gradient descent automatically.

The second amazing thing about theano is that it compiles its theano functions into C, and so it often runs much much faster than native python or numpy. It can also make it easier to run computations on a gpu which can mean a 100x speedup for the right cases!

Hopefully I’ve made you curious, so  click here for an installation guide. On this linux machine it was very simple. I am dual booting windows 8.1 and Ubuntu at the moment, it is very tricky to get the gpu aspect to work on windows, but here is a guide for the brave. To check that the GPU aspect is working, run the test code here.

Introduction to Theano

I recommend that you read the tutorials here, and also this excellent ipython worksheet here.

Let’s little play with the basic functionality.

import theano as th
from theano import tensor as T

#To make a variable, specify the type of variable from the tensor
#library.

#We can have scalars.
x=T.dscalar(name='scalar_x')

#Note that the name is just a string that theano will use
#to communicate to us about the variable.
print x
print x.type
#We can also have vectors.
v = T.dvector(name='vector_v')
print v
print v.type
#And matrices.
A = T.dmatrix(name='matrix_A')
print A
print A.type

#We can also make new variables using standard mathematical operations.
x_2 = x*x
print x_2
print x_2.type

#We can also make python functions which return new variables.
def power(variable, n):
return variable**n
x_10 = power(x,10)
print x_10
print x_10.type

#We can of course do standard linear algebra operations also.
Av = T.dot(A,v)
print Av
print Av.type

#Of course when a variable is actually evaluated, you must
#ensure that the dimensions are correct, or you will
#get an error.

#To see the value of a variable for a particular value of the variables
#comprising it, we make a theano function.
f = th.function([A,v], [Av])
#The syntax is a list of input variables, and then a list of output variables.

#The python code takes a little longer to run initially, because thenao
#compiles the function into C, but thereafter it will run extremely fast.

#Let's try using the function.
import numpy as np
m=np.ones((2,2))
print m
vec = np.asarray([1,2])
print vec
print f(m,vec)

#Now we can try computing gradients.
#First let's make a scalar variable by taking an inner product.
w=T.dvector(name='w_vector')
vTw = T.dot(v,w)

#Now we take the gradient with respect to w.

#Now let's test it.
vec1 = np.asarray([1,2])
vec2 = np.asarray([0,0])

#To evaulate a variable given inputs, there is another syntax
#in additon to creating a thenano function.

#Next we discuss theano's shared variables,
#these differ from regular theano variables in that
#theano variables only have a value within a theano function
#whereas shared theano variables have a value independent of
#being called in a function.
w=T.shared(name='shared_matrix', value=np.ones((2,3)))
print w
print w.type
print w.get_value()

#You can also set a shared variable's value.
w.set_value(np.zeros((2,3)))
print w.get_value()

#You can also have theano functions update shared variables,
#we illustrate this with a silly example that updates a matrix
#like in gradient descent.
x=T.dvector('x')
wx=T.dot(w,x)
cost = wx.mean()

#Notice the syntax of updates argument, should be a list
#of two tuples of the form: (variable_to_be_updated, updated_variable).
print f([1,1,1])
print w.get_value()


If you want to tinker with this, get the sourcecode here.

Autoencoders

I don’t want to go into too much detail here, but an autoencoder is a feedforward neural network with a single hidden layer with the twist being that its target output is exactly the same as its input. If the number of hidden neurons is less than the number of input/output neurons, then the network will be forced to learn some sort of compression.

Even when the number of hidden neurons is large, there are various tricks in cost functions you can use to ensure that the network does something more interesting than learning the identity function.

The hope of an autoencoder is that it learns some structure in the data, so that the hidden layer can be thought of as a feature space that one can use for a supervised learning algorithm.

Basically then we have $x \mapsto f_1(W_1x+b_1) =h$ the hidden layer, then $h \mapsto f_2(W_2h+b_2)$, where $b_1, b_2$ are bias vectors, $W_1, W_2$ are matrices, and $f_1, f_2$ are element-wise non-linearities like the logistic function.

A simplification that is often made is to enforce $W_1 = W_2^T$. This gives the model less free parameters to learn, and so is a form of regularization. It also uses less memory. This is referred to as an autoencoder with tied weights.

Autoencoders with Theano

Thanks to automatic differentiation, backprogation is effortless with theano, and thanks to gpu computation, much faster also! I recommend having a look at my python implementation of backpropagation just so you can see the effort saved. This implementation benefited greatly from this tutorial.

import numpy as np
import theano as th
from theano import tensor as T
from numpy import random as rng

class AutoEncoder(object):
def __init__(self, X, hidden_size, activation_function,
output_function):
#X is the data, an m x n numpy matrix
#where rows correspond to datapoints
#and columns correspond to features.
assert type(X) is np.ndarray
assert len(X.shape)==2
self.X=X
self.X=th.shared(name='X', value=np.asarray(self.X,
dtype=th.config.floatX),borrow=True)
#The config.floatX and borrow=True stuff is to get this to run
#fast on the gpu. I recommend just doing this without thinking about
#it until you understand the code as a whole, then learning more
#about gpus and theano.
self.n = X.shape[1]
self.m = X.shape[0]
#Hidden_size is the number of neurons in the hidden layer, an int.
assert type(hidden_size) is int
assert hidden_size > 0
self.hidden_size=hidden_size
initial_W = np.asarray(rng.uniform(
low=-4 * np.sqrt(6. / (self.hidden_size + self.n)),
high=4 * np.sqrt(6. / (self.hidden_size + self.n)),
size=(self.n, self.hidden_size)), dtype=th.config.floatX)
self.W = th.shared(value=initial_W, name='W', borrow=True)
self.b1 = th.shared(name='b1', value=np.zeros(shape=(self.hidden_size,),
dtype=th.config.floatX),borrow=True)
self.b2 = th.shared(name='b2', value=np.zeros(shape=(self.n,),
dtype=th.config.floatX),borrow=True)
self.activation_function=activation_function
self.output_function=output_function

def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.1):
index = T.lscalar()
x=T.matrix('x')
params = [self.W, self.b1, self.b2]
hidden = self.activation_function(T.dot(x, self.W)+self.b1)
output = T.dot(hidden,T.transpose(self.W))+self.b2
output = self.output_function(output)

#Use cross-entropy loss.
L = -T.sum(x*T.log(output) + (1-x)*T.log(1-output), axis=1)
cost=L.mean()

#Return gradient with respect to W, b1, b2.

#Create a list of 2 tuples for updates.
for param, gparam in zip(params, gparams):

#Train given a mini-batch of the data.
givens={x:self.X[index:index+mini_batch_size,:]})

import time
start_time = time.clock()
for epoch in xrange(n_epochs):
print "Epoch:",epoch
for row in xrange(0,self.m, mini_batch_size):
train(row)
end_time = time.clock()
print "Average time per epoch=", (end_time-start_time)/n_epochs

def get_hidden(self,data):
x=T.dmatrix('x')
hidden = self.activation_function(T.dot(x,self.W)+self.b1)
transformed_data = th.function(inputs=[x], outputs=[hidden])
return transformed_data(data)

def get_weights(self):
return [self.W.get_value(), self.b1.get_value(), self.b2.get_value()]


And that is it! Now we should test this puppy out, so let’s find some data.

MNIST

The MNIST data is famous in machine learning circles, it consists of single handwritten digits. Nearly every paper written on neural networks and so on tests their contribution on the data. The task is to classify the digits, but we will just test our autoencoder. First we will steal a very nice helper function to load the data set, this will download it for you if you don’t have it already. You can find it here.

import cPickle
import gzip
import os


''' Loads the dataset

:type dataset: string
:param dataset: the path to the dataset (here MNIST)
'''

#############
# LOAD DATA #
#############

# Download the MNIST dataset if it is not present
data_dir, data_file = os.path.split(dataset)
if data_dir == "" and not os.path.isfile(dataset):
# Check if dataset is in the data directory.
new_path = os.path.join(os.path.split(__file__)[0], "..", "data", dataset)
if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz':
dataset = new_path

if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
import urllib
origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
urllib.urlretrieve(origin, dataset)

# Load the dataset
f = gzip.open(dataset, 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
#train_set, valid_set, test_set format: tuple(input, target)
#input is an numpy.ndarray of 2 dimensions (a matrix)
#which row's correspond to an example. target is a
#numpy.ndarray of 1 dimensions (vector)) that have the same length as
#the number of rows in the input. It should give the target
#target to the example with the same index in the input.

return (train_set, valid_set, test_set)

path="/home/harri/Dropbox/Work/Blogs/triangleinequality/Theano/data/mnist.pkl.gz"



We also want a plotting function to visualize the features learned, I won’t explain this here, but if you google “visualizing features of neural network” you can learn plenty about it.

def plot_first_k_numbers(X,k):
from matplotlib import mpl,pyplot
m=X.shape[0]
k=min(m,k)
j = int(round(k / 10.0))

fig, ax = pyplot.subplots(j,10)

for i in range(k):

w=X[i,:]

w=w.reshape(28,28)
ax[i/10, i%10].imshow(w,cmap=pyplot.cm.gist_yarg,
interpolation='nearest', aspect='equal')
ax[i/10, i%10].axis('off')

pyplot.tick_params(\
axis='x',          # changes apply to the x-axis
which='both',      # both major and minor ticks are affected
bottom='off',      # ticks along the bottom edge are off
top='off',         # ticks along the top edge are off
labelbottom='off')
pyplot.tick_params(\
axis='y',          # changes apply to the x-axis
which='both',      # both major and minor ticks are affected
left='off',
right='off',    # ticks along the top edge are off
labelleft='off')

fig.show()


Now we put it together.

def m_test(data):
X=data[0][0]
activation_function = T.nnet.sigmoid
output_function=activation_function
A = AutoEncoder(X, 500, activation_function, output_function)
A.train(20,20)
W=np.transpose(A.get_weights()[0])
plot_first_k_numbers(W, 100, False)
plot_first_k_numbers(W,100, True)

m_test(data)


Running this, with 500 hidden units for 20 epochs with minibatch size of 20, the following features were learned:

Nothing too interesting yet, in order to get more useful features we could change the cost function to encourage sparsity, or introduce noise into the inputs, or use a different activation function. Hopefully this has been helpful for those just starting out with theano. Here is the source code.

## Neural Networks Part 2: Python Implementation

Ok so last time we introduced the feedforward neural network. We discussed how input gets fed forward to become output, and the backpropagation algorithm for learning the weights of the edges.

Today we will begin by showing how the model can be expressed using matrix notation, under the assumption that the neural network is fully connected, that is each neuron is connected to all the neurons in the next layer.

Once this is done we will give a Python implementation and test it out.

Matrix Notation For Neural Networks

Most of  this I learned from here.

In what follows, vectors are always thought of as columns, and so the transpose a row.

So first off we have $x \in \mathbb{R}^k$, our input vector, and $y \in \mathbb{R}^m$ our output vector.

Our neural network $N$ will have $N$ layers $L_1, \dots , L_n$ with $L_1$   the input layer, and $L_n$ the output layer.

For each layer $L_i$:

• write $n_i:=|L_i|$ for the number of neurons in that layer (not including the bias neurons);
• write $f_i: \mathbb{R}^{n_i} \to \mathbb{R}^{n_i}$ for the layer’s activation function;
• write $f_i': \mathbb{R}^{n_i} \to \mathbb{R}^{n_i}$ for vector of derivatives of the activation functions;
• write $I_i \in \mathbb{R}^{n_i}$ for the vector of inputs to the $ith$ layer;
• write $b_i \in \mathbb{R}^{n_{i}}$ for the vector of biases;
• write $O_i \in \mathbb{R}^{n_i}= f_i(I_i)$ for the vector of outputs from the $ith$ layer;
• write $O_i' \in \mathbb{R}^{n_i}= f_i'(I_i)$;
• write $W_i$ for the $n_{i+1} \times n_i$ real matrix containing the weights of the edges going from $L_i \to L_{i+1}$ (not including bias neurons), so that $W_iO_i + b_i = I_{i+1}$.

Feeding forward in vector notation

So to feedforward the input we have

$x \mapsto W_1x +b_1 \mapsto f_2(W_1x+b_1) \mapsto W_2f_2(W_1x)+b_2 \mapsto \dots \mapsto f_n(W_{n-1} f_{n-1}( \dots W_2f_2(W_1x) ))$.

And we call this output $f_N(x)$.

Backpropagation in vector notation

Now write $E_i \in \mathbb{R}^{n_i}$ for the error at the layer $i$ defined $E_i:=\frac{\partial e}{\partial I_i}$.

Then $E_n = O_n' \cdot (O_n-y)$, and $E_{i-1} = \left (W_{i-1}^TE_i\right ) \cdot O_{i-1}'$, where the multiplication $\cdot$ is component wise.

Write $\triangle W_i$ for the matrix with entries $(\frac{\partial e}{\partial w_{ij}})_{ij}$, for non-bias weights.

Then we have $\triangle W_i = \cdot E_{i+1} \cdot O_i^T$, the outer product of $E_{i+1}$ and $O_i$.

And for the biases write $\triangle b_i \in \mathbb{R}^{n_i}$ for the vector with $jth$ component $\frac{\partial e}{\partial (b_i)_j}$.

Then $\triangle b_i =E_{i+1}$.

Python Implementation

We will create a class NeuralNetwork and perform our calculations using matrices in numpy.

import numpy as np

class NeuralNetwork(object):
def __init__(self, X, y, parameters):
#Input data
self.X=X
#Output data
self.y=y
#Expect parameters to be a tuple of the form:
#    ((n_input,0,0), (n_hidden_layer_1, f_1, f_1'), ...,
#     (n_hidden_layer_k, f_k, f_k'), (n_output, f_o, f_o'))
self.n_layers = len(parameters)
#Counts number of neurons without bias neurons in each layer.
self.sizes = [layer[0] for layer in parameters]
#Activation functions for each layer.
self.fs =[layer[1] for layer in parameters]
#Derivatives of activation functions for each layer.
self.fprimes = [layer[2] for layer in parameters]
self.build_network()

def build_network(self):
#List of weight matrices taking the output of one layer to the input of the next.
self.weights=[]
#Bias vector for each layer.
self.biases=[]
#Input vector for each layer.
self.inputs=[]
#Output vector for each layer.
self.outputs=[]
#Vector of errors at each layer.
self.errors=[]
#We initialise the weights randomly, and fill the other vectors with 1s.
for layer in range(self.n_layers-1):
n = self.sizes[layer]
m = self.sizes[layer+1]
self.weights.append(np.random.normal(0,1, (m,n)))
self.biases.append(np.random.normal(0,1,(m,1)))
self.inputs.append(np.zeros((n,1)))
self.outputs.append(np.zeros((n,1)))
self.errors.append(np.zeros((n,1)))
#There are only n-1 weight matrices, so we do the last case separately.
n = self.sizes[-1]
self.inputs.append(np.zeros((n,1)))
self.outputs.append(np.zeros((n,1)))
self.errors.append(np.zeros((n,1)))

def feedforward(self, x):
#Propagates the input from the input layer to the output layer.
k=len(x)
x.shape=(k,1)
self.inputs[0]=x
self.outputs[0]=x
for i in range(1,self.n_layers):
self.inputs[i]=self.weights[i-1].dot(self.outputs[i-1])+self.biases[i-1]
self.outputs[i]=self.fs[i](self.inputs[i])
return self.outputs[-1]

def update_weights(self,x,y):
#Update the weight matrices for each layer based on a single input x and target y.
output = self.feedforward(x)
self.errors[-1]=self.fprimes[-1](self.outputs[-1])*(output-y)

n=self.n_layers-2
for i in xrange(n,0,-1):
self.errors[i] = self.fprimes[i](self.inputs[i])*self.weights[i].T.dot(self.errors[i+1])
self.weights[i] = self.weights[i]-self.learning_rate*np.outer(self.errors[i+1],self.outputs[i])
self.biases[i] = self.biases[i] - self.learning_rate*self.errors[i+1]
self.weights[0] = self.weights[0]-self.learning_rate*np.outer(self.errors[1],self.outputs[0])
self.biases[0] = self.biases[0] - self.learning_rate*self.errors[1]
def train(self,n_iter, learning_rate=1):
#Updates the weights after comparing each input in X with y
#repeats this process n_iter times.
self.learning_rate=learning_rate
n=self.X.shape[0]
for repeat in range(n_iter):
#We shuffle the order in which we go through the inputs on each iter.
index=list(range(n))
np.random.shuffle(index)
for row in index:
x=self.X[row]
y=self.y[row]
self.update_weights(x,y)

def predict_x(self, x):
return self.feedforward(x)

def predict(self, X):
n = len(X)
m = self.sizes[-1]
ret = np.ones((n,m))
for i in range(len(X)):
ret[i,:] = self.feedforward(X[i])
return ret


And we’re done! Now to test it we generate some synthetic data and supply the network with some activation functions.

def logistic(x):
return 1.0/(1+np.exp(-x))

def logistic_prime(x):
ex=np.exp(-x)
return ex/(1+ex)**2

def identity(x):
return x

def identity_prime(x):
return 1


First we will try to get it to approximate a sine curve.

#expit is a fast way to compute logistic using precomputed exp.
from scipy.special import expit
def test_regression(plots=False):
#First create the data.
n=200
X=np.linspace(0,3*np.pi,num=n)
X.shape=(n,1)
y=np.sin(X)
#We make a neural net with 2 hidden layers, 20 neurons in each, using logistic activation
#functions.
param=((1,0,0),(20, expit, logistic_prime),(20, expit, logistic_prime),(1,identity, identity_prime))
#Set learning rate.
rates=[0.05]
predictions=[]
for rate in rates:
N=NeuralNetwork(X,y,param)
N.train(4000, learning_rate=rate)
predictions.append([rate,N.predict(X)])
import matplotlib.pyplot as plt
fig, ax=plt.subplots(1,1)
if plots:
ax.plot(X,y, label='Sine', linewidth=2, color='black')
for data in predictions:
ax.plot(X,data[1],label="Learning Rate: "+str(data[0]))
ax.legend()
test_regression(True)


When I ran this it produced the following:

Next we will try a classification problem with a nonlinear decision boundary, how about being above and below the sine curve?

def test_classification(plots=False):
#Number samples
n=700

n_iter=1500
learning_rate=0.05

#Samples for true decision boundary plot
L=np.linspace(0,3*np.pi,num=n)
l = np.sin(L)

#Data inputs, training
X = np.random.uniform(0, 3*np.pi, size=(n,2))
X[:,1] *= 1/np.pi
X[:,1]-= 1

#Data inputs, testing
T = np.random.uniform(0, 3*np.pi, size=(n,2))
T[:,1] *= 1/np.pi
T[:,1] -= 1

#Data outputs
y = np.sin(X[:,0]) <= X[:,1]

#Fitting
param=((2,0,0),(30, expit, logistic_prime),(30, expit, logistic_prime),(1,expit, logistic_prime))
N=NeuralNetwork(X,y, param)
#Training
N.train(n_iter, learning_rate)
predictions_training=N.predict(X)
predictions_training= predictions_training <0.5
predictions_training= predictions_training[:,0]
#Testing
predictions_testing=N.predict(T)
predictions_testing= predictions_testing <0.5
predictions_testing= predictions_testing[:,0]

#Plotting
import matplotlib.pyplot as plt
fig, ax=plt.subplots(2,1)

#Training plot
#We plot the predictions of the neural net blue for class 0, red for 1.
ax[0].scatter(X[predictions_training,0], X[predictions_training,1], color='blue')
not_index = np.logical_not(predictions_training)
ax[0].scatter(X[not_index,0], X[not_index,1], color='red')
ax[0].set_xlim(0, 3*np.pi)
ax[0].set_ylim(-1,1)
#True decision boundary
ax[0].plot(L,l, color='black')
#Shade the areas according to how to they should be classified.
ax[0].fill_between(L, l,y2=-1, alpha=0.5)
ax[0].fill_between(L, l, y2=1, alpha=0.5, color='red')

#Testing plot
ax[1].scatter(T[predictions_testing,0], T[predictions_testing,1], color='blue')
not_index = np.logical_not(predictions_testing)
ax[1].scatter(T[not_index,0], T[not_index,1], color='red')
ax[1].set_xlim(0, 3*np.pi)
ax[1].set_ylim(-1,1)
ax[1].plot(L,l, color='black')
ax[1].fill_between(L, l,y2=-1, alpha=0.5)
ax[1].fill_between(L, l, y2=1, alpha=0.5, color='red')

test_classification()


When I ran this it looked like this.

The top plot shows the performance on the training data, and the bottom the performance of on the testing data. The points are colored according to the net’s predictions, whereas the areas are shaded according to the true classifications.

As you can see neural networks are capable of giving very good models, but the number of iterations and hidden nodes may be large. Less iterations are possible using good configurations of the network i terms of sizes and numbers of hidden layers, and choosing the learning rate well, but it is difficult to know how to choose these well.

It ended up being quicker to choose a larger number of hidden nodes, a large number of iterations, and a small learning rate, than to experiment finding a good choice. In the future I may write about how to use momentum to speed up training.

The code for this post is available here, please comment if anything is unclear or incorrect!

## Neural Networks Part 1: Feedingforward and Backpropagation

Introduction

Last time we looked at the perceptron model for classification, which is similar to logistic regression, in that it uses a nonlinear transformation of a linear combination of the input features to give an output class.

A neural network generalizes this idea by conceiving of a network of nodes which each take as an input a linear combination of the outputs of other nodes, then use some activation function to turn this into an output, such as:

A Feedforward Neural Network

A neural network is just a directed graph. We will be looking at feedforward neural networks, which are acyclic.

Each node is called a neuron. There are three main kinds of neurons:

• Input neurons, one neuron for each feature, so if our input were vectors in $\mathbb{R}^3$, we would have 3 input neurons. These neurons have trivial activation functions and no incoming edges.
• Output neurons, one neuron for each output feature, so if our so if our output were vectors in $\mathbb{R}^3$, we would have 3 output neurons. These will typically have a nonlinear activation function, chosen so that the output range matches our training data. These neurons have no outgoing edges.
• Hidden neurons are in between input and output neurons and have both incoming and outgoing edges.

Since the graph is acyclic, we can think of it as being composed of layers, as in the following picture taken from wikipedia:

although note there can be many hidden layers. Each edge has a weight, a real number.

The neural network makes a prediction by feeding forward the input. To describe this we develop some notation. The discussion that follows is indebted to this very helpful page.

Notation

Let $N$ be a neural network and $G$ index the vertices of $N$ so that each $N_i$ is a vertex for $i \in G$.

For each $i \in G$ let $T_i$ denote the set of indices of the targets of $N_i$. That is the set of indices $X \subset G$ such for each $j \in X$ there exists a directed edge $e_{ij}$ from $N_i \to N_j$. We may also conflate $T_i$ with the set of indexed neurons if the context makes this unambiguous.

Similarly, for each $i \in G$ let $S_i$ denote the set of indices of the sources of $N_i$.

For each index $i$, let $f_i: \mathbb{R} \to \mathbb{R}$ denote the activation function of $N_i$.

For each index $i$, let $O_i(x) \in \mathbb{R}:=f_i(x)$ be the output of $N_i$. When the argument $x$ is understood, we may simply write $O_i$.

For each directed edge $e_{ij}$ define $w_{ij} \in \mathbb{R}$ to be the weight of the edge.

Where the outputs are understood, for each index $i$, define the input $I_i$ to be $\sum \limits _{j \in S_i}w_{ji}O_j$, the weighted sum of the outputs of the previous layer.

Feeding Forward

Feeding forward refers to how an input becomes an output or prediction.

As you might have guessed: feeding forward an input vector $x$, the input and output to each input neuron is just the corresponding component of the vector $x$.

From there the input to a neuron in the first hidden layer is simply the sum of the components of $x$, weighted by the weights of the incoming edges to that hidden node.

You repeat this for each hidden node. Then the inputs to the hidden nodes are transformed via their activation functions to their outputs. The weighted sums of these outputs are then passed on as inputs to the next layer, and so on.

This is probably best understood through a simple example.

Here we have 4 neurons. The input $x$is two dimensional, so we have two input neurons shown in blue. We have a single hidden layer with just one hidden neuron, shown in green. Finally we have a single output neuron shown in orange. Hence the output $y$ is one dimensional.

Choosing the Weights: Backpropagation

So we have seen how a neural network can make predictions, but how can we choose the weights so that the predictions are any good? The answer is the same as in logistic regression, we use gradient descent. In order to do this we need to use differentiable activation functions and choose an appropriate error function to minimize.

If $x \in \mathbb{R}^k$ is the input vector, and $y \in \mathbb{R}^n$ is the target vector, let $f_N: \mathbb{R}^k \to \mathbb{R}^n$ be the prediction of the neural network. Then we define the error to be $e(x)= \frac{1}{2}(f_N(x)-y)^2$. This is the natural Euclidean norm of the difference between the target and the prediction. The factor of a half is there for aesthetical reasons, since we will be differentiating and gaining a factor of 2.

The trick to the algorithm is to figure out how the weights contribute to the error, and backpropagation is an elegant way of assigning credit/blame to each node and edge. We fix an input $x$, and write $e$ for $e(x)$, and for each each edge $e_{ij}$ we are seeking $\frac{\partial e}{\partial w_{ij}}$.  Once we have found this, we set $w_{ij} \mapsto w_{ij} - \lambda \frac{\partial e}{\partial w_{ij}}$, where $\lambda$ is the learning rate.  So we are doing stochastic gradient descent as in the perceptron.

It is clear that an edge $e_{ij}$ and its weight $w_{ij}$ contribute to the error through the target $N_j$‘s input. So it makes sense to divide and conquer using the chain rule, that is

$\frac{\partial e}{\partial w_{ij}} = \frac{\partial e}{\partial I_j} \cdot \frac{\partial I_j}{\partial w_{ij}}$.

The first part of the product is worth giving a name,  we denote it by $E_j:=\frac{\partial e}{\partial I_j}$ and call it the error at the node.

The second part of the product is simple to calculate:

$\frac{\partial E_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum \limits _{k \in S_j}w_{kj}O_k = O_i$

Returning to $E_j$, we can calculate it recursively, assuming the errors for the neurons in $T_j$ have already been calculated.

Observe that $e$ can be thought of as a function of the variables $O_k$ for $k$ indexing any layer, and so using the chain rule we have

$E_j = \frac{\partial e}{\partial I_j}$

$= \sum \limits_{k \in T_j} \frac{\partial e}{\partial I_k}\cdot \frac{\partial I_k}{\partial I_j}$

$= \sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial I_j}$

$=\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j} \cdot \frac{\partial O_j}{\partial I_j}$

$=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j}$

$=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \left ( \frac{\partial}{\partial O_j} \sum \limits _{l \in S_k}w_{lk}O_l \right )$

$=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot w_{jk}$

notice that if we assume that each layer is fully-connected to the next, then we could write the above succinctly as an inner product, and inner products can be calculated very efficiently, for instance using a GPU, so it is often faster to have more connections!

Together this gives

$\frac{\partial e}{\partial w_{ij}}=O_i \cdot f_j'(I_j)\cdot\sum \limits_{k \in T_j} E_k \cdot w_{jk}$.

Since we have calculated the error at a node recursively, we had better deal with the base case –  the output neurons. But this follows simply from the definition of  $e$ to be $E_i=f'(I_i)(O_i-y_i)$, with the factor of 2 cancelling with the half as promised.

Bias Neurons

The final piece of the puzzle are called bias neurons, which I had left out before to simplify the discussion. Each layer, except the output layer, has a bias neuron. A bias neuron has no inputs, and its output is always 1. Effectively it just adds the weight of its outgoing edges to the inputs of the next layer.

These behave just like regular neurons, only when updating their weights, we note that since their input is effectively constant, they have $\frac{\partial I_j}{\partial w_{ij}}=1$, and so we need only calculate the error at a bias neuron to update its weights.

Next Time

In the next installment I will show how to compactly represent the process of feeding forward and backpropagation using vector and matrix notation assuming the layers are fully connected. We will then use this to give a Python implementation.

## Enter the Perceptron

Introduction

Some time ago in my post on logistic regression, I talked about the problem where you have a $n \times p$ matrix $X$ representing $n$ data points with $p$ features, and a vector $y \in \{0,1 \}^n$ of classifications.

In linear regression you would produce a vector $\beta \in \mathbb{R}^p$ so that $X\beta \approx y$, but the problem here is that our output could be any real number, not $\{0,1\}$. The solution we investigated was using the logistic link function $p(x) = \frac{1}{1+\exp(-x)}$ which turns any real number into something we interpret as a probability.

That was one solution, another even simpler one is the function defined $s(x) = \left \{\begin{array}{cc} 1 & x \ge 0 \\ -1 &x<0\end{array} \right.$.  It is convenient in this case to think of the response as taking values in $\{-1,1 \}$ not $\{0,1\}$.

Previously we added a column of ones to $X$ in order to create an intercept term, here we will write it separately so so that $X \mapsto X\beta + b$ where $b \in \mathbb{R}^n$ is called the bias.

This post will take the book ‘The Elements of Statistical Learning’ as its guide (the section starting on p. 130), which is freely available online, but there is also a great blog post about it here.

The Algorithm

So define $f(x):=\beta^Tx+b$, we classify a point as being in the positive class if $f(x)>0$ and negative if $f(x) <0$. Geometrically we have an affine hyperplane $L:=\{x:f(x)=0 \}$, and points are classified depending on whether they are ‘above’ or ‘below’ the hyperplane.

The unit normal $u$ to $L$ is given by $\beta/ \Vert \beta \Vert$, points above $L$ then mean points with positive projection onto this vector.

The signed distance of a point $x$ to $L$ is found by projecting the vector $x-x_0$ onto $u$, where $x_0$ is the ‘origin’ of the hyperplane (since an affine hyperplane is a translation of a linear subspace, there is a point corresponding to the translation of the origin). So we get

$u^T(x-x_0)= \beta^T/ \Vert \beta \Vert (x-x_0) = (\beta^Tx+b)/ \Vert \beta \Vert = f(x)/ \Vert \beta \Vert = f(x)/ \Vert f'(x) \Vert$,

notice this means that $f(x)$ is proportional to the signed distance of $x$ from $L$.

If $M$ is the set of misclassified points, then we can measure the error as the sum of the distances (up to a constant of proportionality), of the misclassified points to $L$, that is

$D(\beta, b):=-\sum \limits _{x \in M}y_x f(x)$,

where $y_x$ is the response of $x$, notice that this makes each summand negative, which is reversed by the minus sign outside.

If we fix $M$ we can differentiate with respect to $\beta$ and $b$, giving

$\frac{\partial}{\partial \beta}D(\beta, b)= -\sum \limits _{x \in M}y_xx$,

and

$\frac{\partial}{\partial b}D(\beta, b)= -\sum \limits _{x \in M}y_x$.

Just like in logistic regression, we can use these partial derivatives to search for a minimum of $D$. The algorithm purposed by Frank Rosenblatt, the inventor of the perceptron, is to modify the hyperplane after looking at each individual misclassified item in $M$ as it comes, that is $(\beta, b) \mapsto (\beta,b) + \lambda(y_xx,y_x)$, where $\lambda >0$ is the learning rate. This has the advantage of being much faster than computing the global gradient informed by all of the data, and is an example of stochastic gradient descent.

Of course by changing the hyperplane as we go we may change $M$, and so what we actually do is iterate over each datapoint in a fixed order, and check if it is misclassified under the latest hyperplane, then update and move onto the next point. This is then very easy to code.

The Code

First let’s recycle some code we used for logistic regression to get some test data.

import numpy as np
from scipy import linalg
import random as rd
import matplotlib.pyplot as plt
def test_set(a,b, no_samples, intercept, slope, variance, offset):
#First we grab some random x-values in the range [a,b].
X1 = [rd.uniform(a,b) for i in range(no_samples)]
#Sorting them seems to make the plots work better later on.
X1.sort()
#Now we define our linear function that will underly our synthetic dataset: the true decision   boundary.
def g(x):
return intercept + slope*x
#We create two classes of points R,B. R points are given by g(x) plus a Gaussian noise term with positive mean,
#B points are given by g(x) minus a Gaussian norm term with positive mean.
R_turn=True
X2=[] #This will hold the y-coordinates of our data.
Y=[]  #This will hold the classification of our data.
for x in X1:
x2=g(x)
if R_turn:
x2 += rd.gauss(offset, variance)
Y.append(1)
else:
x2 -= rd.gauss(offset, variance)
Y.append(-1)
R_turn=not R_turn
X2.append(x2)
#Now we combine the input data into a single matrix.
X1 = np.array([X1]).T
X2 = np.array([X2]).T
X = np.hstack((X1, X2))
#Now we return the input, output, and true decision boundary.
return [ X,np.array(Y).T, map(g, X1)]


Fairly straightforward stuff I hope! If not then it should make sense when we plot it.

X,Y,g = test_set(a=0, b=5, no_samples=200, intercept=10, slope=0, variance=0.5, offset=1)
fig, ax = plt.subplots()
R =  X[::2] #Even terms
B =  X[1::2] #Odd terms

ax.scatter(R[:,0],R[:,1],marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:,0],B[:,1],marker='o', facecolor='blue', label="Outcome 0")
ax.plot(X[:,0], g, color='green', label="True Decision Boundary")
ax.legend(loc="upper left", prop={'size':8})
plt.show()


This will produce something like:

Now to implement the algorithm.

def perceptron(X,y, learning_rate=0.1, max_iter=1000):
#Expecting X = n x m inputs, y = n x 1 outputs taking values in -1,1
m=X.shape[1]
n=X.shape[0]
weight = np.zeros((m))
bias = 0
n_iter=0
index = list(range(n))
while n_iter <= max_iter:
rd.shuffle(index)
for row in index:
if (X[row,:].dot(weight)+bias)*y[row] <=0: #Misclassified point.
weight += learning_rate*X[row,:]*y[row]
bias += learning_rate*y[row]
n_iter+=1
return {'weight':weight, 'bias':bias}


we shuffle the order we take the data points in at each step to get rid of any peculiarities in the order we update. Now let’s try it on our test data.

result= perceptron(X,Y,0.1)
def decision_boundary(x, weight, bias):
return (-bias - x*weight[0])/weight[1]

fig, ax = plt.subplots()
R =  X[::2] #Even terms
B =  X[1::2] #Odd terms

ax.scatter(R[:,0],R[:,1],marker='o', facecolor='red', label="Outcome 1")
ax.scatter(B[:,0],B[:,1],marker='o', facecolor='blue', label="Outcome -1")
ax.plot(X[:,0], g, color='green', label="True Decision Boundary")
ax.plot(X[:,0], [decision_boundary(x, result['weight'], result['bias']) for x in X[:,0]],
label='Predicted Decision Boundary')
ax.legend(loc="upper left", prop={'size':8})
plt.show()


Producing:

Not bad! Get the code here.

## My First Genetic Algorithm

‘Genetic’ is one of those words that make things cool, some others are topological, chaos or chaotic, fractal, quantum and recursive. That was not intended to be an exhaustive list. So for a while I have wanted to have a go at making a genetic algorithm and I have finally gotten around to it.

How they work

First let me refer you to the wiki, then I’ll quickly talk about the things I think are important. The main idea is to use an search algorithm inspired by evolution to solve a problem.

There are four core ideas.

Parametrization of Solution

To begin with we assume that a solution takes the form of a binary vector of length $m$. Each component of a solution vector can be thought of as a gene, which can be on or off (1 or 0 resp.).

The vectors I will call individuals, or creatures, or candidate solutions, or vectors. The population is a set of individuals, to begin with we initialize the population randomly or however you like.

Fitness

We also assume that we have a function $f: 2^n: \to \mathbb{R}$, this measures an individual’s fitness. Fitness is how well the solution, parametrized by the binary vector, solves your problem. So we will assume that bigger numbers mean better solutions.

Breeding

Once we have a population, we assess their fitness.  Now we want to create new individuals, hopefully with better fitness. We do this by combing different parent solutions to create child solutions. One simple way to do this is called uniform crossover , a function taking solutions to solutions is called a genetic operator, a binary genetic operator is a crossover. Uniform crossover simply means that the child’s ith gene is equal to the father’s or mothers with equal probability (alert! hetronormative language in use! please evacuate the building! ). But not everybody get’s to be a parent, ideally you want the chance that a creature has a child to depend in some way on its fitness.

One way of selecting parents is called tournament selection. The idea is that you grab k creatures from your population, and the one with the highest fitness is selected. The parameter k then controls selection pressure.

Mutation

‘Mutation’ also happens to be one of the cool words. In this context it is an unary genetic operator, that we think of as perturbing a solution vector. There are many ways to do this, perhaps to the simplest is to flip each gene from on to off or vice versa with probability $p$ called the mutation rate.

The problem with this can be that it encourages genes to be on where there are few on and off when there are many on. If the number of on genes is part of what is interesting, say if your vector represented variable selection in a machine learning model, then you do not want the mutation to be biased towards a certain number of on genes.

This means that ideally you should use a kind of mutation suitable for your problem, but even if you don’t make the best choice, natural selection will combat biases in the mutation, more or less successfully.

Since I want the expected number of on genes to be the same after mutation, my approach is to, with probability $p$ pick an on gene at random and turn it off and with probability $p$ pick an off gene and turn it on, repeated $n$ times.

Ok I have talked enough about the theory, let’s make one! First we need a problem to solve.

The Problem

I will randomly choose an integer $k$ between $0$ and $2^n$. Your job is to guess what $k$ is.

The information you are given is that after each guess $\phi$, I will select $N$ random integers from the same range,  write it as a binary vector, and then I will then tell you how often either:

1. The random integer is both $\le \phi$ and $\le k$ coordinate-wise; or

2. The random integer is both not $\le \phi$ and  not  $\le k$ coordinate-wise.

So I will tell you an integer from $0, \dots, N$ that describes the size of the subset for which $\phi, k$ are upper bounds and the size of the subset for which $\phi, k$ are lower bounds.

The Code

You can view it statically here, or download the notebook and play with it yourself, or just get the source – happy evolutions!

## Creating Dummy Variables in Pandas

Last time we implemented logistic regression, where the data is in the form of a numpy array. But what if your data is not of that form? What if it is a pandas dataframe like the Kaggle Titanic data?

Specifically the problem is variables like ‘Title’ where we have four strings ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’ as values. The solution is to create a column for each value, for example ‘Title_Mr’, with values 0,1 depending on whether the data point has that value. Reading Wes McKinney’s book again, he explains what to do, and so I have created a handy little function using his example.

It will accept as input a dataframe, and a dictionary telling which variables are nominal. It will then replace each nominal variable in your dataframe with a set of dummy columns, and also update your data type dictionary. Simple but effective.

def dummy_variables(data, data_type_dict):
#Loop over nominal variables.
for variable in filter(lambda x: data_type_dict[x]=='nominal',
data_type_dict.keys()):

#First we create the columns with dummy variables.
#Note that the argument 'prefix' means the column names will be
#prefix_value for each unique value in the original column, so
#we set the prefix to be the name of the original variable.
dummy_df=pd.get_dummies(data[variable], prefix=variable)

#Remove old variable from dictionary.
data_type_dict.pop(variable)

#Add new dummy variables to dictionary.
for dummy_variable in dummy_df.columns:
data_type_dict[dummy_variable] = 'nominal'

#Add dummy variables to main df.
data=data.drop(variable, axis=1)
data=data.join(dummy_df)

return [data, data_type_dict]


Easy. Now once everything is numeric, we can as use np.asarray to cast our dataframe to a numpy array.