## Understanding scan() in Theano

This is just short post since the meat is in an ipython notebook. I discuss the scan function in Theano, used for looping, iteration over tensors and recurrences. If you are a complete beginner to theano, check out my previous post. It seemed a bit obscure when I first started trying to understand it, so I thought I would share my learning process and hopefully speed up yours!

My sources for learning were these two great resources: this and that.

OK check it out here and let me know what you think.

## Support Vector Machines and the Kernel Trick

Today we will be discussing a machine learning algorithm that is probably the most mathematically sophisticated of all the algorithms we have looked at so far.

Don’t let that daunt you though! Not all the details are essential for understanding the general idea, and it can certainly be used without understanding the maths, although if you are content with this then there’s not much point reading a blog like this!

This post is based on what I have learned from reading the relevant chapter in the excellent book: Elements of Statistical Learning.

The Setup

The premise is exactly the same as the perceptron algorithm, we have input data $x_1, \dots , x_N \in \mathbb{R}^k$ with class labels $y_1, \dots, y_N \in \{-1,1\}$.

In order to classify the inputs, we seek a seperating hyperplane $H$. Recall that a hyperplane is just a plane that may not pass through the origin, and can be described as

$H= \{ x \in \mathbb{R}^k : \langle x, \beta \rangle =b \}$ for some $\beta \in \mathbb{R}^k$ and $b \in \mathbb{R}$.

It is handy to assume that $\beta$ is a unit vector, and geometrically it is the normal vector to the hyperplane. I will refer to $b$ as the bias. If the bias is zero, then $H$ is just a regular plane, and if it is nonzero $H$ is parallel to this plane, the bias controls how far ‘above’ or ‘below’ the origin $H$ is. In fact if $H_0 =\{x: \langle x,\beta \rangle =0 \}$, then $H=H_0 + b\beta$.

If we define

$f: \mathbb{R}^k \to \mathbb{R}$

by

$f(x) = \langle x,\beta \rangle - b$,

then $H = f^{-1}(0)$.

Moreover, $f(x)$ is the signed distance of $x$ from $H$. Note that if $\beta$ was not assumed to be a unit vector, it would instead be proportional to the signed distance. If $f(x) >0$ we say that $x$ is above $H$, and if $f(x) < 0$ we say that $x$ is below $H$.

To classify a point $x$, we say it has label $1$ if above $H$, and $-1$ if below.

The Problem(s)

If this is possible, then the perceptron algorithm will converge to a hyperplane that correctly classifies all of the points.

One problem is that there are infinitely many such hyperplanes, intuitively we would like to choose one to maximize the distances of all the points from it. If there are points close to the decision boundary, then it is likely that unseen points from the same distribution might up on the wrong side of the hyperplane and be misclassified. The minimum distance of all the points to $H$ is called the margin of the decision boundary, so we would like to maximise the margin.

Another problem is that the data might be too mixed up for there to be a seperating hyperplane, no matter which one you choose some points get misclassified. What we need is a way of choosing the hyperplane that does the best job of trying to seperate the points.

Finally it may be that a hyperplane is a bad approximation to the true decision boundary, we need something wiggilier.

Let’s solve these problems in turn.

Maximizing the Margin

The first step in solving a machine learning problem is to figure out how to phrase it as an optimization problem, so let’s try that.

Leting $M$ denote the margin, we would like to

maximize $M$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge M$ for $i=1, \dots, N$, $M>0$ and $\Vert \beta \Vert =1$.

Note that $y_i f(x_i) >0$ implies that $x_i$ is correctly classified since their signs must be the same for the product to be positive.

Now we just do a litle finessing, notice that

$y_if(x_i) \ge M$, ie $y_i(\langle x_i, \beta \rangle -b) \ge M$

is the same as

$y_i(\langle x_i, \frac{\beta}{M} \rangle -\frac{b}{M}) \ge 1$,

and so the optimization problem is equivalent to:

maximize $M$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$, $M>0$ and $\Vert \beta \Vert =1/M$,

and so maximizing $M$ is the same as minimizing the norm of $\beta$ in this formulation, giving:

minimize $\Vert \beta \Vert$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$.

Finally minimizing the norm is the same as minimizing half the square of the norm, and the latter is much more convienient since it doesn’t have any of those pesky square roots:

minimize $\frac{1}{2}\Vert \beta \Vert^2$ over $\{\beta, b\}$,
subject to $y_if(x_i) \ge 1$ for $i=1, \dots, N$.

Now that we have the optimization problem setup, we use the Karush-Kuhn-Tucker conditions, normally when trying to minimize a function you look for places where the derivative is zero, and this the extension of that idea when you have constraints.

The conditions firstly say that

$\beta= \sum \limits _{i=1}^N \mu_i y_i x_i$,

and

$\sum \limits _{i=1}^n y_i \mu_i =0$,

where

$\mu_i \ge 0$ for $i=1, \dots, N$

and

$\mu_i y_i(1-\langle x_i, \beta \rangle)=0$ for $i=1, \dots, N$.

The last condition shows that $\mu_i > 0$ only for those $x_i$ such that $\langle x_i, \beta \rangle=1$, these are the $x_i$ that are ‘on the margin’, those input vectors closest to the optimal hyperplane.

This means that $\beta = \sum \limits _{i=1}^N \mu_i y_i x_i$ is determined only by these $x_i$ on the margin, we call these the support vectors.

The optimization problem can be solved using quadratic programming techniques, it is also sometimes useful to look at the equations given by the KKT conditions and convert it into the ‘dual’ optimization problem in terms of the $\alpha_i$.

Compromising

Life isn’t always easy, and sometimes the data is not linearly seperable. It is desirable that our algorithm be able to tolerate some errors and try to minimize them, otherwise, even if the data are linearly seperable, the algorithm may fall over itself trying to fit a few bad apples in the data, rather than going for a more natural fit that will hopefully generalize better.

We do this by introducing slack variables $\varepsilon_i \ge 0$ for $i=1, \dots, N$ and changing our constraint from

$y_if(x_i) \ge M$

to

$y_if(x_i) \ge M(1-\varepsilon_i)$,

and as before this changes to

$y_i f(x_i) \ge 1 -\varepsilon_i$.

Of course we want to minimize the errors, and so we can recast our optimization problem as

minimize $\frac{1}{2} \Vert \beta \Vert^2+ C\sum \limits_{i=1}^N\varepsilon_i$,

subject to $y_if(x_i) \ge 1 - \varepsilon_i$, $\varepsilon_i \ge 0$ for $i=1, \dots, N$,

where $C$ is a cost parameter that we need to choose, controlling our tolerance for mistakes.

The KKT conditions for this new problem are

$\beta = \sum \limits _{i=1}^N \alpha_i y_i x_i$,

$\sum \limits _{i=1}^n y_i \alpha_i =0$,

$\alpha_i = C - \mu_i$,

where $\mu_i \ge 0$ for $i=1, \dots, N$.

Instead of trying to minimize an objective function subject to several complex constraints, it is often easier to collect the constraints into the Lagrangian and minimize that, subject to simpler constraints, this is called the penalty method.

In our case the Lagrangian is

$L_P(\beta, b, \varepsilon, \alpha, \mu) = \frac{1}{2}\Vert \beta \Vert^2 + C \sum \limits _{i=1}^N \varepsilon_i - \sum \limits _{i=1}^n \mu_i \varepsilon_i - \sum \limits _{i=1}^N \alpha_i(y_i(x_i^T\beta+b)-(1-\varepsilon_i))$,

and minimizing our objective function is equivalent to minimizing $L_P$ with respect to $\beta, b, \varepsilon$ whilst at the same time maxmizing $L_P$ with respect to $\mu, \alpha$. This is why this is called a saddle-point method.

The constraints for this equivalent problem are $\alpha, \mu \ge 0$ and $\sum \limits_{i=1}^N \alpha_i y_i=0$, the others are taken care of automagically.

Now we can substitute the KKT condition equations above to simplify $L_P$, giving the dual Lagrangian

$L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$.

So our new problem is

maximize $L_D$

subject to $\mu \ge 0, \alpha \ge 0$ and $\sum \limits _{i=1}^N \alpha_i y_i=0$,

but $\mu$ doesn’t appear in $L_D$! But then we notice that the KKT condition

$\alpha_i = C -\mu_i$

is equivalent to constraining $\alpha_i \in [0, C]$.

So our final formulation is

maximize $L_D$

subject to $0 \le \mu \le C$ and $\sum \limits _{i=1}^N \alpha_i y_i=0$.

A much simpler formulation! In fact this is a convex quadratic programming problem, so it can be solved using out of the box methods from a library.

The Kernel Trick

We’ve now seen how to cope with errors by using a cost parameter $C$, this is a form of regularization that can control overfitting. But what if the true decision boundary is geninuely something a bit more complex than a hyperplane? The solution to this problem is to change coordinates from $\mathbb{R}^k \to \mathbb{R}^m$ using some mapping $T$, then fit a hyperplane $H$ in this new space, then the pullback $T^{-1}H$ can be a nonlinear surface in $\mathbb{R}^k$.

This is of course a standard practice for any machine learning algorithm, feature engineering. What is special about support vector machines is that the function to be minimized is

$L_D(\alpha) = \sum \limits _{i=1}^N \alpha_i - \frac{1}{2} \sum \limits_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$,

notice that you can compute this using only inner products in whatever coordinate system you use. This means that if you have a function

$k: \mathbb{R}^k \times \mathbb{R}^k \to \mathbb{R}$

such that

$k(x,y) = \langle T(x), T(y) \rangle$,

then there is no need to work directly in the transformed space at all! In fact you don’t even need to know what $T$ is, as long as you have $k$.

This is a big deal because this can be much more computationally efficient if you want to transform to a very high dimensional space. This is what is called the kernel trick, because we get to have the benefits of a richer feature space on the cheap.

So in order to do this, you just have to dream up a kernel $k$, to make sure that your $k$ actually defines an inner product in some (Hilbert) space, by Mercer’s Theorem what you need to check is that, for any choice of vectors $x_1, \dots, x_n$, the matrix $[k(x_i, x_j)]_{i,j}$ is positive semi-definite.

Some common choices are:

dth-degree polynomial kernel: $k(x,y)=(1+\langle x,y \rangle)^d$,

radial basis kernel: $k(x,y) = \exp(\gamma \Vert x-y \Vert^2)$.

I have decided not to do a python implementation this time, because I would have to get in to quadratic programming to do it properly, perhaps I will write a post on this topic in the future.

## Theano, Autoencoders and MNIST

Recently I have finally upgraded my ancient laptop. Other than the ability for me to play the occasional video game, this means that I now have a dedicated Nvidia graphics card, in particular one that supports something called CUDA.

Now that I have the hardware, I thought that it would be fun to check out this python library theano. As I understand it, it has two very interesting aspects.

One is that it does symbolic computation, that is it represents variables as in mathematics, and their manipulations.

A variable in computer science is something like a integer or a string or  a list etc…, a piece of information stored somewhere in memory with a name made up by the programmer, and the compiler takes care of associating the human understandable name to the address of the information in memory.

A variable in mathematics is like $x$ in $x^2 + x + 1$, an indeterminate quantity that can be manipulated using arithmetic operations, or using calculus and so on. So what in mathematics we call a variable corresponds roughly to a function in computer science.

The difference between symbolic computation and its variables, and regular functions, is that the programming language is aware of the mathematical nature of a variable, such that when you apply operations to variables to create new variables, it does this in an efficient way. The best way to understand this is a simple example. Let’s suppose we have two functions $f(x) =x, g(x) = -x$. Now we can combine these functions by adding them to make a new function $h(x) = f(x) + g(x)$.

Now in python say, when you call $h$ with an argument, it will pass this argument to both $f$ and $g$ and then add the result and return it.

If we are doing symbolic computation, and $f,g$ were variables with $f=-g$, then the variable $h=f+g=0$ constantly. And so the computer does no work at all when using $h$, because it knows that it must always be zero.

So one important aspect is that symbolic computation can simplify and hence often optimize functions automatically.

Another is that theano can automagically differentiate variables with respect to variables! This might not sound like a big deal, but it is. In machine learning, we often create a complicated function, like in logistic regression or neural networks, and then derive a helpful form for the derivative to do gradient descent to optimize it. In theano, all you have to do is specify the objective function, and it will let you do gradient descent automatically.

The second amazing thing about theano is that it compiles its theano functions into C, and so it often runs much much faster than native python or numpy. It can also make it easier to run computations on a gpu which can mean a 100x speedup for the right cases!

Hopefully I’ve made you curious, so  click here for an installation guide. On this linux machine it was very simple. I am dual booting windows 8.1 and Ubuntu at the moment, it is very tricky to get the gpu aspect to work on windows, but here is a guide for the brave. To check that the GPU aspect is working, run the test code here.

Introduction to Theano

I recommend that you read the tutorials here, and also this excellent ipython worksheet here.

Let’s little play with the basic functionality.

import theano as th
from theano import tensor as T

#To make a variable, specify the type of variable from the tensor
#library.

#We can have scalars.
x=T.dscalar(name='scalar_x')

#Note that the name is just a string that theano will use
#to communicate to us about the variable.
print x
print x.type
#We can also have vectors.
v = T.dvector(name='vector_v')
print v
print v.type
#And matrices.
A = T.dmatrix(name='matrix_A')
print A
print A.type

#We can also make new variables using standard mathematical operations.
x_2 = x*x
print x_2
print x_2.type

#We can also make python functions which return new variables.
def power(variable, n):
return variable**n
x_10 = power(x,10)
print x_10
print x_10.type

#We can of course do standard linear algebra operations also.
Av = T.dot(A,v)
print Av
print Av.type

#Of course when a variable is actually evaluated, you must
#ensure that the dimensions are correct, or you will
#get an error.

#To see the value of a variable for a particular value of the variables
#comprising it, we make a theano function.
f = th.function([A,v], [Av])
#The syntax is a list of input variables, and then a list of output variables.

#The python code takes a little longer to run initially, because thenao
#compiles the function into C, but thereafter it will run extremely fast.

#Let's try using the function.
import numpy as np
m=np.ones((2,2))
print m
vec = np.asarray([1,2])
print vec
print f(m,vec)

#Now we can try computing gradients.
#First let's make a scalar variable by taking an inner product.
w=T.dvector(name='w_vector')
vTw = T.dot(v,w)

#Now we take the gradient with respect to w.

#Now let's test it.
vec1 = np.asarray([1,2])
vec2 = np.asarray([0,0])

#To evaulate a variable given inputs, there is another syntax
#in additon to creating a thenano function.

#Next we discuss theano's shared variables,
#these differ from regular theano variables in that
#theano variables only have a value within a theano function
#whereas shared theano variables have a value independent of
#being called in a function.
w=T.shared(name='shared_matrix', value=np.ones((2,3)))
print w
print w.type
print w.get_value()

#You can also set a shared variable's value.
w.set_value(np.zeros((2,3)))
print w.get_value()

#You can also have theano functions update shared variables,
#we illustrate this with a silly example that updates a matrix
x=T.dvector('x')
wx=T.dot(w,x)
cost = wx.mean()

#Notice the syntax of updates argument, should be a list
#of two tuples of the form: (variable_to_be_updated, updated_variable).
print f([1,1,1])
print w.get_value()

If you want to tinker with this, get the sourcecode here.

Autoencoders

I don’t want to go into too much detail here, but an autoencoder is a feedforward neural network with a single hidden layer with the twist being that its target output is exactly the same as its input. If the number of hidden neurons is less than the number of input/output neurons, then the network will be forced to learn some sort of compression.

Even when the number of hidden neurons is large, there are various tricks in cost functions you can use to ensure that the network does something more interesting than learning the identity function.

The hope of an autoencoder is that it learns some structure in the data, so that the hidden layer can be thought of as a feature space that one can use for a supervised learning algorithm.

Basically then we have $x \mapsto f_1(W_1x+b_1) =h$ the hidden layer, then $h \mapsto f_2(W_2h+b_2)$, where $b_1, b_2$ are bias vectors, $W_1, W_2$ are matrices, and $f_1, f_2$ are element-wise non-linearities like the logistic function.

A simplification that is often made is to enforce $W_1 = W_2^T$. This gives the model less free parameters to learn, and so is a form of regularization. It also uses less memory. This is referred to as an autoencoder with tied weights.

Autoencoders with Theano

Thanks to automatic differentiation, backprogation is effortless with theano, and thanks to gpu computation, much faster also! I recommend having a look at my python implementation of backpropagation just so you can see the effort saved. This implementation benefited greatly from this tutorial.

import numpy as np
import theano as th
from theano import tensor as T
from numpy import random as rng

class AutoEncoder(object):
def __init__(self, X, hidden_size, activation_function,
output_function):
#X is the data, an m x n numpy matrix
#where rows correspond to datapoints
#and columns correspond to features.
assert type(X) is np.ndarray
assert len(X.shape)==2
self.X=X
self.X=th.shared(name='X', value=np.asarray(self.X,
dtype=th.config.floatX),borrow=True)
#The config.floatX and borrow=True stuff is to get this to run
#fast on the gpu. I recommend just doing this without thinking about
#it until you understand the code as a whole, then learning more
self.n = X.shape[1]
self.m = X.shape[0]
#Hidden_size is the number of neurons in the hidden layer, an int.
assert type(hidden_size) is int
assert hidden_size > 0
self.hidden_size=hidden_size
initial_W = np.asarray(rng.uniform(
low=-4 * np.sqrt(6. / (self.hidden_size + self.n)),
high=4 * np.sqrt(6. / (self.hidden_size + self.n)),
size=(self.n, self.hidden_size)), dtype=th.config.floatX)
self.W = th.shared(value=initial_W, name='W', borrow=True)
self.b1 = th.shared(name='b1', value=np.zeros(shape=(self.hidden_size,),
dtype=th.config.floatX),borrow=True)
self.b2 = th.shared(name='b2', value=np.zeros(shape=(self.n,),
dtype=th.config.floatX),borrow=True)
self.activation_function=activation_function
self.output_function=output_function

def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.1):
index = T.lscalar()
x=T.matrix('x')
params = [self.W, self.b1, self.b2]
hidden = self.activation_function(T.dot(x, self.W)+self.b1)
output = T.dot(hidden,T.transpose(self.W))+self.b2
output = self.output_function(output)

#Use cross-entropy loss.
L = -T.sum(x*T.log(output) + (1-x)*T.log(1-output), axis=1)
cost=L.mean()

#Return gradient with respect to W, b1, b2.

#Create a list of 2 tuples for updates.
for param, gparam in zip(params, gparams):

#Train given a mini-batch of the data.
givens={x:self.X[index:index+mini_batch_size,:]})

import time
start_time = time.clock()
for epoch in xrange(n_epochs):
print "Epoch:",epoch
for row in xrange(0,self.m, mini_batch_size):
train(row)
end_time = time.clock()
print "Average time per epoch=", (end_time-start_time)/n_epochs

def get_hidden(self,data):
x=T.dmatrix('x')
hidden = self.activation_function(T.dot(x,self.W)+self.b1)
transformed_data = th.function(inputs=[x], outputs=[hidden])
return transformed_data(data)

def get_weights(self):
return [self.W.get_value(), self.b1.get_value(), self.b2.get_value()]

And that is it! Now we should test this puppy out, so let’s find some data.

MNIST

The MNIST data is famous in machine learning circles, it consists of single handwritten digits. Nearly every paper written on neural networks and so on tests their contribution on the data. The task is to classify the digits, but we will just test our autoencoder. First we will steal a very nice helper function to load the data set, this will download it for you if you don’t have it already. You can find it here.

import cPickle
import gzip
import os

:type dataset: string
:param dataset: the path to the dataset (here MNIST)
'''

#############
#############

data_dir, data_file = os.path.split(dataset)
if data_dir == "" and not os.path.isfile(dataset):
# Check if dataset is in the data directory.
new_path = os.path.join(os.path.split(__file__)[0], "..", "data", dataset)
if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz':
dataset = new_path

if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
import urllib
origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
urllib.urlretrieve(origin, dataset)

f = gzip.open(dataset, 'rb')
f.close()
#train_set, valid_set, test_set format: tuple(input, target)
#input is an numpy.ndarray of 2 dimensions (a matrix)
#which row's correspond to an example. target is a
#numpy.ndarray of 1 dimensions (vector)) that have the same length as
#the number of rows in the input. It should give the target
#target to the example with the same index in the input.

return (train_set, valid_set, test_set)

path="/home/harri/Dropbox/Work/Blogs/triangleinequality/Theano/data/mnist.pkl.gz"

We also want a plotting function to visualize the features learned, I won’t explain this here, but if you google “visualizing features of neural network” you can learn plenty about it.

def plot_first_k_numbers(X,k):
from matplotlib import mpl,pyplot
m=X.shape[0]
k=min(m,k)
j = int(round(k / 10.0))

fig, ax = pyplot.subplots(j,10)

for i in range(k):

w=X[i,:]

w=w.reshape(28,28)
ax[i/10, i%10].imshow(w,cmap=pyplot.cm.gist_yarg,
interpolation='nearest', aspect='equal')
ax[i/10, i%10].axis('off')

pyplot.tick_params(\
axis='x',          # changes apply to the x-axis
which='both',      # both major and minor ticks are affected
bottom='off',      # ticks along the bottom edge are off
top='off',         # ticks along the top edge are off
labelbottom='off')
pyplot.tick_params(\
axis='y',          # changes apply to the x-axis
which='both',      # both major and minor ticks are affected
left='off',
right='off',    # ticks along the top edge are off
labelleft='off')

fig.show()

Now we put it together.

def m_test(data):
X=data[0][0]
activation_function = T.nnet.sigmoid
output_function=activation_function
A = AutoEncoder(X, 500, activation_function, output_function)
A.train(20,20)
W=np.transpose(A.get_weights()[0])
plot_first_k_numbers(W, 100, False)
plot_first_k_numbers(W,100, True)

m_test(data)

Running this, with 500 hidden units for 20 epochs with minibatch size of 20, the following features were learned:

Nothing too interesting yet, in order to get more useful features we could change the cost function to encourage sparsity, or introduce noise into the inputs, or use a different activation function. Hopefully this has been helpful for those just starting out with theano. Here is the source code.

## How to Balance your Binary Search Trees – AVL Trees

Last time we introduced the binary search tree (BST) and saw that they could do inserts and deletions in $O(h)$ time where $h$ is the height of the tree. We also saw that it could find the successor or predecessor to a node in the same time, and hence that it could sort a list in $O(nh)$ time, where $n$ is the length of the list.

Ideally we would have $h=O(\log n)$, and this is often the case ‘on average’, but the worst case is $h=n$, which will occur if we start with a sorted list! So today we are going to discuss how to have our tree try to balance itself automagically so as to keep its height $O(\log n)$.

AVL-Trees

Recall that the defining characteristic of a BST is the so-called ‘search property’: for any subtree rooted at a node, the stuff on the left is smaller, and the stuff on the right is greater.

We now introduce another property or invariant that we will insist upon and call it the ‘plus or minus property’ (pm) or ‘balance property’, and it is satisfied by a node if the height of its left subtree differs from the height of its right subtree by no more than one. We call a BST that satisfies pm an ‘AVL tree’ (AVL).

So we will give each node an extra field node.height, and node.balance, where node.balance  is equal to the height on the left minus the height on the right. Hence it will be 0 when they are equal, +1 when the left is larger, and -1 when the right is larger.

Before we go any further, you might be wondering whether or not this property forces the tree to have $O(\log n)$ height, and indeed it does, so let’s prove that to ourselves before coercing our trees to have this property.

Proof That AVL Trees are Balanced

Define $n_h$ to be the smallest number of nodes you can build a AVL of height $h$ with. That is, any AVL of height $h$ must have at least $n_h$ nodes.

Now suppose that we had shown that $h=O(\log n_h)$, or in particular that $h \le K \log n_h$ for all $h$. Now take any AVL tree of height $h$ and having $n$ nodes. Then $n \ge n_h \Rightarrow h \le K \log n_h \le K \log n \Rightarrow h=O(\log n)$ as desired.

So we just need to show that $h=O(\log n_h)$. The way to do this is by setting up a recurrence relation. Indeed we will see that $n_{h+2} = n_{h+1}+n_{h}+1$.

Now using this notice that:

$n_{h+2} = n_{h+1}+n_{h} + 1 \ge 2n_h$.

And so if $h=2k$ then applying this repeatedly yields:

$n_h \ge 2^{k-1} \cdot n_2 = 2^{k} = 2^{h/2}$.

On the other hand if $h=2k+1$ then we apply till we get to $n_1=1$ giving:

$n_h \ge 2^k \cdot n_1=2^k \ge 2^{h/2+1}$.

Combining these gives $n_h \ge 2^{h/2}$. Now taking logs base $2$ gives

$\log n_h \ge \frac{h}{2} \Rightarrow h=O(\log n_h) \Rightarrow h=O(\log n)$.

And we are done.

Keeping Track of the AVL Property

Now that we agree that the AVL or pm property is a thing worth having, we should alter our code for a BST so that the nodes have  height and balance fields, and so that the insertion and deletion methods update this accordingly. Once we have done that we will look at how to enforce the balance property.

Now we could use class inheritance and modify the BST class we built last time, but some of the code is a bit kludgy and so I will implement this from scratch, although I won’t go into detail about the stuff that is the same from last time.

We begin with the AVLNode and gift it with a sense of balance.

class AVLNode(object):
def __init__(self, key):
self.key=key
self.right_child=None
self.left_child=None
self.parent=None
self.height=0
self.balance=0

def update_height(self, upwards=True):
#If upwards we go up the tree correcting heights and balances,
#if not we just correct the given node.
if self.left_child is None:
#Empty left tree.
left_height = 0
else:
left_height = self.left_child.height+1
if self.right_child is None:
#Empty right tree.
right_height = 0
else:
right_height = self.right_child.height+1
#Note that the balance can change even when the height does not,
#so change it before checking to see if height needs updating.
self.balance = left_height-right_height
height = max(left_height, right_height)
if self.height != height:
self.height = height
if self.parent is not None:
#We only need to go up a level if the height changes.
if upwards:
self.parent.update_height()

def is_left(self):
#Handy to find out whether a node is a left or right child or neither.
if self.parent is None:
return self.parent
else:
return self is self.parent.left_child

I think the code, and comments, should explain everything if you read my last post. Let’s also pretend that we had a magic function balance(node), which will balance a node and then move up ancestors and do the same, we won’t worry about what exactly it will do, but add it into our insert routine.

class AVLTree(object):
def __init__(self):
self.root =None

def insert(self, key, node=None):
#The first call is slightly different.
if node is None:
#First call, start node at root.
node = self.root
if node is None:
#Empty tree, create root.
node = AVLNode(key=key)
self.root=node
return node
else:
ret= self.insert(key=key, node=node)
self.balance(ret)
return ret
#Not a first call.
if node.key ==key:
#No need to insert, key already present.
return node
elif node.key >key:
child = node.left_child
if child is None:
#Reached the bottom, insert node and update heights.
child = AVLNode(key=key)
child.parent=node
node.left_child = child
node.update_height()
return child
else:
return self.insert(key=key, node=child)
elif node.key < key:
child = node.right_child
if child is None:
#Reached the bottom, insert node and update heights.
child = AVLNode(key=key)
child.parent=node
node.right_child = child
return child
else:
return self.insert(key=key, node=child)
else:
print "This shouldn't happen."

def find(self, key, node=None):
if node is None:
#First call.
node=self.root
if self.root is None:
return None
else:
return self.find(key, self.root)
#Now we handle nonfirst calls.
elif node.key == key:
#Found the node.
return node
elif key < node.key:
if node.left_child is None:
#If key not in tree, we return a node that would be its parent.
return node
else:
return self.find(key,node.left_child)
else:
if node.right_child is None:
return node
else:
return self.find(key, node.right_child)

def delete(self, key, node=None):
#Delete key from tree.
if node is None:
#Initial call.
node = self.find(key)
if (node is None) or (node.key != key):
#Empty tree or key not in tree.
return

if (node.left_child is None) and (node.right_child is not None):
#Has one right child.
right_child = node.right_child
left = node.is_left()
if left is not None:
parent=node.parent
if not left:
parent.right_child=right_child
else:
parent.left_child=right_child
right_child.parent =parent
self.balance(parent)
else:
right_child.parent=None
self.root = right_child
#No need to update heights or rebalance.

elif (node.left_child is not None) and (node.right_child is None):
#Has one left child.
left_child = node.left_child
left= node.is_left()
if left is not None:
parent=node.parent
if left:
parent.left_child=left_child
else:
parent.right_child=right_child
left_child.parent =parent

self.balance(parent)
else:
left_child.parent=None
self.root=left_child
elif node.left_child is None:
#Has no children.
parent = node.parent
if parent is None:
#Deleting a lone root, set tree to empty.
self.root = None
else:
if parent.left_child is node:
parent.left_child =None
else:
parent.right_child=None
self.balance(parent)
else:
#Node has two childen, swap keys with successor node
#and delete successor node.
right_most_child = self.find_leftmost(node.right_child)
node.key = right_most_child.key
self.delete(key=node.key,node=right_most_child)
#Note that updating the heights will be handled in the next
#call of delete.

def find_rightmost(self, node):
if node.right_child is None:
return node
else:
return self.find_rightmost(node.right_child)

def find_leftmost(self, node):
if node.left_child is None:
return node
else:
return self.find_leftmost(node.left_child)

def find_next(self, key):
node = self.find(key)
if (node is None) or (node.key != key):
#Key not in tree.
return None
else:
right_child = node.right_child
if right_child is not None:
node= self.find_leftmost(right_child)
else:
parent = node.parent
while(parent is not None):
if node is parent.left_child:
break
node = parent
parent = node.parent
node=parent
if node is None:
#Key is largest in tree.
return node
else:
return node.key

def find_prev(self, key):
node = self.find(key)
if (node is None) or (node.key != key):
#Key not in tree.
return None
else:
left_child = node.left_child
if left_child is not None:
node= self.find_rightmost(left_child)
else:
parent = node.parent
while(parent is not None):
if node is parent.right_child:
break
node = parent
parent = node.parent
node=parent
if node is None:
#Key is largest in tree.
return node
else:
return node.key
#I also include a new plotting routine to show the balances or keys of the node.
def plot(self, balance=False):
#Builds a copy of the BST in igraphs for plotting.
#left and right children, we build it using a queue.
import igraph as igraphs
G = igraphs.Graph()
if self.root is not None:
queue = [[self.root,0]]
#Queue has a pointer to the node in our BST, and its index
#in the igraphs copy.
index=0
not_break=True
while(not_break):
#At each iteration, we label the head of the queue with its key,
#then add any children into the igraphs graph,
#and into the queue.

node=queue[0][0] #Select front of queue.
node_index = queue[0][1]
if not balance:
G.vs[node_index]['label']=node.key
else:
G.vs[node_index]['label']=node.balance
if index ==0:
#Label root green.
G.vs[node_index]['color']='green'
if node.left_child is not None:
queue.append([node.left_child,index+1])
G.vs[index+1]['color']='red' #Left children are red.
index+=1
if node.right_child is not None:
G.vs[index+1]['color']='blue'
queue.append([node.right_child, index+1])
index += 1

queue.pop(0)
if len(queue)==0:
not_break=False
layout = G.layout_reingold_tilford(root=0)
igraphs.plot(G, layout=layout)

If you want to make sure it works, I have written a small test function.

def test():
lst= [1,4,2,5,1,3,7,11,4.5]
B = AVLTree()
for item in lst:
print "inserting", item
B.insert(item)
B.plot(True)
print "End of inserts"
print "Deleting 5"
B.plot(True)
B.delete(5)
print "Deleting 1"
B.plot(True)
B.delete(1)
B.plot(False)
print B.root.key ==4
print B.find_next(3) ==4
print B.find_prev(7)==4.5
print B.find_prev(1) is None
print B.find_prev(7)==4.5
print B.find_prev(2) is None
print B.find_prev(11) == 7

Great now we know the balance of each node, but how to balance the tree?
Tree Rotations
Before solving our balancing problem I want to introduce a new tool, an operation called a tree rotation that respects the search property. Here’s what a tree rotation looks like:

So there are two moves that are inverses of one another, a right rotation and a left rotation. Staring at the diagram for a little bit you can also see that these moves preserve the search property. Let’s implement this, and then consider when to use it.

We’ll first create a right rotation, then swap the lefts for rights and vice versa to make a left rotation.

The argument to the function will always be the root, that is the parent of the two nodes, so in our diagram above, the label root is appropriate for the right-rotation.

We then move around the the parent-child pointers. Notice that the relation between the root and $C$, and the relation between pivot and $A$, is unchanged. The changes we have to make are that $B$ changes parents, and that root and pivot swap places from parent/child to child/parent.

So the left-child of root, pivot, is changed to the right-child of pivot, $C$,  the right-child of pivot is changed to root, the parent of root is changed to pivot and finally the previous parent of root, if any, becomes the new parent of pivot.

def right_rotation(self, root):
left=root.is_left()
pivot = root.left_child
if pivot is None:
return
root.left_child = pivot.right_child
if pivot.right_child is not None:
root.left_child.parent = root
pivot.right_child = root
pivot.parent = root.parent
root.parent=pivot
if left is None:
self.root = pivot
elif left:
pivot.parent.left_child=pivot
else:
pivot.parent.right_child=pivot
root.update_height(False)
pivot.update_height(False)

And analogously:

def left_rotation(self, root):
left=root.is_left()
pivot = root.right_child
if pivot is None:
return
root.right_child = pivot.left_child
if pivot.left_child is not None:
root.right_child.parent = root
pivot.left_child = root
pivot.parent = root.parent
root.parent=pivot
if left is None:
self.root = pivot
elif left:
pivot.parent.left_child=pivot
else:
pivot.parent.right_child=pivot
root.update_height(False)
pivot.update_height(False)

Now we ought to test it so using the same list as last time:

def test_rotation():
lst= [1,4,2,5,1,3,7,11,4.5]
print "List is ",lst
B = AVLTree()
for item in lst:
print "inserting", item
B.insert(item)
node=B.find(4)
B.plot()
B.left_rotation(node)
B.plot()
B.right_rotation(node.parent)
B.plot()
test_rotation()

Producing the following two plots:

Finally we show how to use tree rotations to keep a AVL tree balanced after insertions and deletions.

Bringing Balance to the Force

After an insertion or deletion, heights can change by at most 1. So assuming that the tree was balanced prior to the operation, if a node becomes unbalanced, it will have a balance of $\pm 2$.

Let’s look at the case where we have a node $N$ with balance $+2$ with some pictures, the other case is the same modulo lefts and rights.

This is called the left-left case. Alternatively we could have the left-right case:

This takes to a similar case to before:

To implement this is simple now that we have the right tools. We want to function to start with the changed node, either the inserted node in insertion, or the parent of the deleted node in deletion, and move up through the ancestors, updating heights and balancing as it goes.

def balance(self, node):
node.update_height(False)
if node.balance == 2:
if node.left_child.balance != -1:
#Left-left case.
self.right_rotation(node)
if node.parent.parent is not None:
#Move up a level.
self.balance(node.parent.parent)
else:
#Left-right case.
self.left_rotation(node.left_child)
self.balance(node)
elif node.balance ==-2:
if node.right_child.balance != 1:
#Right-right case.
self.left_rotation(node)
if node.parent.parent is not None:
self.balance(node.parent.parent)
else:
#Right-left case.
self.right_rotation(node.right_child)
self.balance(node)
else:
if node.parent is not None:
self.balance(node.parent)

We might as well make a sorting routine to wrap the AVL Tree now.

def sort(lst, ascending=True):
A = AVLTree()
for item in lst:
A.insert(item)
ret=[]
if ascending:
node=A.find_leftmost(A.root)
if node is not None:
key = node.key
else:
key=node
while (key is not None):
ret.append(key)
key=A.find_next(key)
else:
node=A.find_rightmost(A.root)
if node is not None:
key = node.key
else:
key=node
while (key is not None):
ret.append(key)
key=A.find_prev(key)
return ret

And that’s it! Please let me know if you spot any mistakes. You can download the source here.

## Binary Search Trees

So in preparation for starting a PhD in Machine Learning/Data Science at the University of Edinburgh .

I have been watching some of the lectures on algorithms and data structures at MIT here.

One of the data structures discussed is the binary search tree (BST), and so in this post I will explain what they are and give a python implementation.

Basics

As you might expect, a BST is a way of organizing data in a binary tree. It consists of a collection of nodes, each of which has a key, corresponding to a piece of data, say an integer. Importantly the keys must be comparable, so that for any two nodes $A, B$ either $A.key \le B.key$ or $B.key \le A.key$.

Each node also has (at most) three pointers: parent, left child and right child.

Most importantly we have the invariant of the BST. An invariant is, of course, something that stays the same, a property that is not altered under any of the data structures allowable operations. The invariant or search property is that for any given node $A$ all of the nodes in the left subtree rooted at $A$ have keys less than $A.key$, and all of the nodes in the right subtree have keys greater than $A.key$.

Any useful data structure has operations. Some obvious things to want to do are to insert and delete. Of course in order to do these things we have to take care not to mess up the search property. To insert we begin at the root and at each stage we ask whether the key to be inserted is lesser, greater or equal to the parent, in which case we move right, left or nowhere respectively. It’s easiest just to see an example:

Now I think you get the idea, so let’s start coding this up. First we will make the BST node.

class TreeNode(object):
#A tree node has a key, which can be compared with other keys,
#and possibly 'pointers' to a parent, left child or right child.
def __init__(self, key, parent=None, left_child=None,
right_child=None):
self.parent=parent
self.left_child=left_child
self.right_child=right_child
self.key=key

Next we will start making the BST. The BST remembers the root, and the pointers do the rest of the work. We begin by implementing a helpful method called ‘find’, which returns a node with a given key, or that could (but doesn’t) have a child with that key.

class BinarySearchTree(object):
def __init__(self, root=None):
self.root=root

def find(self, key):
#Returns a node corresponding to a key.
#If key is the key of some node, returns that node.
#If tree is empty, returns None.
#Otherwise returns a leaf that could accept
#a node with that key as a child.
#This function wraps a call to a recursive function self._find.
if self.root is None:
return None
else:
return self._find(key, self.root)

def _find(self, key, node):
#This is a recursive function that does all the work for
#self.find.
if key == node.key:
return node
elif key < node.key:
if node.left_child is None:
return node
else:
return self._find(key, node.left_child)
else:
if node.right_child is None:
return node
else:
return self._find(key, node.right_child)

Using ‘find’ we can then implement ‘insert’.

def insert(self, key):
#Inserts a node with given key.
#If key already in tree, then returns the node with that key.
#Otherwise creates a new node and returns it.
#This takes time of order the height of the tree.
node = self.find(key)
if node is None:
self.root=TreeNode(key)
return self.root
elif key == node.key:
return node
elif key < node.key:
left_child = TreeNode(key, parent=node)
node.left_child=left_child
return left_child
else:
right_child = TreeNode(key, parent=node)
node.right_child=right_child
return right_child

So far so good!

The next thing we want to do is be able to delete keys.

If a key belongs to a leaf node then this is simple we just delete the node.

If the node has one child, then if it has a parent then when we delete the node we must connect the parent to the child of the deleted node.

If the node has two children things are a bit more tricky. If we delete the node then we break the tree. What we need to do instead is replace the node’s key with something else in the tree. What we replace it with is the rightmost node of the subtree rooted at the node’s left child, as we shall see later this is the successor of the node in the BST. The following diagram should explain:
So we swap the node’s key for this new rightmode node’s, then delete the rightmost node. Since it is rightmost it cannot have a right child, and so it is one of our base cases for the deletion operation. If this is still not clear to you, do some examples on paper and you will soon see the idea. Let’s implement this.

def delete(self, key):
#Delete key from tree.
#If key is not in BST does nothing.
#Otherwise it calls a semi-recursive function _delete.
node = self.find(key)
if (node is None) or (node.key != key):
return
else:
self._delete(node)

def _delete(self, node):
#If the node has less than two children we can delete the node
#directly by removing it and then gluing the tree back together.
#If node has two children, it swaps a key lower down in the tree to
#replace the deleted node, and deletes the lower down node.
if (node.left_child is None) and (node.right_child is not None):
#Has one right child.
right_child = node.right_child
parent = node.parent
if parent is not None:
parent.right_child=right_child
right_child.parent =parent
else:
right_child.parent=None
if node is self.root:
self.root=right_child
elif (node.left_child is not None) and (node.right_child is None):
#Has one left child.
left_child = node.left_child
parent = node.parent
if parent is not None:
parent.left_child=left_child
left_child.parent =parent
else:
left_child.parent=None
if node is self.root:
self.root=left_child
elif node.left_child is None:
#Has no children.
parent = node.parent
if parent is None:
self.root = None
else:
if parent.left_child is not None:
left = parent.left_child is node
else:
left = False
if left:
parent.left_child =None
else:
parent.right_child=None
else:
right_most_child = self.find_leftmost(node.right_child)
node.key = right_most_child.key
self._delete(right_most_child)

def find_rightmost(self, node):
if node.right_child is None:
return node
else:
return self.find_rightmost(node.right_child)

def find_leftmost(self, node):
if node.left_child is None:
return node
else:
return self.find_leftmost(node.left_child)

Sorting

Ok great, we have now covered the basics of how to modify a BST. Now we want to know how to extract information from it. In particular given a key, we want to be able to find the next largest or smallest key in the BST, if any. Calling the successor (predecessor)  operation repeatedly, starting from the left(right)-most  node, will give us the sorted keys in ascending(descending) order.

I will explain how to find the successor, the predecessor case is of course very similar. Let’s suppose we want to find the successor of node $A$ with right-child $B$. In this case we choose the left-most child of $B$ which we will call $S$.

Note that in a subtree rooted at some node, the left-most node is the smallest element in the subtree, and the right-most node is the largest.

So in the subtree rooted at $A$, the right-subtree rooted at $B$ contains all the elements in the subtree greater than $A$. The left-most element of the right-subtree is then the smallest element in the subtree greater than $A$ and is a good candidate for a successor. Let’s draw a picture and try to see why this must in fact be the correct node.

OK, but what if $A$ has no right child? In this case we move up through the ancestors of $A$, searching for the first ancestor that is a left-child of its parent. If there is no such node, ie we get to the root, then $A$ is the right-most, and hence largest, element in the tree, and has no successor. To see that this is correct employ a similar argument to the previous case.

Let’s code that up.

def find_next(self, key):
node = self.find(key)
if (node is None) or (node.key != key):
#Key not in tree.
return
else:
ret= self._find_next(node)
if ret is None:
#Key is largest in tree.
return ret
else:
return ret.key

def _find_next(self, node):
right_child = node.right_child
if right_child is not None:
return self.find_leftmost(right_child)
else:
parent = node.parent
while(parent is not None):
if node is parent.left_child:
break
node = parent
parent = node.parent
return parent

def find_prev(self, key):
node = self.find(key)
if (node is None) or (node.key != key):
#Key not in tree.
return
else:
ret= self._find_prev(node)
if ret is None:
#Key is smallest in tree.
return ret
else:
return ret.key

def _find_prev(self, node):
left_child = node.left_child

if left_child is not None:
return self.find_rightmost(left_child)
else:
parent = node.parent
while(parent is not None):
if node is parent.right_child:
break
node = parent
parent = node.parent
return parent

By now I think we deserve to see the fruits of our hard work, so let’s whip up a plotting routine using igraphs. Don’t worry about the details: since I can’t find a binary tree plotting style in igraphs, I have colored the root green, left children red and right children blue.

In order to preserve information about left and right children, the plotting routine builds the graph in igraphs using a queue.

def plot(self):
#Builds a copy of the BST in igraphs for plotting.
#left and right children, we build it using a queue.
import igraph as igraphs
G = igraphs.Graph()
if self.root is not None:
queue = [[self.root,0]]
#Queue has a pointer to the node in our BST, and its index
#in the igraphs copy.
index=0
not_break=True
while(not_break):
#At each iteration, we label the head of the queue with its key,
#then add any children into the igraphs graph,
#and into the queue.

node=queue[0][0] #Select front of queue.
node_index = queue[0][1]
G.vs[node_index]['label']=node.key
if index ==0:
#Label root green.
G.vs[node_index]['color']='green'
if node.left_child is not None:
queue.append([node.left_child,index+1])
G.vs[index+1]['color']='red' #Left children are red.
index+=1
if node.right_child is not None:
G.vs[index+1]['color']='blue'
queue.append([node.right_child, index+1])
index += 1

queue.pop(0)
if len(queue)==0:
not_break=False
layout = G.layout_reingold_tilford(root=0)
igraphs.plot(G, layout=layout)

Now let’s test it!

def test():
lst= [1,4,2,5,1,3,7,11,4.5]
B = BinarySearchTree()
for item in lst:
B.insert(item)
B.plot()
B.delete(5)
B.delete(1)
B.plot()
print B.root.key
print B.find_next(3)
print B.find_prev(7)
print B.find_prev(1)

test()

This produced the following output:

and
“4
4
4.5
None” to the console, as desired.

You can probably see that the operations take $O(h)$ time where $h$ is the height of the tree. This means that we can sort a length $n$ list in $O(nh)$ time. On average $h = O(\log n)$, but if given a monotonic list we will get a linear tree with $h=n$. So next time we will see how to ‘balance’ the tree using tree rotations to achieve fast operations.

Please feel free to leave any comments, suggestions or corrections! Here is the source code.

## Entropy: Expect a Surprise

Defining Surprise

Let’s suppose we have a system with $n$ states $X=\{x_1, \dots, x_n \}$ which happen with probabilities $p(x_i)$.  On the other hand we have our model for the system which encode our beliefs or predictions about the states’ probabilities $q(x_i)$.

How surprised would we be if $x_i$ happens? What properties should the surprise have? Let’s make a wish list:

• First the surprise should be a function $s: [0,1] \to \mathbb{R}$ that takes a probability to a real number representing the surprise.
• Secondly if $q(x_i)=0$, then we will be infinitely surprised, so $s(0) = \infty$ in this case.
• Similarly if $q(x_i)=1$, then we won’t be surprised at all, so $s(1)=0$.
• We would also like $s$ to be a nice function to work with, so let’s have it be continuously differentiable.

One final property is that I would like surprise to be additive or extensive , this means that for independent events $E,F$, I would like $s(q(E \cap F)) = s(q(E)q(F))$ to be $s(q(E))+s(q(F))$. This is the main difference between probability and surprise, surprises add up, whereas probabilities multiply. I think it easier to think about adding so this makes me happy. So in general we want $s(ab)=s(a)+s(b)$.

If surprise doesn’t seem to be additive to you, most likely because you have been reared on probabilities, you can substitute ‘surprise’ for ‘information’.

If you play around with these constraints you will find you need $s$ to be a negative logarithm, that is $s(x) =\log \frac{1}{x}$. If we choose $e$ a our base, as mathematicians prefer, we measure it in nats, if we work base 2,  as computer scientists like, we have bits. This is also referred to as the self-information.

Defining Entropy

A natural quantity to compute is

$\mathbb{E}s = \sum \limits _{i=1}^np(x_i)\log\frac{1}{q(x_i)} = \mathbb{E}_p(-\log q)$,

the expected surprise. This is only defined when $q(x_i)=0 \Rightarrow p(x_i)=0$, and whenever $p(x_i)=0$, by convention we let $0 \cdot \log 0 = 0 \cdot \log 0/0=0$. If this offends you then you should alter the definition to take the sum over the support of $p$!

Natural quantities deserve special notation and so we write

$H(p,q)$ for $\mathbb{E}s$,

and refer to it as the cross entropy.

In the special case where $p=q$, where the model matches the true probabilities, we call $H(p,p)$ the entropy of $X$ and write $H(p)$ or $H(X)$ for this quantity. Using Jensen’s inequality, we can show that the entropy is maximised iff $p(x_i)=\frac{1}{n}$ for each outcome, in which case $H(X)=\log n$.

To get a feel for this maximization of entropy, we plot the binary case where we have just two outcomes. So letting the probability of the first outcome be $p$, let’s see how the surprise changes with $p$, and below that how the entropy changes:

As you can see, you can make the first outcome very surprising or informative, but you pay for that by making the complementary outcome less surprising, the lower graph shows that the optimum compromise comes from having both outcomes be equally likely.

Intuitively any approximate model $q$ will increase the surprise so that $H(p) \le H(p,q)$.

The difference $H(p,q)-H(p)$  is then a measure of much more surprised we are when using the model, this quantity is written $d_{KL}(p \Vert q)$ and called the Kullback-Leibler divergence, which is a kind of oriented distance between distributions, notice that it is not symmetric and so not a metric.

Our intuition that $H(p) \le H(p,q)$ can then rewritten as $d_{KL}(p \Vert q) \ge 0$, and in this form it is known as Gibbs’ inequality.

Proof of Gibbs’ Inequality

This is pretty simple, since $\log(x)$ is a concave function, $-\log x=\log \frac{1}{x}$ is convex, and we can use Jensen’s inequality, writing $p_i:=p(x_i)$ and $q_i:=q(x_i)$ we have:

$d_{KL}(p\Vert q)=\sum \limits _{i=1}^n p_i \log\frac{p_i}{q_i}=-\sum \limits _{i=1}^n p_i \log\frac{q_i}{p_i} \ge - \log \left (\sum \limits _{i=1}^np_i\cdot \frac{q_i}{p_i} \right )=-\log(1)=0$.

Also notice that $d_{KL}(p\Vert q)=0$ if $p=q$ so the inequality is sharp, and moreover, since we are dealing with a strictly convex function the divergence is zero if and only if $p=q$.

Conditional Entropy

Now for two random variables $X,Y$ we can consider the entropy of $Y | X=x$, $Y$ conditioned upon $X=x$. We can call this $H(Y|X=x)$, now as ever we take the expectation to give $H(Y|X):=\mathbb{E}H(Y|X=x)$ the conditional entropy of $Y$ given $X$.

Including more information can only mean less surprise so we must have

$H(Y|X) \le H(Y)$.

We also have something called the chain rule:

$H(Y|X) = H(X,Y)-H(X)$,

where $H(X,Y):=\sum \limits _{x,y}p(x,y)\log (\frac{1}{p(x,y)})$ is the joint entropy.

So we have

$H(Y|X)=\sum \limits_{x}p(x)H(Y|X=x) = \sum \limits_{x}p(x) \sum \limits_{y}p(y|x)\log(\frac{1}{p(y|x)}) = \sum \limits_{x,y}p(y|x)p(x)\log(\frac{1}{p(y|x)})$,

and using the definition of conditional probability this becomes

$\sum \limits_{x,y}p(y,x) \log(\frac{1}{p(y|x)})= \sum \limits_{x,y}p(y,x) \log(\frac{p(x)}{p(y,x)})$

which is just the joint entropy minus

$H(X,Y) - \sum \limits_{x,y}p(y,x) \log (\frac{1}{p(x)}) = H(X,Y)- \sum \limits_{x}p(x)\log (\frac{1}{p(x)})$

and this is just the entropy, and so

$H(Y|X)=H(X,Y)-H(X)$,

as required.

Mutual Information

Similarly to the KL-divergence, we can test how much this inequality is observed, defining

$I(X,Y) = H(X)- H(X|Y)$,

which we call the mutual information of $X$ and $Y$.

In fact the analogy between mutual information and KL-divergence can be made precise if we write $p_{X,Y}$ for the joint density and $p_X, p_Y$ for the respective marginal densities, we have

$I(X,Y) = d_{KL}(p_{X,Y}\Vert p_X \cdot p_Y)$,

to prove this is not hard and so I won’t bother. Notice that Gibb’s inequality then implies what we guessed before, that conditioning does not lose information (or increase surprise).

What is the interpretation? Well $p_{X,Y}=p_X\cdot p_Y$ if and only if $X$ and $Y$ are independent. So we are measuring the how much more surprised we will be if we assume that $X,Y$ are independent, alternatively, how much information is lost by modelling $X,Y$ as being independent, so it is  a measure of dependence.

Fascinating! It is interesting to compare this with the correlation of $X$ and $Y$. This is also a measure of dependence, recording the extent to which large values of $X$ correspond to large values of $Y$ for positive correlation, or large values of $X$ correspond to small (negative) values of $Y$. It is possible to be uncorrelated but be dependent.

By contrast mutual information is totally oblivious to large and small values, it cares only about the probabilities and so in my opinion is much more beautiful. What’s more, by Gibb’s inequality, the mutual information is zero if and only if $X,Y$ are independent! So it is a purely informational measure of dependency, in that it fails to encode anything about how the dependency is seen from the perspective of the metric. So it really depends on the context which is more useful. For making estimates of random variables, the values are important, and so the correlation or covariance will likely be more useful.

An Example: Guess the Number

To illustrate the usefulness of entropy we will play a game. I will pick a number $y$ uniformly at random from the unit interval $I=[0,1]$.

You can then make a guess $x \in [0,1]$, and I will tell you whether $x \le y$ or $x > y$. After I tell you, you have narrowed down the location of the number to $[x,1]$ if $x \le y$ or $[0,x]$ if $x>y$.

If the aim is to minimize the expected length of the interval after a guess, we should guess $x=\frac{1}{2}$. This is very intuitive, but prove this we note that the expected length is

$(1-x)\cdot P(x \le y) + x \cdot P(x >y)$.

Since $y$ was chosen uniformly at random, $P(x \le y)=1-x$ and $P(x>y)=x$. So the expected length is

$L(x):= x^2 + (1-x)^2$.

Note that $L(x) = 2x^2 -2x +1$ and so $L'(x)=4x-2$, and $L''(x) = 4$. Therefore any extrema are minima, and since $x=1/2$ is the unique solution to $L'(x)=0$, we know that guessing $x=1/2$ minimizes the expected length of the interval.

Now if $S_x$ is the score I give  your guess,  a random variable taking values $\{0,1\}$ so that $P(S_x=1)=x$, then we have already seen that to maximize the entropy of $H(S_x)$ we should have each outcome be equally likely and so take $x=1/2$. So the idea is when performing an experiment, or asking a question, the better you are able to predict the answer, the less you learn, and vice versa. So we see that maximizing our entropy is the best strategy as we’ve proved using calculus.

What about if we are allowed to make $n$ guesses $x_1, \dots ,x_n$? If we choose to maximize our entropy with each guess, guessing halfway along each interval, then we will have narrowed $y$ down to an interval of width $2^{-n}$, but is this still the best strategy? Intuitively it does still seem like the best plan, but it isn’t quite as obvious as in the case of 1 guess.

Any deterministic strategy (that doesn’t repeat guesses!) can be described as narrowing $y$ down to $2^n$ intervals of lengths $p:=(p_1, \dots , p_{2^n})$ summing to $1$, and the probability of the strategy ending with an interval of length $p_i$ is $p_i$. The expected length of an interval is then $L_p =\sum \limits _{i=1}^{2^n}p_i^2 = = \Vert p \Vert^2$.

This suggests a geometric interpretation, what we are looking for is the point on the standard simplex $\triangle_{2^n-1}$ closest to the origin. Now if you read about projecting onto the standard simplex, you will see that we can immediately conclude that we should take $p=(2^{-n}, \dots, 2^{-n})$.

If you don’t believe that, what you might believe using your geometric intuition is that we are looking for a sphere $S_r$ around the origin with radius $r$ so that the sphere is tangent to the simplex, touching it at a single point. In the 1 guess case, this looks like this:

the sphere in this case is just a circle, shown in blue. The constraint set is shown in red, the set of points with nonnegative coordinates summing to one. You can see that if you start with a small circle and expand until it hits the red line segment, the circle will hit it at the point $(\frac{1}{2},\frac{1}{2})$.

So we are minimizing a function, the radius, subject to a constraint that the function be tangent to the set of points $q$ satisfying $\sum \limits _{i=1}^{2^n}q_i =1, \ q_i \ge 0$. This is precisely the situation encountered in the method of Lagrange multipliers, if you want to understand this idea I recommend checking out this blog post, Lagrangians for the Amnesiac.

In standard notation for this technique, let $f(x) = $ and $g(x) = -1 + \sum \limits _{i=1}^{2^n}x_i$. So we want to minimize $f$ subject to $g(x)=0$. The method implies that we then looking for a $\lambda$ so that

$\nabla f - \lambda \nabla g = 0$,

which is precisely what the geometry tells us – that $f$ needs to be tangent to the standard simplex.

Solving this gives

$2x_i - \lambda =0$ for each $i=1, \dots, 2^n$,

and so we must have

$x_i = \frac{\lambda}{2}$ for each $i$,

implying that $x_i$ is the same for each $i$, which forces $x_i=2^{-n}$.

So yet again the best policy is to maximize the entropy at each turn! I invite you to think about stochastic strategies.

Next Time

In the future we might look at how entropy is related to compression, or how the KL-divergence can be used to induce sparsity in an autoencoder.

## Neural Networks Part 2: Python Implementation

Ok so last time we introduced the feedforward neural network. We discussed how input gets fed forward to become output, and the backpropagation algorithm for learning the weights of the edges.

Today we will begin by showing how the model can be expressed using matrix notation, under the assumption that the neural network is fully connected, that is each neuron is connected to all the neurons in the next layer.

Once this is done we will give a Python implementation and test it out.

Matrix Notation For Neural Networks

Most of  this I learned from here.

In what follows, vectors are always thought of as columns, and so the transpose a row.

So first off we have $x \in \mathbb{R}^k$, our input vector, and $y \in \mathbb{R}^m$ our output vector.

Our neural network $N$ will have $N$ layers $L_1, \dots , L_n$ with $L_1$   the input layer, and $L_n$ the output layer.

For each layer $L_i$:

• write $n_i:=|L_i|$ for the number of neurons in that layer (not including the bias neurons);
• write $f_i: \mathbb{R}^{n_i} \to \mathbb{R}^{n_i}$ for the layer’s activation function;
• write $f_i': \mathbb{R}^{n_i} \to \mathbb{R}^{n_i}$ for vector of derivatives of the activation functions;
• write $I_i \in \mathbb{R}^{n_i}$ for the vector of inputs to the $ith$ layer;
• write $b_i \in \mathbb{R}^{n_{i}}$ for the vector of biases;
• write $O_i \in \mathbb{R}^{n_i}= f_i(I_i)$ for the vector of outputs from the $ith$ layer;
• write $O_i' \in \mathbb{R}^{n_i}= f_i'(I_i)$;
• write $W_i$ for the $n_{i+1} \times n_i$ real matrix containing the weights of the edges going from $L_i \to L_{i+1}$ (not including bias neurons), so that $W_iO_i + b_i = I_{i+1}$.

Feeding forward in vector notation

So to feedforward the input we have

$x \mapsto W_1x +b_1 \mapsto f_2(W_1x+b_1) \mapsto W_2f_2(W_1x)+b_2 \mapsto \dots \mapsto f_n(W_{n-1} f_{n-1}( \dots W_2f_2(W_1x) ))$.

And we call this output $f_N(x)$.

Backpropagation in vector notation

Now write $E_i \in \mathbb{R}^{n_i}$ for the error at the layer $i$ defined $E_i:=\frac{\partial e}{\partial I_i}$.

Then $E_n = O_n' \cdot (O_n-y)$, and $E_{i-1} = \left (W_{i-1}^TE_i\right ) \cdot O_{i-1}'$, where the multiplication $\cdot$ is component wise.

Write $\triangle W_i$ for the matrix with entries $(\frac{\partial e}{\partial w_{ij}})_{ij}$, for non-bias weights.

Then we have $\triangle W_i = \cdot E_{i+1} \cdot O_i^T$, the outer product of $E_{i+1}$ and $O_i$.

And for the biases write $\triangle b_i \in \mathbb{R}^{n_i}$ for the vector with $jth$ component $\frac{\partial e}{\partial (b_i)_j}$.

Then $\triangle b_i =E_{i+1}$.

Python Implementation

We will create a class NeuralNetwork and perform our calculations using matrices in numpy.

import numpy as np

class NeuralNetwork(object):
def __init__(self, X, y, parameters):
#Input data
self.X=X
#Output data
self.y=y
#Expect parameters to be a tuple of the form:
#    ((n_input,0,0), (n_hidden_layer_1, f_1, f_1'), ...,
#     (n_hidden_layer_k, f_k, f_k'), (n_output, f_o, f_o'))
self.n_layers = len(parameters)
#Counts number of neurons without bias neurons in each layer.
self.sizes = [layer[0] for layer in parameters]
#Activation functions for each layer.
self.fs =[layer[1] for layer in parameters]
#Derivatives of activation functions for each layer.
self.fprimes = [layer[2] for layer in parameters]
self.build_network()

def build_network(self):
#List of weight matrices taking the output of one layer to the input of the next.
self.weights=[]
#Bias vector for each layer.
self.biases=[]
#Input vector for each layer.
self.inputs=[]
#Output vector for each layer.
self.outputs=[]
#Vector of errors at each layer.
self.errors=[]
#We initialise the weights randomly, and fill the other vectors with 1s.
for layer in range(self.n_layers-1):
n = self.sizes[layer]
m = self.sizes[layer+1]
self.weights.append(np.random.normal(0,1, (m,n)))
self.biases.append(np.random.normal(0,1,(m,1)))
self.inputs.append(np.zeros((n,1)))
self.outputs.append(np.zeros((n,1)))
self.errors.append(np.zeros((n,1)))
#There are only n-1 weight matrices, so we do the last case separately.
n = self.sizes[-1]
self.inputs.append(np.zeros((n,1)))
self.outputs.append(np.zeros((n,1)))
self.errors.append(np.zeros((n,1)))

def feedforward(self, x):
#Propagates the input from the input layer to the output layer.
k=len(x)
x.shape=(k,1)
self.inputs[0]=x
self.outputs[0]=x
for i in range(1,self.n_layers):
self.inputs[i]=self.weights[i-1].dot(self.outputs[i-1])+self.biases[i-1]
self.outputs[i]=self.fs[i](self.inputs[i])
return self.outputs[-1]

def update_weights(self,x,y):
#Update the weight matrices for each layer based on a single input x and target y.
output = self.feedforward(x)
self.errors[-1]=self.fprimes[-1](self.outputs[-1])*(output-y)

n=self.n_layers-2
for i in xrange(n,0,-1):
self.errors[i] = self.fprimes[i](self.inputs[i])*self.weights[i].T.dot(self.errors[i+1])
self.weights[i] = self.weights[i]-self.learning_rate*np.outer(self.errors[i+1],self.outputs[i])
self.biases[i] = self.biases[i] - self.learning_rate*self.errors[i+1]
self.weights[0] = self.weights[0]-self.learning_rate*np.outer(self.errors[1],self.outputs[0])
self.biases[0] = self.biases[0] - self.learning_rate*self.errors[1]
def train(self,n_iter, learning_rate=1):
#Updates the weights after comparing each input in X with y
#repeats this process n_iter times.
self.learning_rate=learning_rate
n=self.X.shape[0]
for repeat in range(n_iter):
#We shuffle the order in which we go through the inputs on each iter.
index=list(range(n))
np.random.shuffle(index)
for row in index:
x=self.X[row]
y=self.y[row]
self.update_weights(x,y)

def predict_x(self, x):
return self.feedforward(x)

def predict(self, X):
n = len(X)
m = self.sizes[-1]
ret = np.ones((n,m))
for i in range(len(X)):
ret[i,:] = self.feedforward(X[i])
return ret

And we’re done! Now to test it we generate some synthetic data and supply the network with some activation functions.

def logistic(x):
return 1.0/(1+np.exp(-x))

def logistic_prime(x):
ex=np.exp(-x)
return ex/(1+ex)**2

def identity(x):
return x

def identity_prime(x):
return 1

First we will try to get it to approximate a sine curve.

#expit is a fast way to compute logistic using precomputed exp.
from scipy.special import expit
def test_regression(plots=False):
#First create the data.
n=200
X=np.linspace(0,3*np.pi,num=n)
X.shape=(n,1)
y=np.sin(X)
#We make a neural net with 2 hidden layers, 20 neurons in each, using logistic activation
#functions.
param=((1,0,0),(20, expit, logistic_prime),(20, expit, logistic_prime),(1,identity, identity_prime))
#Set learning rate.
rates=[0.05]
predictions=[]
for rate in rates:
N=NeuralNetwork(X,y,param)
N.train(4000, learning_rate=rate)
predictions.append([rate,N.predict(X)])
import matplotlib.pyplot as plt
fig, ax=plt.subplots(1,1)
if plots:
ax.plot(X,y, label='Sine', linewidth=2, color='black')
for data in predictions:
ax.plot(X,data[1],label="Learning Rate: "+str(data[0]))
ax.legend()
test_regression(True)

When I ran this it produced the following:

Next we will try a classification problem with a nonlinear decision boundary, how about being above and below the sine curve?

def test_classification(plots=False):
#Number samples
n=700

n_iter=1500
learning_rate=0.05

#Samples for true decision boundary plot
L=np.linspace(0,3*np.pi,num=n)
l = np.sin(L)

#Data inputs, training
X = np.random.uniform(0, 3*np.pi, size=(n,2))
X[:,1] *= 1/np.pi
X[:,1]-= 1

#Data inputs, testing
T = np.random.uniform(0, 3*np.pi, size=(n,2))
T[:,1] *= 1/np.pi
T[:,1] -= 1

#Data outputs
y = np.sin(X[:,0]) <= X[:,1]

#Fitting
param=((2,0,0),(30, expit, logistic_prime),(30, expit, logistic_prime),(1,expit, logistic_prime))
N=NeuralNetwork(X,y, param)
#Training
N.train(n_iter, learning_rate)
predictions_training=N.predict(X)
predictions_training= predictions_training <0.5
predictions_training= predictions_training[:,0]
#Testing
predictions_testing=N.predict(T)
predictions_testing= predictions_testing <0.5
predictions_testing= predictions_testing[:,0]

#Plotting
import matplotlib.pyplot as plt
fig, ax=plt.subplots(2,1)

#Training plot
#We plot the predictions of the neural net blue for class 0, red for 1.
ax[0].scatter(X[predictions_training,0], X[predictions_training,1], color='blue')
not_index = np.logical_not(predictions_training)
ax[0].scatter(X[not_index,0], X[not_index,1], color='red')
ax[0].set_xlim(0, 3*np.pi)
ax[0].set_ylim(-1,1)
#True decision boundary
ax[0].plot(L,l, color='black')
#Shade the areas according to how to they should be classified.
ax[0].fill_between(L, l,y2=-1, alpha=0.5)
ax[0].fill_between(L, l, y2=1, alpha=0.5, color='red')

#Testing plot
ax[1].scatter(T[predictions_testing,0], T[predictions_testing,1], color='blue')
not_index = np.logical_not(predictions_testing)
ax[1].scatter(T[not_index,0], T[not_index,1], color='red')
ax[1].set_xlim(0, 3*np.pi)
ax[1].set_ylim(-1,1)
ax[1].plot(L,l, color='black')
ax[1].fill_between(L, l,y2=-1, alpha=0.5)
ax[1].fill_between(L, l, y2=1, alpha=0.5, color='red')

test_classification()

When I ran this it looked like this.

The top plot shows the performance on the training data, and the bottom the performance of on the testing data. The points are colored according to the net’s predictions, whereas the areas are shaded according to the true classifications.

As you can see neural networks are capable of giving very good models, but the number of iterations and hidden nodes may be large. Less iterations are possible using good configurations of the network i terms of sizes and numbers of hidden layers, and choosing the learning rate well, but it is difficult to know how to choose these well.

It ended up being quicker to choose a larger number of hidden nodes, a large number of iterations, and a small learning rate, than to experiment finding a good choice. In the future I may write about how to use momentum to speed up training.

The code for this post is available here, please comment if anything is unclear or incorrect!