Home » machine learning » Theano, Autoencoders and MNIST

# Theano, Autoencoders and MNIST

Recently I have finally upgraded my ancient laptop. Other than the ability for me to play the occasional video game, this means that I now have a dedicated Nvidia graphics card, in particular one that supports something called CUDA.

Now that I have the hardware, I thought that it would be fun to check out this python library theano. As I understand it, it has two very interesting aspects.

One is that it does symbolic computation, that is it represents variables as in mathematics, and their manipulations.

A variable in computer science is something like a integer or a string or  a list etc…, a piece of information stored somewhere in memory with a name made up by the programmer, and the compiler takes care of associating the human understandable name to the address of the information in memory.

A variable in mathematics is like $x$ in $x^2 + x + 1$, an indeterminate quantity that can be manipulated using arithmetic operations, or using calculus and so on. So what in mathematics we call a variable corresponds roughly to a function in computer science.

The difference between symbolic computation and its variables, and regular functions, is that the programming language is aware of the mathematical nature of a variable, such that when you apply operations to variables to create new variables, it does this in an efficient way. The best way to understand this is a simple example. Let’s suppose we have two functions $f(x) =x, g(x) = -x$. Now we can combine these functions by adding them to make a new function $h(x) = f(x) + g(x)$.

Now in python say, when you call $h$ with an argument, it will pass this argument to both $f$ and $g$ and then add the result and return it.

If we are doing symbolic computation, and $f,g$ were variables with $f=-g$, then the variable $h=f+g=0$ constantly. And so the computer does no work at all when using $h$, because it knows that it must always be zero.

So one important aspect is that symbolic computation can simplify and hence often optimize functions automatically.

Another is that theano can automagically differentiate variables with respect to variables! This might not sound like a big deal, but it is. In machine learning, we often create a complicated function, like in logistic regression or neural networks, and then derive a helpful form for the derivative to do gradient descent to optimize it. In theano, all you have to do is specify the objective function, and it will let you do gradient descent automatically.

The second amazing thing about theano is that it compiles its theano functions into C, and so it often runs much much faster than native python or numpy. It can also make it easier to run computations on a gpu which can mean a 100x speedup for the right cases!

Hopefully I’ve made you curious, so  click here for an installation guide. On this linux machine it was very simple. I am dual booting windows 8.1 and Ubuntu at the moment, it is very tricky to get the gpu aspect to work on windows, but here is a guide for the brave. To check that the GPU aspect is working, run the test code here.

Introduction to Theano

I recommend that you read the tutorials here, and also this excellent ipython worksheet here.

Let’s little play with the basic functionality.

```import theano as th
from theano import tensor as T

#To make a variable, specify the type of variable from the tensor
#library.

#We can have scalars.
x=T.dscalar(name='scalar_x')

#Note that the name is just a string that theano will use
#to communicate to us about the variable.
print x
print x.type
#We can also have vectors.
v = T.dvector(name='vector_v')
print v
print v.type
#And matrices.
A = T.dmatrix(name='matrix_A')
print A
print A.type

#We can also make new variables using standard mathematical operations.
x_2 = x*x
print x_2
print x_2.type

#We can also make python functions which return new variables.
def power(variable, n):
return variable**n
x_10 = power(x,10)
print x_10
print x_10.type

#We can of course do standard linear algebra operations also.
Av = T.dot(A,v)
print Av
print Av.type

#Of course when a variable is actually evaluated, you must
#ensure that the dimensions are correct, or you will
#get an error.

#To see the value of a variable for a particular value of the variables
#comprising it, we make a theano function.
f = th.function([A,v], [Av])
#The syntax is a list of input variables, and then a list of output variables.

#The python code takes a little longer to run initially, because thenao
#compiles the function into C, but thereafter it will run extremely fast.

#Let's try using the function.
import numpy as np
m=np.ones((2,2))
print m
vec = np.asarray([1,2])
print vec
print f(m,vec)

#Now we can try computing gradients.
#First let's make a scalar variable by taking an inner product.
w=T.dvector(name='w_vector')
vTw = T.dot(v,w)

#Now we take the gradient with respect to w.

#Now let's test it.
vec1 = np.asarray([1,2])
vec2 = np.asarray([0,0])

#To evaulate a variable given inputs, there is another syntax
#in additon to creating a thenano function.

#Next we discuss theano's shared variables,
#these differ from regular theano variables in that
#theano variables only have a value within a theano function
#whereas shared theano variables have a value independent of
#being called in a function.
w=T.shared(name='shared_matrix', value=np.ones((2,3)))
print w
print w.type
print w.get_value()

#You can also set a shared variable's value.
w.set_value(np.zeros((2,3)))
print w.get_value()

#You can also have theano functions update shared variables,
#we illustrate this with a silly example that updates a matrix
x=T.dvector('x')
wx=T.dot(w,x)
cost = wx.mean()

#Notice the syntax of updates argument, should be a list
#of two tuples of the form: (variable_to_be_updated, updated_variable).
print f([1,1,1])
print w.get_value()
```

If you want to tinker with this, get the sourcecode here.

Autoencoders

I don’t want to go into too much detail here, but an autoencoder is a feedforward neural network with a single hidden layer with the twist being that its target output is exactly the same as its input. If the number of hidden neurons is less than the number of input/output neurons, then the network will be forced to learn some sort of compression.

Even when the number of hidden neurons is large, there are various tricks in cost functions you can use to ensure that the network does something more interesting than learning the identity function.

The hope of an autoencoder is that it learns some structure in the data, so that the hidden layer can be thought of as a feature space that one can use for a supervised learning algorithm.

Basically then we have $x \mapsto f_1(W_1x+b_1) =h$ the hidden layer, then $h \mapsto f_2(W_2h+b_2)$, where $b_1, b_2$ are bias vectors, $W_1, W_2$ are matrices, and $f_1, f_2$ are element-wise non-linearities like the logistic function.

A simplification that is often made is to enforce $W_1 = W_2^T$. This gives the model less free parameters to learn, and so is a form of regularization. It also uses less memory. This is referred to as an autoencoder with tied weights.

Autoencoders with Theano

Thanks to automatic differentiation, backprogation is effortless with theano, and thanks to gpu computation, much faster also! I recommend having a look at my python implementation of backpropagation just so you can see the effort saved. This implementation benefited greatly from this tutorial.

```import numpy as np
import theano as th
from theano import tensor as T
from numpy import random as rng

class AutoEncoder(object):
def __init__(self, X, hidden_size, activation_function,
output_function):
#X is the data, an m x n numpy matrix
#where rows correspond to datapoints
#and columns correspond to features.
assert type(X) is np.ndarray
assert len(X.shape)==2
self.X=X
self.X=th.shared(name='X', value=np.asarray(self.X,
dtype=th.config.floatX),borrow=True)
#The config.floatX and borrow=True stuff is to get this to run
#fast on the gpu. I recommend just doing this without thinking about
#it until you understand the code as a whole, then learning more
self.n = X.shape[1]
self.m = X.shape[0]
#Hidden_size is the number of neurons in the hidden layer, an int.
assert type(hidden_size) is int
assert hidden_size > 0
self.hidden_size=hidden_size
initial_W = np.asarray(rng.uniform(
low=-4 * np.sqrt(6. / (self.hidden_size + self.n)),
high=4 * np.sqrt(6. / (self.hidden_size + self.n)),
size=(self.n, self.hidden_size)), dtype=th.config.floatX)
self.W = th.shared(value=initial_W, name='W', borrow=True)
self.b1 = th.shared(name='b1', value=np.zeros(shape=(self.hidden_size,),
dtype=th.config.floatX),borrow=True)
self.b2 = th.shared(name='b2', value=np.zeros(shape=(self.n,),
dtype=th.config.floatX),borrow=True)
self.activation_function=activation_function
self.output_function=output_function

def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.1):
index = T.lscalar()
x=T.matrix('x')
params = [self.W, self.b1, self.b2]
hidden = self.activation_function(T.dot(x, self.W)+self.b1)
output = T.dot(hidden,T.transpose(self.W))+self.b2
output = self.output_function(output)

#Use cross-entropy loss.
L = -T.sum(x*T.log(output) + (1-x)*T.log(1-output), axis=1)
cost=L.mean()

#Return gradient with respect to W, b1, b2.

#Create a list of 2 tuples for updates.
for param, gparam in zip(params, gparams):

#Train given a mini-batch of the data.
givens={x:self.X[index:index+mini_batch_size,:]})

import time
start_time = time.clock()
for epoch in xrange(n_epochs):
print "Epoch:",epoch
for row in xrange(0,self.m, mini_batch_size):
train(row)
end_time = time.clock()
print "Average time per epoch=", (end_time-start_time)/n_epochs

def get_hidden(self,data):
x=T.dmatrix('x')
hidden = self.activation_function(T.dot(x,self.W)+self.b1)
transformed_data = th.function(inputs=[x], outputs=[hidden])
return transformed_data(data)

def get_weights(self):
return [self.W.get_value(), self.b1.get_value(), self.b2.get_value()]
```

And that is it! Now we should test this puppy out, so let’s find some data.

MNIST

The MNIST data is famous in machine learning circles, it consists of single handwritten digits. Nearly every paper written on neural networks and so on tests their contribution on the data. The task is to classify the digits, but we will just test our autoencoder. First we will steal a very nice helper function to load the data set, this will download it for you if you don’t have it already. You can find it here.

```import cPickle
import gzip
import os
```
```

:type dataset: string
:param dataset: the path to the dataset (here MNIST)
'''

#############
#############

data_dir, data_file = os.path.split(dataset)
if data_dir == "" and not os.path.isfile(dataset):
# Check if dataset is in the data directory.
new_path = os.path.join(os.path.split(__file__)[0], "..", "data", dataset)
if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz':
dataset = new_path

if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz':
import urllib
origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
urllib.urlretrieve(origin, dataset)

f = gzip.open(dataset, 'rb')
f.close()
#train_set, valid_set, test_set format: tuple(input, target)
#input is an numpy.ndarray of 2 dimensions (a matrix)
#which row's correspond to an example. target is a
#numpy.ndarray of 1 dimensions (vector)) that have the same length as
#the number of rows in the input. It should give the target
#target to the example with the same index in the input.

return (train_set, valid_set, test_set)

path="/home/harri/Dropbox/Work/Blogs/triangleinequality/Theano/data/mnist.pkl.gz"

```

We also want a plotting function to visualize the features learned, I won’t explain this here, but if you google “visualizing features of neural network” you can learn plenty about it.

```def plot_first_k_numbers(X,k):
from matplotlib import mpl,pyplot
m=X.shape[0]
k=min(m,k)
j = int(round(k / 10.0))

fig, ax = pyplot.subplots(j,10)

for i in range(k):

w=X[i,:]

w=w.reshape(28,28)
ax[i/10, i%10].imshow(w,cmap=pyplot.cm.gist_yarg,
interpolation='nearest', aspect='equal')
ax[i/10, i%10].axis('off')

pyplot.tick_params(\
axis='x',          # changes apply to the x-axis
which='both',      # both major and minor ticks are affected
bottom='off',      # ticks along the bottom edge are off
top='off',         # ticks along the top edge are off
labelbottom='off')
pyplot.tick_params(\
axis='y',          # changes apply to the x-axis
which='both',      # both major and minor ticks are affected
left='off',
right='off',    # ticks along the top edge are off
labelleft='off')

fig.show()
```

Now we put it together.

```def m_test(data):
X=data[0][0]
activation_function = T.nnet.sigmoid
output_function=activation_function
A = AutoEncoder(X, 500, activation_function, output_function)
A.train(20,20)
W=np.transpose(A.get_weights()[0])
plot_first_k_numbers(W, 100, False)
plot_first_k_numbers(W,100, True)

m_test(data)
```

Running this, with 500 hidden units for 20 epochs with minibatch size of 20, the following features were learned:

Nothing too interesting yet, in order to get more useful features we could change the cost function to encourage sparsity, or introduce noise into the inputs, or use a different activation function. Hopefully this has been helpful for those just starting out with theano. Here is the source code.