Recently I have finally upgraded my ancient laptop. Other than the ability for me to play the occasional video game, this means that I now have a dedicated Nvidia graphics card, in particular one that supports something called CUDA.

Now that I have the hardware, I thought that it would be fun to check out this python library theano. As I understand it, it has two very interesting aspects.

One is that it does symbolic computation, that is it represents variables as in mathematics, and their manipulations.

A variable in computer science is something like a integer or a string or a list etc…, a piece of information stored somewhere in memory with a name made up by the programmer, and the compiler takes care of associating the human understandable name to the address of the information in memory.

A variable in mathematics is like in , an indeterminate quantity that can be manipulated using arithmetic operations, or using calculus and so on. So what in mathematics we call a variable corresponds roughly to a function in computer science.

The difference between symbolic computation and its variables, and regular functions, is that the programming language is aware of the mathematical nature of a variable, such that when you apply operations to variables to create new variables, it does this in an efficient way. The best way to understand this is a simple example. Let’s suppose we have two functions . Now we can combine these functions by adding them to make a new function .

Now in python say, when you call with an argument, it will pass this argument to both and and then add the result and return it.

If we are doing symbolic computation, and were variables with , then the variable constantly. And so the computer does no work at all when using , because it knows that it must always be zero.

So one important aspect is that symbolic computation can simplify and hence often optimize functions automatically.

Another is that theano can automagically differentiate variables with respect to variables! This might not sound like a big deal, but it is. In machine learning, we often create a complicated function, like in logistic regression or neural networks, and then derive a helpful form for the derivative to do gradient descent to optimize it. In theano, *all you have to do is specify the objective function*, and it will let you do gradient descent automatically.

The second amazing thing about theano is that it compiles its theano functions into C, and so it often runs much much faster than native python or numpy. It can also make it easier to run computations on a gpu which can mean a 100x speedup for the right cases!

Hopefully I’ve made you curious, so click here for an installation guide. On this linux machine it was very simple. I am dual booting windows 8.1 and Ubuntu at the moment, it is very tricky to get the gpu aspect to work on windows, but here is a guide for the brave. To check that the GPU aspect is working, run the test code here.

**Introduction to Theano**

I recommend that you read the tutorials here, and also this excellent ipython worksheet here.

Let’s little play with the basic functionality.

import theano as th from theano import tensor as T #To make a variable, specify the type of variable from the tensor #library. #We can have scalars. x=T.dscalar(name='scalar_x') #Note that the name is just a string that theano will use #to communicate to us about the variable. print x print x.type #We can also have vectors. v = T.dvector(name='vector_v') print v print v.type #And matrices. A = T.dmatrix(name='matrix_A') print A print A.type #We can also make new variables using standard mathematical operations. x_2 = x*x print x_2 print x_2.type #We can also make python functions which return new variables. def power(variable, n): return variable**n x_10 = power(x,10) print x_10 print x_10.type #We can of course do standard linear algebra operations also. Av = T.dot(A,v) print Av print Av.type #Of course when a variable is actually evaluated, you must #ensure that the dimensions are correct, or you will #get an error. #To see the value of a variable for a particular value of the variables #comprising it, we make a theano function. f = th.function([A,v], [Av]) #The syntax is a list of input variables, and then a list of output variables. #The python code takes a little longer to run initially, because thenao #compiles the function into C, but thereafter it will run extremely fast. #Let's try using the function. import numpy as np m=np.ones((2,2)) print m vec = np.asarray([1,2]) print vec print f(m,vec) #Now we can try computing gradients. #First let's make a scalar variable by taking an inner product. w=T.dvector(name='w_vector') vTw = T.dot(v,w) #Now we take the gradient with respect to w. vTw_grad = T.grad(vTw,w) print vTw_grad print vTw_grad.type #Now let's test it. vec1 = np.asarray([1,2]) vec2 = np.asarray([0,0]) #To evaulate a variable given inputs, there is another syntax #in additon to creating a thenano function. print vTw_grad.eval({w:vec1,v:vec2}) #Next we discuss theano's shared variables, #these differ from regular theano variables in that #theano variables only have a value within a theano function #whereas shared theano variables have a value independent of #being called in a function. w=T.shared(name='shared_matrix', value=np.ones((2,3))) print w print w.type print w.get_value() #You can also set a shared variable's value. w.set_value(np.zeros((2,3))) print w.get_value() #You can also have theano functions update shared variables, #we illustrate this with a silly example that updates a matrix #like in gradient descent. x=T.dvector('x') wx=T.dot(w,x) cost = wx.mean() cost_grad = T.grad(cost, w) f=th.function(inputs=[x], outputs=[wx, cost], updates=[(w, w-0.1*cost_grad)]) #Notice the syntax of updates argument, should be a list #of two tuples of the form: (variable_to_be_updated, updated_variable). print f([1,1,1]) print w.get_value()

If you want to tinker with this, get the sourcecode here.

**Autoencoders**

I don’t want to go into too much detail here, but an autoencoder is a feedforward neural network with a single hidden layer with the twist being that its target output is exactly the same as its input. If the number of hidden neurons is less than the number of input/output neurons, then the network will be forced to learn some sort of compression.

Even when the number of hidden neurons is large, there are various tricks in cost functions you can use to ensure that the network does something more interesting than learning the identity function.

The hope of an autoencoder is that it learns some structure in the data, so that the hidden layer can be thought of as a feature space that one can use for a supervised learning algorithm.

Basically then we have the hidden layer, then , where are bias vectors, are matrices, and are element-wise non-linearities like the logistic function.

A simplification that is often made is to enforce . This gives the model less free parameters to learn, and so is a form of regularization. It also uses less memory. This is referred to as an autoencoder with tied weights.

**Autoencoders with Theano**

Thanks to automatic differentiation, backprogation is effortless with theano, and thanks to gpu computation, much faster also! I recommend having a look at my python implementation of backpropagation just so you can see the effort saved. This implementation benefited greatly from this tutorial.

import numpy as np import theano as th from theano import tensor as T from numpy import random as rng class AutoEncoder(object): def __init__(self, X, hidden_size, activation_function, output_function): #X is the data, an m x n numpy matrix #where rows correspond to datapoints #and columns correspond to features. assert type(X) is np.ndarray assert len(X.shape)==2 self.X=X self.X=th.shared(name='X', value=np.asarray(self.X, dtype=th.config.floatX),borrow=True) #The config.floatX and borrow=True stuff is to get this to run #fast on the gpu. I recommend just doing this without thinking about #it until you understand the code as a whole, then learning more #about gpus and theano. self.n = X.shape[1] self.m = X.shape[0] #Hidden_size is the number of neurons in the hidden layer, an int. assert type(hidden_size) is int assert hidden_size > 0 self.hidden_size=hidden_size initial_W = np.asarray(rng.uniform( low=-4 * np.sqrt(6. / (self.hidden_size + self.n)), high=4 * np.sqrt(6. / (self.hidden_size + self.n)), size=(self.n, self.hidden_size)), dtype=th.config.floatX) self.W = th.shared(value=initial_W, name='W', borrow=True) self.b1 = th.shared(name='b1', value=np.zeros(shape=(self.hidden_size,), dtype=th.config.floatX),borrow=True) self.b2 = th.shared(name='b2', value=np.zeros(shape=(self.n,), dtype=th.config.floatX),borrow=True) self.activation_function=activation_function self.output_function=output_function def train(self, n_epochs=100, mini_batch_size=1, learning_rate=0.1): index = T.lscalar() x=T.matrix('x') params = [self.W, self.b1, self.b2] hidden = self.activation_function(T.dot(x, self.W)+self.b1) output = T.dot(hidden,T.transpose(self.W))+self.b2 output = self.output_function(output) #Use cross-entropy loss. L = -T.sum(x*T.log(output) + (1-x)*T.log(1-output), axis=1) cost=L.mean() updates=[] #Return gradient with respect to W, b1, b2. gparams = T.grad(cost,params) #Create a list of 2 tuples for updates. for param, gparam in zip(params, gparams): updates.append((param, param-learning_rate*gparam)) #Train given a mini-batch of the data. train = th.function(inputs=[index], outputs=[cost], updates=updates, givens={x:self.X[index:index+mini_batch_size,:]}) import time start_time = time.clock() for epoch in xrange(n_epochs): print "Epoch:",epoch for row in xrange(0,self.m, mini_batch_size): train(row) end_time = time.clock() print "Average time per epoch=", (end_time-start_time)/n_epochs def get_hidden(self,data): x=T.dmatrix('x') hidden = self.activation_function(T.dot(x,self.W)+self.b1) transformed_data = th.function(inputs=[x], outputs=[hidden]) return transformed_data(data) def get_weights(self): return [self.W.get_value(), self.b1.get_value(), self.b2.get_value()]

And that is it! Now we should test this puppy out, so let’s find some data.

**MNIST**

The MNIST data is famous in machine learning circles, it consists of single handwritten digits. Nearly every paper written on neural networks and so on tests their contribution on the data. The task is to classify the digits, but we will just test our autoencoder. First we will steal a very nice helper function to load the data set, this will download it for you if you don’t have it already. You can find it here.

import cPickle import gzip import os

def load_data(dataset): ''' Loads the dataset :type dataset: string :param dataset: the path to the dataset (here MNIST) ''' ############# # LOAD DATA # ############# # Download the MNIST dataset if it is not present data_dir, data_file = os.path.split(dataset) if data_dir == "" and not os.path.isfile(dataset): # Check if dataset is in the data directory. new_path = os.path.join(os.path.split(__file__)[0], "..", "data", dataset) if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz': dataset = new_path if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz': import urllib origin = 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz' print 'Downloading data from %s' % origin urllib.urlretrieve(origin, dataset) print '... loading data' # Load the dataset f = gzip.open(dataset, 'rb') train_set, valid_set, test_set = cPickle.load(f) f.close() #train_set, valid_set, test_set format: tuple(input, target) #input is an numpy.ndarray of 2 dimensions (a matrix) #which row's correspond to an example. target is a #numpy.ndarray of 1 dimensions (vector)) that have the same length as #the number of rows in the input. It should give the target #target to the example with the same index in the input. return (train_set, valid_set, test_set) path="/home/harri/Dropbox/Work/Blogs/triangleinequality/Theano/data/mnist.pkl.gz" data=load_data(path)

We also want a plotting function to visualize the features learned, I won’t explain this here, but if you google “visualizing features of neural network” you can learn plenty about it.

def plot_first_k_numbers(X,k): from matplotlib import mpl,pyplot m=X.shape[0] k=min(m,k) j = int(round(k / 10.0)) fig, ax = pyplot.subplots(j,10) for i in range(k): w=X[i,:] w=w.reshape(28,28) ax[i/10, i%10].imshow(w,cmap=pyplot.cm.gist_yarg, interpolation='nearest', aspect='equal') ax[i/10, i%10].axis('off') pyplot.tick_params(\ axis='x', # changes apply to the x-axis which='both', # both major and minor ticks are affected bottom='off', # ticks along the bottom edge are off top='off', # ticks along the top edge are off labelbottom='off') pyplot.tick_params(\ axis='y', # changes apply to the x-axis which='both', # both major and minor ticks are affected left='off', right='off', # ticks along the top edge are off labelleft='off') fig.show()

Now we put it together.

def m_test(data): X=data[0][0] activation_function = T.nnet.sigmoid output_function=activation_function A = AutoEncoder(X, 500, activation_function, output_function) A.train(20,20) W=np.transpose(A.get_weights()[0]) plot_first_k_numbers(W, 100, False) plot_first_k_numbers(W,100, True) m_test(data)

Running this, with 500 hidden units for 20 epochs with minibatch size of 20, the following features were learned:

Nothing too interesting yet, in order to get more useful features we could change the cost function to encourage sparsity, or introduce noise into the inputs, or use a different activation function. Hopefully this has been helpful for those just starting out with theano. Here is the source code.

[…] Theano, Autoencoders and MNIST […]

[…] iteration over tensors and recurrences. If you are a complete beginner to theano, check out my previous post. It seemed a bit obscure when I first started trying to understand it, so I thought I would share […]