Last time we looked at the perceptron model for classification, which is similar to logistic regression, in that it uses a nonlinear transformation of a linear combination of the input features to give an output class.
A neural network generalizes this idea by conceiving of a network of nodes which each take as an input a linear combination of the outputs of other nodes, then use some activation function to turn this into an output, such as:
- the sigmoidal logistic function, as in logistic regression,
- the binary step function seen in the perceptron,
- the identity function as seen in linear regression,
- the sigmoidal .
A Feedforward Neural Network
Each node is called a neuron. There are three main kinds of neurons:
- Input neurons, one neuron for each feature, so if our input were vectors in , we would have 3 input neurons. These neurons have trivial activation functions and no incoming edges.
- Output neurons, one neuron for each output feature, so if our so if our output were vectors in , we would have 3 output neurons. These will typically have a nonlinear activation function, chosen so that the output range matches our training data. These neurons have no outgoing edges.
- Hidden neurons are in between input and output neurons and have both incoming and outgoing edges.
Since the graph is acyclic, we can think of it as being composed of layers, as in the following picture taken from wikipedia:
although note there can be many hidden layers. Each edge has a weight, a real number.
The neural network makes a prediction by feeding forward the input. To describe this we develop some notation. The discussion that follows is indebted to this very helpful page.
Let be a neural network and index the vertices of so that each is a vertex for .
For each let denote the set of indices of the targets of . That is the set of indices such for each there exists a directed edge from . We may also conflate with the set of indexed neurons if the context makes this unambiguous.
Similarly, for each let denote the set of indices of the sources of .
For each index , let denote the activation function of .
For each index , let be the output of . When the argument is understood, we may simply write .
For each directed edge define to be the weight of the edge.
Where the outputs are understood, for each index , define the input to be , the weighted sum of the outputs of the previous layer.
Feeding forward refers to how an input becomes an output or prediction.
As you might have guessed: feeding forward an input vector , the input and output to each input neuron is just the corresponding component of the vector .
From there the input to a neuron in the first hidden layer is simply the sum of the components of , weighted by the weights of the incoming edges to that hidden node.
You repeat this for each hidden node. Then the inputs to the hidden nodes are transformed via their activation functions to their outputs. The weighted sums of these outputs are then passed on as inputs to the next layer, and so on.
This is probably best understood through a simple example.
Here we have 4 neurons. The input is two dimensional, so we have two input neurons shown in blue. We have a single hidden layer with just one hidden neuron, shown in green. Finally we have a single output neuron shown in orange. Hence the output is one dimensional.
Choosing the Weights: Backpropagation
So we have seen how a neural network can make predictions, but how can we choose the weights so that the predictions are any good? The answer is the same as in logistic regression, we use gradient descent. In order to do this we need to use differentiable activation functions and choose an appropriate error function to minimize.
If is the input vector, and is the target vector, let be the prediction of the neural network. Then we define the error to be . This is the natural Euclidean norm of the difference between the target and the prediction. The factor of a half is there for aesthetical reasons, since we will be differentiating and gaining a factor of 2.
The trick to the algorithm is to figure out how the weights contribute to the error, and backpropagation is an elegant way of assigning credit/blame to each node and edge. We fix an input , and write for , and for each each edge we are seeking . Once we have found this, we set , where is the learning rate. So we are doing stochastic gradient descent as in the perceptron.
It is clear that an edge and its weight contribute to the error through the target ‘s input. So it makes sense to divide and conquer using the chain rule, that is
The first part of the product is worth giving a name, we denote it by and call it the error at the node.
The second part of the product is simple to calculate:
Returning to , we can calculate it recursively, assuming the errors for the neurons in have already been calculated.
Observe that can be thought of as a function of the variables for indexing any layer, and so using the chain rule we have
notice that if we assume that each layer is fully-connected to the next, then we could write the above succinctly as an inner product, and inner products can be calculated very efficiently, for instance using a GPU, so it is often faster to have more connections!
Together this gives
Since we have calculated the error at a node recursively, we had better deal with the base case – the output neurons. But this follows simply from the definition of to be , with the factor of 2 cancelling with the half as promised.
The final piece of the puzzle are called bias neurons, which I had left out before to simplify the discussion. Each layer, except the output layer, has a bias neuron. A bias neuron has no inputs, and its output is always 1. Effectively it just adds the weight of its outgoing edges to the inputs of the next layer.
These behave just like regular neurons, only when updating their weights, we note that since their input is effectively constant, they have , and so we need only calculate the error at a bias neuron to update its weights.
In the next installment I will show how to compactly represent the process of feeding forward and backpropagation using vector and matrix notation assuming the layers are fully connected. We will then use this to give a Python implementation.