Neural Networks Part 1: Feeding Forward and Backpropagation


Last time we looked at the perceptron model for classification, which is similar to logistic regression, in that it uses a nonlinear transformation of a linear combination of the input features to give an output class.

A neural network generalizes this idea by conceiving of a network of nodes, each of which takes as input a linear combination of the outputs of other nodes, then uses some activation function to turn this into an output.

A Feedforward Neural Network

A neural network is just a directed graph. We will be looking at feedforward neural networks, which are acyclic.

Each node is called a neuron. There are three main kinds of neurons:

  • Input neurons, one neuron for each feature, so if our input were vectors in \mathbb{R}^3, we would have 3 input neurons. These neurons have trivial activation functions and no incoming edges.
  • Output neurons, one neuron for each output feature, so if our output were vectors in \mathbb{R}^3, we would have 3 output neurons. These will typically have a nonlinear activation function, chosen so that the output range matches our training data. These neurons have no outgoing edges.
  • Hidden neurons are in between input and output neurons and have both incoming and outgoing edges.

Since the graph is acyclic, we can think of it as being composed of layers, as in the following picture taken from Wikipedia:

although note there can be many hidden layers. Each edge has a weight, a real number.

The neural network makes a prediction by feeding forward the input. To describe this we develop some notation. The discussion that follows is indebted to this very helpful page.


Let N be a neural network and G index the vertices of N so that each N_i is a vertex for i \in G.

For each i \in G let T_i denote the set of indices of the targets of N_i. That is, the set of indices X \subset G such that for each j \in X there exists a directed edge e_{ij} from N_i \to N_j. We may also conflate T_i with the set of indexed neurons when the context makes this unambiguous.

Similarly, for each i \in G let S_i denote the set of indices of the sources of N_i.

For each index i, let f_i: \mathbb{R} \to \mathbb{R} denote the activation function of N_i.

For each index i, let O_i(x) := f_i(x) \in \mathbb{R} be the output of N_i. When the argument x is understood, we may simply write O_i.

For each directed edge e_{ij} define w_{ij} \in \mathbb{R} to be the weight of the edge.

Where the outputs are understood, for each index i, define the input I_i to be \sum \limits _{j \in S_i}w_{ji}O_j, the weighted sum of the outputs of the previous layer.
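To make the definition of the input I_i concrete, here is a minimal Python sketch; the dictionary-based representation of the graph (a `sources` map and a `weights` map keyed by edge) is purely illustrative and not from the post:

```python
def weighted_input(i, sources, weights, outputs):
    """Compute I_i = sum over j in S_i of w_{ji} * O_j,
    the weighted sum of the outputs of the source neurons."""
    return sum(weights[(j, i)] * outputs[j] for j in sources[i])

# Tiny example: node 2 receives edges from nodes 0 and 1.
sources = {2: [0, 1]}
weights = {(0, 2): 0.5, (1, 2): -1.0}
outputs = {0: 2.0, 1: 1.0}
print(weighted_input(2, sources, weights, outputs))  # 0.5*2.0 + (-1.0)*1.0 = 0.0
```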

Feeding Forward

Feeding forward refers to how an input becomes an output or prediction.

As you might have guessed, when feeding forward an input vector x, the input and output of each input neuron are just the corresponding component of x.

From there the input to a neuron in the first hidden layer is simply the sum of the components of x, weighted by the weights of the incoming edges to that hidden node.

You repeat this for each hidden node. Then the inputs to the hidden nodes are transformed via their activation functions to their outputs. The weighted sums of these outputs are then passed on as inputs to the next layer, and so on.

This is probably best understood through a simple example.

Here we have 4 neurons. The input x is two-dimensional, so we have two input neurons, shown in blue. We have a single hidden layer with just one hidden neuron, shown in green. Finally, we have a single output neuron, shown in orange. Hence the output y is one-dimensional.
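This four-neuron network can be fed forward in a few lines of Python. This is a hedged sketch: the post does not fix a particular activation function, so the sigmoid used here for the hidden and output neurons is an assumption.

```python
import math

def sigmoid(t):
    """A common choice of nonlinear activation function."""
    return 1.0 / (1.0 + math.exp(-t))

def feed_forward(x, w_ih, w_ho):
    """Feed a 2D input through: 2 input neurons -> 1 hidden -> 1 output.

    w_ih holds the weights on the two edges into the hidden neuron,
    w_ho the weight on the edge from the hidden to the output neuron.
    """
    # Input neurons: trivial activation, so their outputs are the components of x.
    i_hidden = w_ih[0] * x[0] + w_ih[1] * x[1]  # weighted sum into the hidden neuron
    o_hidden = sigmoid(i_hidden)                # hidden neuron's output
    i_output = w_ho * o_hidden                  # weighted sum into the output neuron
    return sigmoid(i_output)                    # the prediction y

y = feed_forward([1.0, 2.0], w_ih=[0.5, -0.25], w_ho=1.0)
```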

Choosing the Weights: Backpropagation

So we have seen how a neural network can make predictions, but how can we choose the weights so that the predictions are any good? The answer is the same as in logistic regression: we use gradient descent. In order to do this we need to use differentiable activation functions and choose an appropriate error function to minimize.

If x \in \mathbb{R}^k is the input vector, and y \in \mathbb{R}^n is the target vector, let f_N: \mathbb{R}^k \to \mathbb{R}^n be the prediction of the neural network. Then we define the error to be e(x)= \frac{1}{2}(f_N(x)-y)^2. This is half the squared Euclidean norm of the difference between the target and the prediction. The factor of a half is there for aesthetic reasons, since we will be differentiating and gaining a factor of 2.
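Written out directly, the error function is a one-liner (the function name is just for illustration):

```python
def squared_error(prediction, target):
    """e(x) = 1/2 * ||f_N(x) - y||^2, summed over output components."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(prediction, target))

print(squared_error([1.0, 3.0], [0.0, 1.0]))  # 0.5 * (1 + 4) = 2.5
```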

The trick to the algorithm is to figure out how the weights contribute to the error, and backpropagation is an elegant way of assigning credit/blame to each node and edge. We fix an input x, write e for e(x), and for each edge e_{ij} we seek \frac{\partial e}{\partial w_{ij}}. Once we have found this, we set w_{ij} \mapsto w_{ij} - \lambda \frac{\partial e}{\partial w_{ij}}, where \lambda is the learning rate. So we are doing stochastic gradient descent, as in the perceptron.
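The update rule itself is a one-line step; here is a minimal sketch, with the weight, gradient, and learning rate as placeholder arguments:

```python
def sgd_step(w, grad, lr):
    """One stochastic gradient descent update: w <- w - lambda * de/dw."""
    return w - lr * grad

print(sgd_step(1.0, 0.5, 0.1))  # step downhill from w = 1.0
```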

It is clear that an edge e_{ij} and its weight w_{ij} contribute to the error through the target N_j's input. So it makes sense to divide and conquer using the chain rule, that is

\frac{\partial e}{\partial w_{ij}} = \frac{\partial e}{\partial I_j} \cdot \frac{\partial I_j}{\partial w_{ij}}.

The first part of the product is worth giving a name: we denote it by E_j:=\frac{\partial e}{\partial I_j} and call it the error at the node.

The second part of the product is simple to calculate:

\frac{\partial I_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum \limits _{k \in S_j}w_{kj}O_k = O_i

Returning to E_j, we can calculate it recursively, assuming the errors for the neurons in T_j have already been calculated.

Observe that e can be thought of as a function of the variables I_k for k indexing any layer, and so using the chain rule we have

E_j = \frac{\partial e}{\partial I_j}

= \sum \limits_{k \in T_j} \frac{\partial e}{\partial I_k}\cdot \frac{\partial I_k}{\partial I_j}

= \sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial I_j}

=\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j} \cdot \frac{\partial O_j}{\partial I_j}

=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j}

=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \left ( \frac{\partial}{\partial O_j} \sum \limits _{l \in S_k}w_{lk}O_l \right )

=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot w_{jk}

Notice that if we assume each layer is fully connected to the next, then we could write the above succinctly as an inner product. Inner products can be calculated very efficiently, for instance on a GPU, so it is often faster to have more connections!

Together this gives

\frac{\partial e}{\partial w_{ij}}=O_i \cdot f_j'(I_j)\cdot\sum \limits_{k \in T_j} E_k \cdot w_{jk}.
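Putting the base case and the recursion together, backpropagation on the small 2-input/1-hidden/1-output network from the earlier example can be sketched as follows. As before, the sigmoid activations are an assumption (the post leaves the activation function general), and the function name is illustrative:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def backprop_gradients(x, y, w_ih, w_ho):
    """Return (de/dw for the two input->hidden edges, de/dw for hidden->output)."""
    # Feed forward, keeping the intermediate inputs and outputs.
    i_h = w_ih[0] * x[0] + w_ih[1] * x[1]  # input to the hidden neuron
    o_h = sigmoid(i_h)                     # output of the hidden neuron
    i_o = w_ho * o_h                       # input to the output neuron
    o_o = sigmoid(i_o)                     # the network's prediction
    # Base case at the output neuron: E = f'(I) * (O - y).
    # For the sigmoid, f'(t) = f(t) * (1 - f(t)).
    e_out = o_o * (1.0 - o_o) * (o_o - y)
    # Recursive case at the hidden neuron: E_j = f_j'(I_j) * sum_k E_k * w_jk.
    e_hid = o_h * (1.0 - o_h) * e_out * w_ho
    # Finally, de/dw_ij = O_i * E_j for each edge.
    return [x[0] * e_hid, x[1] * e_hid], o_h * e_out
```

A quick way to sanity-check such a derivation is to compare these gradients against finite differences of the error function.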

Since we have calculated the error at a node recursively, we had better deal with the base case: the output neurons. For an output neuron N_i, this follows directly from the definition of e, giving E_i=f_i'(I_i)(O_i-y_i), with the factor of 2 cancelling the half as promised.

Bias Neurons

The final piece of the puzzle is the bias neuron, which I had left out before to simplify the discussion. Each layer, except the output layer, has a bias neuron. A bias neuron has no inputs, and its output is always 1. Effectively it just adds the weight of its outgoing edges to the inputs of the next layer.

These behave just like regular neurons, except when updating their weights: since a bias neuron's output is the constant 1, we have \frac{\partial I_j}{\partial w_{ij}}=1, and so \frac{\partial e}{\partial w_{ij}}=E_j. We need only the error at the target node to update a bias weight.
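In code this makes bias updates especially simple; here is a hedged sketch, where `error_at_target` stands for E_j at the neuron the bias edge points into:

```python
def update_bias_weight(w_bias, error_at_target, lr):
    """Gradient step for a bias edge: since the bias output O_i = 1,
    de/dw_ij = E_j, so the update is w <- w - lambda * E_j."""
    return w_bias - lr * error_at_target
```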

Next Time

In the next installment I will show how to compactly represent the process of feeding forward and backpropagation using vector and matrix notation assuming the layers are fully connected. We will then use this to give a Python implementation.


