Home » machine learning » Neural Networks Part 1: Feedingforward and Backpropagation

# Neural Networks Part 1: Feedingforward and Backpropagation

Introduction

Last time we looked at the perceptron model for classification, which is similar to logistic regression, in that it uses a nonlinear transformation of a linear combination of the input features to give an output class.

A neural network generalizes this idea by conceiving of a network of nodes which each take as an input a linear combination of the outputs of other nodes, then use some activation function to turn this into an output, such as:

A Feedforward Neural Network

A neural network is just a directed graph. We will be looking at feedforward neural networks, which are acyclic.

Each node is called a neuron. There are three main kinds of neurons:

• Input neurons, one neuron for each feature, so if our input were vectors in $\mathbb{R}^3$, we would have 3 input neurons. These neurons have trivial activation functions and no incoming edges.
• Output neurons, one neuron for each output feature, so if our so if our output were vectors in $\mathbb{R}^3$, we would have 3 output neurons. These will typically have a nonlinear activation function, chosen so that the output range matches our training data. These neurons have no outgoing edges.
• Hidden neurons are in between input and output neurons and have both incoming and outgoing edges.

Since the graph is acyclic, we can think of it as being composed of layers, as in the following picture taken from wikipedia:

although note there can be many hidden layers. Each edge has a weight, a real number.

The neural network makes a prediction by feeding forward the input. To describe this we develop some notation. The discussion that follows is indebted to this very helpful page.

Notation

Let $N$ be a neural network and $G$ index the vertices of $N$ so that each $N_i$ is a vertex for $i \in G$.

For each $i \in G$ let $T_i$ denote the set of indices of the targets of $N_i$. That is the set of indices $X \subset G$ such for each $j \in X$ there exists a directed edge $e_{ij}$ from $N_i \to N_j$. We may also conflate $T_i$ with the set of indexed neurons if the context makes this unambiguous.

Similarly, for each $i \in G$ let $S_i$ denote the set of indices of the sources of $N_i$.

For each index $i$, let $f_i: \mathbb{R} \to \mathbb{R}$ denote the activation function of $N_i$.

For each index $i$, let $O_i(x) \in \mathbb{R}:=f_i(x)$ be the output of $N_i$. When the argument $x$ is understood, we may simply write $O_i$.

For each directed edge $e_{ij}$ define $w_{ij} \in \mathbb{R}$ to be the weight of the edge.

Where the outputs are understood, for each index $i$, define the input $I_i$ to be $\sum \limits _{j \in S_i}w_{ji}O_j$, the weighted sum of the outputs of the previous layer.

Feeding Forward

Feeding forward refers to how an input becomes an output or prediction.

As you might have guessed: feeding forward an input vector $x$, the input and output to each input neuron is just the corresponding component of the vector $x$.

From there the input to a neuron in the first hidden layer is simply the sum of the components of $x$, weighted by the weights of the incoming edges to that hidden node.

You repeat this for each hidden node. Then the inputs to the hidden nodes are transformed via their activation functions to their outputs. The weighted sums of these outputs are then passed on as inputs to the next layer, and so on.

This is probably best understood through a simple example.

Here we have 4 neurons. The input $x$is two dimensional, so we have two input neurons shown in blue. We have a single hidden layer with just one hidden neuron, shown in green. Finally we have a single output neuron shown in orange. Hence the output $y$ is one dimensional.

Choosing the Weights: Backpropagation

So we have seen how a neural network can make predictions, but how can we choose the weights so that the predictions are any good? The answer is the same as in logistic regression, we use gradient descent. In order to do this we need to use differentiable activation functions and choose an appropriate error function to minimize.

If $x \in \mathbb{R}^k$ is the input vector, and $y \in \mathbb{R}^n$ is the target vector, let $f_N: \mathbb{R}^k \to \mathbb{R}^n$ be the prediction of the neural network. Then we define the error to be $e(x)= \frac{1}{2}(f_N(x)-y)^2$. This is the natural Euclidean norm of the difference between the target and the prediction. The factor of a half is there for aesthetical reasons, since we will be differentiating and gaining a factor of 2.

The trick to the algorithm is to figure out how the weights contribute to the error, and backpropagation is an elegant way of assigning credit/blame to each node and edge. We fix an input $x$, and write $e$ for $e(x)$, and for each each edge $e_{ij}$ we are seeking $\frac{\partial e}{\partial w_{ij}}$.  Once we have found this, we set $w_{ij} \mapsto w_{ij} - \lambda \frac{\partial e}{\partial w_{ij}}$, where $\lambda$ is the learning rate.  So we are doing stochastic gradient descent as in the perceptron.

It is clear that an edge $e_{ij}$ and its weight $w_{ij}$ contribute to the error through the target $N_j$‘s input. So it makes sense to divide and conquer using the chain rule, that is

$\frac{\partial e}{\partial w_{ij}} = \frac{\partial e}{\partial I_j} \cdot \frac{\partial I_j}{\partial w_{ij}}$.

The first part of the product is worth giving a name,  we denote it by $E_j:=\frac{\partial e}{\partial I_j}$ and call it the error at the node.

The second part of the product is simple to calculate:

$\frac{\partial E_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum \limits _{k \in S_j}w_{kj}O_k = O_i$

Returning to $E_j$, we can calculate it recursively, assuming the errors for the neurons in $T_j$ have already been calculated.

Observe that $e$ can be thought of as a function of the variables $O_k$ for $k$ indexing any layer, and so using the chain rule we have

$E_j = \frac{\partial e}{\partial I_j}$

$= \sum \limits_{k \in T_j} \frac{\partial e}{\partial I_k}\cdot \frac{\partial I_k}{\partial I_j}$

$= \sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial I_j}$

$=\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j} \cdot \frac{\partial O_j}{\partial I_j}$

$=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \frac{\partial I_k}{\partial O_j}$

$=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot \left ( \frac{\partial}{\partial O_j} \sum \limits _{l \in S_k}w_{lk}O_l \right )$

$=f_j'(I_j)\sum \limits_{k \in T_j} E_k \cdot w_{jk}$

notice that if we assume that each layer is fully-connected to the next, then we could write the above succinctly as an inner product, and inner products can be calculated very efficiently, for instance using a GPU, so it is often faster to have more connections!

Together this gives

$\frac{\partial e}{\partial w_{ij}}=O_i \cdot f_j'(I_j)\cdot\sum \limits_{k \in T_j} E_k \cdot w_{jk}$.

Since we have calculated the error at a node recursively, we had better deal with the base case –  the output neurons. But this follows simply from the definition of  $e$ to be $E_i=f'(I_i)(O_i-y_i)$, with the factor of 2 cancelling with the half as promised.

Bias Neurons

The final piece of the puzzle are called bias neurons, which I had left out before to simplify the discussion. Each layer, except the output layer, has a bias neuron. A bias neuron has no inputs, and its output is always 1. Effectively it just adds the weight of its outgoing edges to the inputs of the next layer.

These behave just like regular neurons, only when updating their weights, we note that since their input is effectively constant, they have $\frac{\partial I_j}{\partial w_{ij}}=1$, and so we need only calculate the error at a bias neuron to update its weights.

Next Time

In the next installment I will show how to compactly represent the process of feeding forward and backpropagation using vector and matrix notation assuming the layers are fully connected. We will then use this to give a Python implementation.