**Defining Surprise**

Let’s suppose we have a system with states which happen with probabilities . On the other hand we have our model for the system which encode our beliefs or predictions about the states’ probabilities .

How surprised would we be if happens? What properties should the surprise have? Let’s make a wish list:

- First the surprise should be a function that takes a probability to a real number representing the surprise.
- Secondly if , then we will be infinitely surprised, so in this case.
- Similarly if , then we won’t be surprised at all, so .
- We would also like to be a nice function to work with, so let’s have it be continuously differentiable.

One final property is that I would like surprise to be *additive* or *extensive *, this means that for independent events , I would like to be . This is the main difference between probability and surprise, surprises add up, whereas probabilities multiply. I think it easier to think about adding so this makes me happy. So in general we want .

If surprise doesn’t seem to be additive to you, most likely because you have been reared on probabilities, you can substitute ‘surprise’ for ‘information’.

If you play around with these constraints you will find you need to be a negative logarithm, that is . If we choose a our base, as mathematicians prefer, we measure it in *nats*, if we work base 2, as computer scientists like, we have *bits*. This is also referred to as the self-information.

**Defining Entropy**

A natural quantity to compute is

,

the expected surprise. This is only defined when , and whenever , by convention we let . If this offends you then you should alter the definition to take the sum over the support of !

Natural quantities deserve special notation and so we write

for ,

and refer to it as the *cross entropy.*

In the special case where , where the model matches the true probabilities, we call the *entropy* of and write or for this quantity. Using Jensen’s inequality, we can show that the entropy is maximised iff for each outcome, in which case .

To get a feel for this maximization of entropy, we plot the binary case where we have just two outcomes. So letting the probability of the first outcome be , let’s see how the surprise changes with , and below that how the entropy changes:

As you can see, you can make the first outcome very surprising or informative, but you pay for that by making the complementary outcome less surprising, the lower graph shows that the optimum compromise comes from having both outcomes be equally likely.

Intuitively any approximate model will increase the surprise so that .

The difference is then a measure of much more surprised we are when using the model, this quantity is written and called the *Kullback-Leibler divergence*, which is a kind of oriented distance between distributions, notice that it is not symmetric and so not a metric.

Our intuition that can then rewritten as , and in this form it is known as *Gibbs’ inequality*.

** Proof of Gibbs’ Inequality**

This is pretty simple, since is a concave function, is convex, and we can use Jensen’s inequality, writing and we have:

.

Also notice that if so the inequality is sharp, and moreover, since we are dealing with a strictly convex function the divergence is zero if and only if .

** Conditional Entropy**

Now for two random variables we can consider the entropy of , conditioned upon . We can call this , now as ever we take the expectation to give the *conditional entropy* of given .

Including more information can only mean less surprise so we must have

.

We also have something called the *chain rule*:

,

where is the *joint entropy. *

So we have

,

and using the definition of conditional probability this becomes

which is just the joint entropy minus

and this is just the entropy, and so

,

as required.

**Mutual Information**

Similarly to the KL-divergence, we can test how much this inequality is observed, defining

,

which we call the *mutual information* of and .

In fact the analogy between mutual information and KL-divergence can be made precise if we write for the joint density and for the respective marginal densities, we have

,

to prove this is not hard and so I won’t bother. Notice that Gibb’s inequality then implies what we guessed before, that conditioning does not lose information (or increase surprise).

What is the interpretation? Well if and only if and are independent. So we are measuring the how much more surprised we will be if we assume that are independent, alternatively, how much information is lost by modelling as being independent, so it is a measure of dependence.

Fascinating! It is interesting to compare this with the correlation of and . This is also a measure of dependence, recording the extent to which large values of correspond to large values of for positive correlation, or large values of correspond to small (negative) values of . It is possible to be uncorrelated but be dependent.

By contrast mutual information is totally oblivious to large and small values, it cares only about the probabilities and so in my opinion is much more beautiful. What’s more, by Gibb’s inequality, the mutual information is zero if and only if are independent! So it is a purely informational measure of dependency, in that it fails to encode anything about how the dependency is seen from the perspective of the metric. So it really depends on the context which is more useful. For making estimates of random variables, the values are important, and so the correlation or covariance will likely be more useful.

**An Example: Guess the Number**

To illustrate the usefulness of entropy we will play a game. I will pick a number uniformly at random from the unit interval .

You can then make a guess , and I will tell you whether or . After I tell you, you have narrowed down the location of the number to if or if .

If the aim is to minimize the expected length of the interval after a guess, we should guess . This is very intuitive, but prove this we note that the expected length is

.

Since was chosen uniformly at random, and . So the expected length is

.

Note that and so , and . Therefore any extrema are minima, and since is the unique solution to , we know that guessing minimizes the expected length of the interval.

Now if is the score I give your guess, a random variable taking values so that , then we have already seen that to maximize the entropy of we should have each outcome be equally likely and so take . So the idea is when performing an experiment, or asking a question, the better you are able to predict the answer, the less you learn, and vice versa. So we see that maximizing our entropy is the best strategy as we’ve proved using calculus.

What about if we are allowed to make guesses ? If we choose to maximize our entropy with each guess, guessing halfway along each interval, then we will have narrowed down to an interval of width , but is this still the best strategy? Intuitively it does still seem like the best plan, but it isn’t quite as obvious as in the case of 1 guess.

Any deterministic strategy (that doesn’t repeat guesses!) can be described as narrowing down to intervals of lengths summing to , and the probability of the strategy ending with an interval of length is . The expected length of an interval is then .

This suggests a geometric interpretation, what we are looking for is the point on the standard simplex closest to the origin. Now if you read about projecting onto the standard simplex, you will see that we can immediately conclude that we should take .

If you don’t believe that, what you might believe using your geometric intuition is that we are looking for a sphere around the origin with radius so that the sphere is tangent to the simplex, touching it at a single point. In the 1 guess case, this looks like this:

the sphere in this case is just a circle, shown in blue. The constraint set is shown in red, the set of points with nonnegative coordinates summing to one. You can see that if you start with a small circle and expand until it hits the red line segment, the circle will hit it at the point .

So we are minimizing a function, the radius, subject to a constraint that the function be tangent to the set of points satisfying . This is precisely the situation encountered in the method of Lagrange multipliers, if you want to understand this idea I recommend checking out this blog post, Lagrangians for the Amnesiac.

In standard notation for this technique, let and . So we want to minimize subject to . The method implies that we then looking for a so that

,

which is precisely what the geometry tells us – that needs to be tangent to the standard simplex.

Solving this gives

for each ,

and so we must have

for each ,

implying that is the same for each , which forces .

So yet again the best policy is to maximize the entropy at each turn! I invite you to think about stochastic strategies.

**Next Time**

In the future we might look at how entropy is related to compression, or how the KL-divergence can be used to induce sparsity in an autoencoder.