
Entropy: Expect a Surprise

Defining Surprise

Let’s suppose we have a system with n states X=\{x_1, \dots, x_n \} which occur with probabilities p(x_i). On the other hand, we have a model for the system which encodes our beliefs or predictions about the states’ probabilities, q(x_i).

How surprised would we be if x_i happens? What properties should the surprise have? Let’s make a wish list:

  • First, the surprise should be a function s: [0,1] \to \mathbb{R} that takes a probability to a real number representing the surprise.
  • Secondly, if q(x_i)=0, then we will be infinitely surprised, so s(0) = \infty.
  • Similarly, if q(x_i)=1, then we won’t be surprised at all, so s(1)=0.
  • We would also like s to be a nice function to work with, so let’s have it be continuously differentiable.

One final property is that I would like surprise to be additive or extensive: for independent events E, F, I would like s(q(E \cap F)) = s(q(E)q(F)) to equal s(q(E))+s(q(F)). This is the main difference between probability and surprise: surprises add up, whereas probabilities multiply. I find it easier to think about adding, so this makes me happy. In general, then, we want s(ab)=s(a)+s(b).

If surprise doesn’t seem additive to you, most likely because you have been reared on probabilities, you can substitute ‘information’ for ‘surprise’.

If you play around with these constraints you will find that s must be a negative logarithm, that is s(x) =\log \frac{1}{x}. If we choose e as our base, as mathematicians prefer, we measure surprise in nats; if we work in base 2, as computer scientists like, we measure it in bits. This quantity is also referred to as the self-information.
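To make this concrete, here is a minimal Python sketch (the function name surprise is just for illustration, not a standard library function) computing the self-information in a chosen base and checking the additivity property:

```python
import math

def surprise(q, base=math.e):
    """Self-information of an outcome our model assigns probability q.
    Returns +inf when q == 0 and 0 when q == 1."""
    if q == 0:
        return math.inf
    return -math.log(q, base)

# A fair coin flip: exactly 1 bit of surprise.
print(surprise(0.5, base=2))    # 1.0
# A near-certain outcome barely surprises us.
print(surprise(0.99, base=2))   # ~0.0145
# Additivity for independent events: s(ab) = s(a) + s(b).
print(surprise(0.5 * 0.25, base=2), surprise(0.5, base=2) + surprise(0.25, base=2))  # 3.0 3.0
```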

Defining Entropy

A natural quantity to compute is

\mathbb{E}s = \sum \limits _{i=1}^np(x_i)\log\frac{1}{q(x_i)} = \mathbb{E}_p(-\log q),

the expected surprise. This is only defined when q(x_i)=0 \Rightarrow p(x_i)=0, and whenever p(x_i)=0 we adopt the convention 0 \cdot \log 0 = 0 \cdot \log \frac{0}{0}=0. If this offends you, then you should alter the definition to take the sum over the support of p!

Natural quantities deserve special notation and so we write

H(p,q) for \mathbb{E}s,

and refer to it as the cross entropy.

In the special case where p=q, where the model matches the true probabilities, we call H(p,p) the entropy of X and write H(p) or H(X) for this quantity. Using Jensen’s inequality, we can show that the entropy is maximised iff p(x_i)=\frac{1}{n} for each outcome, in which case H(X)=\log n.
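To make the definitions concrete, here is a small Python sketch (the helper names cross_entropy and entropy, and the toy distributions, are just for illustration):

```python
import math

def cross_entropy(p, q, base=2):
    """H(p, q) = sum_i p_i * log(1 / q_i), summing over the support of p."""
    return sum(-pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def entropy(p, base=2):
    """H(p) = H(p, p), the expected surprise under the true distribution."""
    return cross_entropy(p, p, base)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))                 # 2.0 == log2(4), the maximum for 4 outcomes
print(entropy(skewed))                  # ~1.357, strictly smaller
print(cross_entropy(skewed, uniform))   # 2.0, never less than entropy(skewed)
```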

To get a feel for this maximization of entropy, we plot the binary case where we have just two outcomes. So letting the probability of the first outcome be p, let’s see how the surprise changes with p, and below that how the entropy changes:

[Figure: the surprise of the first outcome (top) and the binary entropy (bottom), both plotted against p]

As you can see, you can make the first outcome very surprising or informative, but you pay for that by making the complementary outcome less surprising. The lower graph shows that the optimal compromise comes from having both outcomes be equally likely.
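If you would like to reproduce these plots, a short matplotlib sketch along the following lines should do (the axis labels and the exact range of p are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)
surprise = -np.log2(p)                                        # surprise of the first outcome
binary_entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # entropy of the two-outcome system

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(p, surprise)
ax1.set_ylabel("surprise (bits)")
ax2.plot(p, binary_entropy)
ax2.set_ylabel("entropy (bits)")
ax2.set_xlabel("p, probability of the first outcome")
plt.show()
```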

Intuitively any approximate model q will increase the surprise so that H(p) \le H(p,q).

The difference H(p,q)-H(p) is then a measure of how much more surprised we are when using the model. This quantity is written d_{KL}(p \Vert q) and called the Kullback-Leibler divergence, a kind of oriented distance between distributions; notice that it is not symmetric and so not a metric.

Our intuition that H(p) \le H(p,q) can then be rewritten as d_{KL}(p \Vert q) \ge 0, and in this form it is known as Gibbs’ inequality.

Proof of Gibbs’ Inequality

This is pretty simple: since \log(x) is a concave function, -\log x=\log \frac{1}{x} is convex, and we can use Jensen’s inequality. Writing p_i:=p(x_i) and q_i:=q(x_i), we have:

d_{KL}(p\Vert q)=\sum \limits _{i=1}^n p_i \log\frac{p_i}{q_i}=-\sum \limits _{i=1}^n p_i \log\frac{q_i}{p_i} \ge - \log \left (\sum \limits _{i=1}^np_i\cdot \frac{q_i}{p_i} \right )=-\log(1)=0.

Also notice that d_{KL}(p\Vert q)=0 if p=q, so the inequality is sharp; moreover, since we are dealing with a strictly convex function, the divergence is zero if and only if p=q.
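Here is a quick numerical check of Gibbs’ inequality (the helper kl_divergence and the distributions p, q are arbitrary illustrations):

```python
import math

def kl_divergence(p, q, base=2):
    """d_KL(p || q) = sum_i p_i * log(p_i / q_i), summing over the support of p."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl_divergence(p, q))   # ~0.085 > 0, as Gibbs' inequality demands
print(kl_divergence(p, p))   # 0.0, equality exactly when p == q
```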

Conditional Entropy

Now for two random variables X, Y we can consider the entropy of Y | X=x, that is, Y conditioned upon X=x. We call this H(Y|X=x); as ever, we take the expectation over X to define H(Y|X):=\mathbb{E}_X H(Y|X=x), the conditional entropy of Y given X.

Including more information can only mean less surprise so we must have

H(Y|X) \le H(Y).

We also have something called the chain rule:

H(Y|X) = H(X,Y)-H(X),

where H(X,Y):=\sum \limits _{x,y}p(x,y)\log (\frac{1}{p(x,y)}) is the joint entropy.

So we have

H(Y|X)=\sum \limits_{x}p(x)H(Y|X=x) = \sum \limits_{x}p(x) \sum \limits_{y}p(y|x)\log(\frac{1}{p(y|x)}) = \sum \limits_{x,y}p(y|x)p(x)\log(\frac{1}{p(y|x)}),

and using the definition of conditional probability this becomes

\sum \limits_{x,y}p(y,x) \log(\frac{1}{p(y|x)})= \sum \limits_{x,y}p(y,x) \log(\frac{p(x)}{p(y,x)})

which is just the joint entropy minus a second term,

H(X,Y) - \sum \limits _{x,y}p(y,x) \log (\frac{1}{p(x)}) = H(X,Y)- \sum \limits _{x}p(x)\log (\frac{1}{p(x)}),

and the subtracted sum is just the entropy of X, so

H(Y|X)=H(X,Y)-H(X),

as required.
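We can sanity-check the chain rule numerically from a small joint distribution (the 2-by-3 table below is an arbitrary illustrative example):

```python
import math

def H(probs, base=2):
    """Entropy of a list of probabilities (zero entries contribute nothing)."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

# A small joint distribution p(x, y), rows indexed by x, columns by y.
joint = [[0.10, 0.20, 0.10],
         [0.25, 0.05, 0.30]]

p_x = [sum(row) for row in joint]                  # marginal of X
H_XY = H([p for row in joint for p in row])        # joint entropy H(X, Y)
H_X = H(p_x)

# H(Y | X) = sum_x p(x) * H(Y | X = x), with p(y | x) = p(x, y) / p(x).
H_Y_given_X = sum(px * H([p / px for p in row]) for px, row in zip(p_x, joint))

print(H_Y_given_X, H_XY - H_X)   # the two agree: the chain rule
```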

Mutual Information

Similarly to the KL-divergence, we can measure how far this inequality is from being an equality by defining

I(X,Y) = H(X)- H(X|Y),

which we call the mutual information of X and Y.

In fact the analogy between mutual information and the KL-divergence can be made precise: writing p_{X,Y} for the joint density and p_X, p_Y for the respective marginal densities, we have

I(X,Y) = d_{KL}(p_{X,Y}\Vert p_X \cdot p_Y),

and proving this is not hard, so I won’t bother. Notice that Gibbs’ inequality then implies what we guessed before: conditioning does not lose information (or increase surprise).

What is the interpretation? Well, p_{X,Y}=p_X\cdot p_Y if and only if X and Y are independent. So we are measuring how much more surprised we will be if we assume that X, Y are independent, or alternatively, how much information is lost by modelling X, Y as independent; it is therefore a measure of dependence.

Fascinating! It is interesting to compare this with the correlation of X and Y. Correlation is also a measure of dependence, recording the extent to which large values of X correspond to large values of Y (positive correlation) or to small values of Y (negative correlation). It is possible to be uncorrelated yet dependent.

By contrast, mutual information is totally oblivious to large and small values; it cares only about the probabilities, and so in my opinion is much more beautiful. What’s more, by Gibbs’ inequality, the mutual information is zero if and only if X, Y are independent! So it is a purely informational measure of dependency, in that it fails to encode anything about how the dependency looks from the perspective of the metric. So which is more useful really depends on the context: for making estimates of random variables, the values are important, and so the correlation or covariance will likely be more useful.
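To see the contrast concretely, here is a small sketch (a standard toy construction, chosen purely for illustration): take X uniform on \{-1, 0, 1\} and Y = X^2, so that the covariance vanishes while the mutual information does not:

```python
import math

# X uniform on {-1, 0, 1} and Y = X^2: uncorrelated, but clearly dependent.
support = [(-1, 1), (0, 0), (1, 1)]
p_xy = {pair: 1 / 3 for pair in support}

# Covariance: E[XY] - E[X]E[Y] = 0, so correlation sees no dependence.
EX = sum(x * p for (x, y), p in p_xy.items())
EY = sum(y * p for (x, y), p in p_xy.items())
EXY = sum(x * y * p for (x, y), p in p_xy.items())
print(EXY - EX * EY)   # 0.0

# Mutual information I(X, Y) = d_KL(p_XY || p_X * p_Y) > 0.
p_x = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
p_y = {0: 1 / 3, 1: 2 / 3}
I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(I)               # ~0.918 bits: the dependence is detected
```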

An Example: Guess the Number

To illustrate the usefulness of entropy we will play a game. I will pick a number y uniformly at random from the unit interval I=[0,1].

You can then make a guess x \in [0,1], and I will tell you whether x \le y or x > y. After I tell you, you have narrowed down the location of the number to [x,1] if x \le y or [0,x] if x>y.

If the aim is to minimize the expected length of the interval after a guess, we should guess x=\frac{1}{2}. This is very intuitive, but to prove it we note that the expected length is

(1-x)\cdot P(x \le y) + x \cdot P(x >y).

Since y was chosen uniformly at random, P(x \le y)=1-x and P(x>y)=x. So the expected length is

L(x):= x^2 + (1-x)^2.

Note that L(x) = 2x^2 -2x +1 and so L'(x)=4x-2, and L''(x) = 4. Therefore any extrema are minima, and since x=1/2 is the unique solution to L'(x)=0, we know that guessing x=1/2 minimizes the expected length of the interval.

Now if S_x is the score I give your guess, a random variable taking values in \{0,1\} with P(S_x=1)=x, then we have already seen that to maximize the entropy H(S_x) we should make each outcome equally likely and so take x=1/2. The idea is that when performing an experiment, or asking a question, the better you are able to predict the answer, the less you learn, and vice versa. So maximizing our entropy is the best strategy, as we’ve proved using calculus.

What if we are allowed to make n guesses x_1, \dots ,x_n? If we choose to maximize our entropy with each guess, guessing halfway along the current interval, then we will have narrowed y down to an interval of width 2^{-n}. But is this still the best strategy? Intuitively it does still seem like the best plan, but it isn’t quite as obvious as in the case of a single guess.
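Before the general argument, here is a quick simulation sketch (the helper play and its parameters are invented for illustration); it compares bisection with a rule that always cuts 30% of the way along the current interval:

```python
import random

def play(n_guesses, split=0.5, trials=100_000):
    """Average final interval width when each guess is made a fixed fraction
    `split` of the way along the current interval (split=0.5 is bisection)."""
    total = 0.0
    for _ in range(trials):
        y = random.random()
        lo, hi = 0.0, 1.0
        for _ in range(n_guesses):
            x = lo + split * (hi - lo)
            if x <= y:
                lo = x
            else:
                hi = x
        total += hi - lo
    return total / trials

print(play(5, split=0.5))   # ~0.03125 == 2 ** -5
print(play(5, split=0.3))   # ~0.066, noticeably larger on average
```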

Any deterministic strategy (that doesn’t repeat guesses!) can be described as narrowing y down to one of 2^n intervals with lengths p:=(p_1, \dots , p_{2^n}) summing to 1, and the probability of the strategy ending with the interval of length p_i is p_i. The expected length of the final interval is then L_p =\sum \limits _{i=1}^{2^n}p_i^2 = \langle p,p \rangle = \Vert p \Vert^2.

This suggests a geometric interpretation: what we are looking for is the point on the standard simplex \triangle_{2^n-1} closest to the origin. If you read about projecting onto the standard simplex, you will see that we can immediately conclude that we should take p=(2^{-n}, \dots, 2^{-n}).

If you don’t believe that, your geometric intuition might still tell you that we are looking for a sphere S_r around the origin with radius r such that the sphere is tangent to the simplex, touching it at a single point. In the one-guess case, this looks like this:

[Figure: a circle centred at the origin, tangent to the segment \{(x,y) : x + y = 1,\ x, y \ge 0\} at the point (\frac{1}{2},\frac{1}{2})]

The sphere in this case is just a circle, shown in blue; the constraint set, shown in red, is the set of points with nonnegative coordinates summing to one. If you start with a small circle and expand it until it hits the red line segment, it will touch at the point (\frac{1}{2},\frac{1}{2}).

So we are minimizing a function, the (squared) radius, over the set of points q satisfying \sum \limits _{i=1}^{2^n}q_i =1, \ q_i \ge 0; at the minimum, the level set of this function is tangent to the constraint set. This is precisely the situation handled by the method of Lagrange multipliers; if you want to understand this idea, I recommend checking out this blog post, Lagrangians for the Amnesiac.

In standard notation for this technique, let f(x) = \langle x,x \rangle and g(x) = -1 + \sum \limits _{i=1}^{2^n}x_i, so we want to minimize f subject to g(x)=0. The method tells us we are then looking for a \lambda such that

\nabla f - \lambda \nabla g = 0,

which is precisely what the geometry tells us: the level set of f must be tangent to the standard simplex.

Solving this gives

2x_i - \lambda =0 for each i=1, \dots, 2^n,

and so we must have

x_i = \frac{\lambda}{2} for each i,

implying that x_i is the same for each i, which together with the constraint forces x_i=2^{-n}.
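If you want a numerical sanity check, here is a sketch using scipy.optimize (the choice of the SLSQP solver and the random starting point are my own):

```python
import numpy as np
from scipy.optimize import minimize

n = 3            # number of guesses
m = 2 ** n       # number of possible final intervals

# Minimise <p, p> over the standard simplex: sum(p) == 1, p >= 0.
result = minimize(
    fun=lambda p: p @ p,
    x0=np.random.dirichlet(np.ones(m)),                      # random start on the simplex
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}],
    bounds=[(0, 1)] * m,
    method="SLSQP",
)

print(result.x)              # all entries close to 2**-n == 0.125
print(result.fun, 2 ** -n)   # minimal expected length equals 2**-n
```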

So yet again the best policy is to maximize the entropy at each turn! I invite you to think about stochastic strategies.

Next Time

In the future we might look at how entropy is related to compression, or how the KL-divergence can be used to induce sparsity in an autoencoder.
