
# A First Taste of Bayesian Theory

This semester I am taking a class in Bayesian inference and today we came across an interesting example that I would like to share with you. The post will assume a basic knowledge of probability theory, but it will be nothing a visit to Wikipedia can’t handle.

## Bayes’ Theorem

Throughout I will assume we have appropriate measure spaces lurking in the background, but suppress reference to them where possible.

Given two events $A,B$ with $\mathbb{P}(B)>0$, the conditional probability of $A$ given $B$ is defined to be $\mathbb{P}(A|B)=\frac{\mathbb{P}(A \cap B)}{ \mathbb{P}(B)}$.

This intuitively captures the proportion of the measure of the event $A\cap B$ within $B$: given that $B$ has happened, it tells you how likely $A$ is.
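As a quick sanity check on the definition, here is a minimal sketch (my own toy example, not from the post) that computes a conditional probability for two fair dice by counting outcomes:

```python
from fractions import Fraction

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = {o for o in omega if o[0] + o[1] == 8}   # event A: the dice sum to 8
B = {o for o in omega if o[0] % 2 == 0}      # event B: the first die is even

p_B = Fraction(len(B), len(omega))
p_A_and_B = Fraction(len(A & B), len(omega))

# P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = p_A_and_B / p_B

print(p_A_given_B)  # 1/6: of the 18 outcomes in B, exactly 3 lie in A
```

Using `Fraction` keeps the arithmetic exact, so the answer comes out as a clean ratio of counts rather than a float.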

Applying this definition twice relates the two conditional probabilities, giving us Bayes’ Theorem: $\mathbb{P}(A|B) = \frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}$.

One intuitive way to think of this is where $A$ represents a hypothesis, and $B$ represents data or evidence. Then Bayes’ Theorem describes how to update your belief in the hypothesis based on this evidence.

We often call $\mathbb{P}(A)$ the prior: our initial degree of belief in $A$. In the same spirit we call $\mathbb{P}(A|B)$ the posterior, describing our belief now that we know $B$ has occurred. The conditional $\mathbb{P}(B|A)$ is called the likelihood; it measures how likely the evidence is given the hypothesis. This is often easier to calculate than the reverse, and this asymmetry is the reason Bayes’ Theorem is useful.

The theorem may be naturally rephrased for the densities of random variables $X,Y$ as $f_X(x|Y=y) = \frac{f_Y(y|X=x)f_X(x)}{f_Y(y)}$

for continuous $X,Y$, and similarly for discrete random variables or combinations of the two.

## A First Example

Let’s suppose that $X$ is a Boolean random variable, so that $\mathbb{P}(X=1)=p$ and $\mathbb{P}(X=0)=1-p$ for some probability $p$; in other words, $X$ has a Bernoulli distribution with parameter $p$. In our example $X$ takes as its argument a member of a population and returns $1$ if they have a certain disease, and $0$ otherwise. We call $p$ the prevalence of the disease.

Now we introduce another random variable, a test for the disease, call it $T$. This also takes values in $\{0,1\}$, where $1$ means the test predicts that a person has the disease, and $0$ that they do not. Like most tests, however, it is imperfect; to describe this we specify the conditional distributions $\mathbb{P}(T=1 | X=1)=q$ and $\mathbb{P}(T=0| X=0)=w$. In medical parlance, $q$ is the sensitivity of the test, and $w$ the specificity.

Now we want to know: given these parameters, what is the probability that a person has the disease if the test says they do?

Translating this, we are looking for $\mathbb{P}(X=1 | T=1)$. So let’s use Bayes’ Theorem to work it out:

$\mathbb{P}(X=1 | T=1)= \frac{\mathbb{P}(T=1|X=1)\mathbb{P}(X=1)}{\mathbb{P}(T=1)} = \frac{qp}{\mathbb{P}(T=1)}$,

just plugging in for the numerator. Next we use the law of total probability on the denominator to deduce that

$\mathbb{P}(X=1|T=1)=\frac{qp}{\mathbb{P}(T=1 \cap X=1)+\mathbb{P}(T=1 \cap X=0)}$,

and rearrange using the conditional probability formula to give

$\mathbb{P}(X=1|T=1)=\frac{qp}{\mathbb{P}(T=1|X=1)\mathbb{P}(X=1)+\mathbb{P}(T=1|X=0)\mathbb{P}(X=0)} = \frac{qp}{qp+(1-w)(1-p)}.$

Now suppose that the sensitivity $q = 0.95$ and that the specificity $w=0.98$, a pretty good test I’m sure you’d agree.

Let’s suppose that the prevalence is $p=\frac{1}{1000}$. Then we calculate the probability of a person having the disease given a positive test: $\mathbb{P}(X=1|T=1) = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.02 \times 0.999} \approx 0.045$! So a person with a positive result has a less than 5 percent chance of having the disease. I found this extremely surprising, and it comes down to the prevalence of the disease: roughly speaking, the less prevalent the disease, the better your test needs to be to detect it above the noise.
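The closed-form expression derived above is easy to check numerically. Here is a minimal sketch (the function name is my own) that plugs the stated parameters into $\frac{qp}{qp+(1-w)(1-p)}$:

```python
def posterior_positive(p, q, w):
    """Posterior P(X=1 | T=1) = qp / (qp + (1-w)(1-p)).

    p: prevalence, q: sensitivity, w: specificity.
    """
    return q * p / (q * p + (1 - w) * (1 - p))

# The parameters from the example: a rare disease, a good test.
print(posterior_positive(p=0.001, q=0.95, w=0.98))  # ≈ 0.045
```

Playing with the arguments shows the effect described above: raising `p` to, say, `0.1` pushes the posterior well above one half, even with the same test.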

## An $n^{th}$ Example

This got me thinking: how many positive tests would you need to conclude that a person has a disease? In other words, how fast, if at all, does the probability of a person having the disease go to $1$ as the number of positive tests in a row goes to infinity?

To do this, we’ll think about a guy called Bob, who has just walked into the clinic to get tested. Because of recent legislation, everyone in the entire world has to get tested for this disease, and so the prevalence amongst those getting tested is just the same as the prevalence of the disease in general. We call this prevalence $p$ as before and treat it as known.

This time $B$ will be a random variable recording whether Bob has the disease; our prior belief about it is again a Bernoulli distribution with parameter $p$.

The difference this time is that we will perform multiple tests on the same person, each returning positive or negative ‘independently’. Now for a sequence of identically distributed tests $T_1, \dots , T_n$ it is of course not the case that they are independent, as plainly $\mathbb{P}(T_{i}=1) < \mathbb{P}(T_{i}=1|T_j =1)$: one positive test raises our belief that the person is diseased, which makes another positive more likely. What we want instead is conditional independence, namely that

$\mathbb{P}(T_i=t_1, T_j=t_2|B=x)=\mathbb{P}(T_i=t_1|B=x) \, \mathbb{P}(T_j=t_2|B=x)$ for all $i \neq j$ and all $t_1,t_2,x \in \{0,1\}$.

What this means is that, once we know whether or not the person has the disease, the outcome of one test tells us nothing further about the outcome of another: the chance of a test getting it right or wrong depends only on the person’s disease status.

Now we update our prior given that we have $n$ positive tests, that is $\bigcap \limits _{i=1}^n (T_i=1)$, which we’ll call $E_n$ for brevity:

$\mathbb{P}(B=1 | E_n) = \frac{\mathbb{P}(E_n|B=1) \mathbb{P}(B=1)}{\mathbb{P}(E_n)}$

Now conditional independence gives $\mathbb{P}(E_n|B=1)=q^n$ and $\mathbb{P}(E_n|B=0)=(1-w)^n$, so the law of total probability, as before, yields:

$\mathbb{P}(B=1|E_n)=\frac{q^np}{q^np+(1-w)^n(1-p)}$,

and isolating the dependence on $n$:

$\mathbb{P}(B=1|E_n)=\frac{1}{1+ \left ( \frac{1-w}{q} \right )^n \frac{1-p}{p}}$,

and hence $\mathbb{P}(B=1 | E_n) \to 1$ as $n \to \infty$ iff $\left (\frac{1-w}{q} \right )^n \to 0$ as $n \to \infty$ iff $w+q >1$. Also notice that we need to assume that $p$ is not 1 or 0, otherwise there is nothing to do since the prevalence is total or non-existent.

So how fast does it converge? As you can see, the gap $1-\mathbb{P}(B=1|E_n)$ shrinks at a geometric rate of order $\left(\frac{1-w}{q}\right)^n$, faster than any polynomial! So we say that the event $B=1$ occurs with overwhelming probability given $E_n$.
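To see this convergence concretely, below is a minimal sketch (the function name is my own) that tabulates the posterior $\frac{1}{1+\left(\frac{1-w}{q}\right)^n \frac{1-p}{p}}$ for the first few values of $n$, using the same parameters as the single-test example:

```python
def posterior_n_positives(n, p=0.001, q=0.95, w=0.98):
    """P(B=1 | E_n) = 1 / (1 + ((1-w)/q)^n * (1-p)/p).

    n: number of consecutive positive tests;
    p: prevalence, q: sensitivity, w: specificity.
    """
    return 1 / (1 + ((1 - w) / q) ** n * (1 - p) / p)

# One positive test leaves us below 5%, but a few more are decisive.
for n in range(1, 6):
    print(n, round(posterior_n_positives(n), 6))
```

With these numbers the ratio $\frac{1-w}{q} \approx 0.021$, so each extra positive test multiplies the odds against the disease by about a factor of fifty: three positives in a row already put the posterior above $0.99$.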