Chapter 6: The devil is in the denominator Flashcards

Question 1

Q

What function does the denominator in Bayes’ rule carry out?

Answer

A

The denominator of Bayes’ rule, p(data), is a number that ensures that the posterior distribution is a valid probability distribution by normalising the numerator term.

Question 2

Q

What is an alternative interpretation of the denominator?

Answer

A

Before we obtain the data, the denominator is a probability distribution that represents our beliefs over all possible data samples.

Question 3

Q

How do we obtain the denominator?

Answer

A

We marginalise out all parameter dependence in the numerator.

Question 4

Q

How simple is this task of calculating the denominator?

Answer

A

The seeming simplicity of the previous statement belies the fact that, for most circumstances, this calculation is complicated and practically intractable.

Question 5

Q

Where does the numerator fail to become a valid probability density?

Answer

A

The numerator satisfies the first condition of a valid probability density – its values are non-negative. However, it falls down on the second test – its sum or integral (dependent on whether the parameters are discrete or continuous) across all parameter values does not typically equal 1.

Question 6

Q

Why does the denominator not contain θ?

Answer

A

This is because p(data) is a marginal probability density, obtained by summing or integrating out all dependence on θ. This parameter independence of the denominator ensures that the influence of θ on the shape of the posterior distribution is solely due to the numerator.

Question 7

Q

There are two ways in which we will use Bayes’ rule, which use slightly different (although conceptually identical) versions of the denominator. What are these two versions for discrete parameters?

Answer

A

Pr(data) = Σ(All θ) Pr(data, θ)
Pr(data) = Σ(All θ) Pr(data|θ) × Pr(θ).

Question 8

Q

There are two ways in which we will use Bayes’ rule, which use slightly different (although conceptually identical) versions of the denominator. What are these two versions for continuous parameters?

Answer

A

For continuous parameters we use the continuous analogue of the sum – an integral – to calculate a denominator of the form:
p(data) = ∫(All θ) p(data, θ) dθ
p(data) = ∫(All θ) p(data|θ) × p(θ) dθ.
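As a rough illustration of the continuous case, the sketch below approximates the integral numerically. The example is hypothetical and not taken from these cards: a coin with unknown bias θ, a uniform prior on [0, 1], and data of 7 heads in 10 flips. The function names and the midpoint rule are illustrative choices only.

```python
from math import comb

def likelihood(theta, heads=7, flips=10):
    """Binomial probability of the observed data for a given coin bias theta."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

def prior(theta):
    """Uniform prior density on [0, 1] (an assumption made for this sketch)."""
    return 1.0

def denominator(n_steps=10_000):
    """Approximate p(data) = integral of likelihood x prior over theta (midpoint rule)."""
    width = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        theta = (i + 0.5) * width  # midpoint of the i-th sub-interval
        total += likelihood(theta) * prior(theta) * width
    return total

p_data = denominator()
print(p_data)  # close to the exact value 1/11
```

With a single parameter a simple grid like this is cheap; the later cards explain why the same approach breaks down as the number of parameters grows.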

Question 9

Q

Imagine that we are a medical practitioner and want to calculate the probability that a patient has a particular disease. We use θ to represent the two possible outcomes:

θ = {0, disease negative; 1, disease positive}

Taking account of the patient’s medical history, we specify a prior probability of 1/4 that they have the disease. We subsequently obtain data from a diagnostic test and use this to re-evaluate the probability that the patient is disease-positive. To do this we choose a probability model (likelihood) of the form:

Pr(test positive|θ) = {1/10, θ = 0; 4/5, θ = 1}

What do we implicitly assume about the probability of a negative test result in this model?

Answer

A

We implicitly assume that, for each value of θ, the probability of a negative test result equals 1 minus the corresponding probability of a positive result: Pr(test negative|θ) = 1 − Pr(test positive|θ).

Question 10

Q

Imagine that we are a medical practitioner and want to calculate the probability that a patient has a particular disease. We use θ to represent the two possible outcomes:

θ = {0, disease negative; 1, disease positive}

Taking account of the patient’s medical history, we specify a prior probability of 1/4 that they have the disease. We subsequently obtain data from a diagnostic test and use this to re-evaluate the probability that the patient is disease-positive. To do this we choose a probability model (likelihood) of the form:

Pr(test positive|θ) = {1/10, θ = 0; 4/5, θ = 1}

What do we assume by setting Pr(test positive | θ = 0) > 0?

Answer

A

Since Pr(test positive | θ = 0) > 0 we are assuming that false positives do occur.

Question 11

Q

Imagine that we are a medical practitioner and want to calculate the probability that a patient has a particular disease. We use θ to represent the two possible outcomes:

θ = {0, disease negative; 1, disease positive}

Taking account of the patient’s medical history, we specify a prior probability of 1/4 that they have the disease. We subsequently obtain data from a diagnostic test and use this to re-evaluate the probability that the patient is disease-positive. To do this we choose a probability model (likelihood) of the form:

Pr(test positive|θ) = {1/10, θ = 0; 4/5, θ = 1}

Suppose that the individual test result is positive for the disease. Use the following expression to calculate the denominator of Bayes’ rule in this case:
Pr(data) = Σ(All θ) Pr(data, θ)
Pr(data) = Σ(All θ) Pr(data|θ) × Pr(θ).

Answer

A

Pr(test positive) = Σ(θ = 0 to 1) Pr(test positive|θ) × Pr(θ)
= Pr(test positive|θ = 0) × Pr(θ = 0) + Pr(test positive|θ = 1) × Pr(θ = 1)
= 1/10 × 3/4 + 4/5 × 1/4 = 11/40
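The arithmetic above can be checked with a few lines of Python. This is only a sketch: the dictionary names are made up for illustration, and θ = 1 is taken to mean disease positive, which is what the prior of 1/4 and the calculation above imply.

```python
from fractions import Fraction

# Prior from the card: Pr(theta = 1) = 1/4 (disease positive), so Pr(theta = 0) = 3/4.
prior = {0: Fraction(3, 4), 1: Fraction(1, 4)}

# Likelihood from the card: Pr(test positive | theta).
pr_positive_given = {0: Fraction(1, 10), 1: Fraction(4, 5)}

# Denominator: sum likelihood x prior over every value of theta.
pr_positive = sum(pr_positive_given[theta] * prior[theta] for theta in prior)
print(pr_positive)  # 11/40
```

Exact fractions are used so that the result matches the hand calculation without rounding.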

Question 12

Q

Is this denominator a valid probability density? What does this mean we can or cannot do?

Answer

A

The denominator is a valid probability density, meaning that we can calculate the counter-factual Pr(test negative) = 1 − Pr(test positive) = 29/40.
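One way to see that the denominator is a valid distribution over the data is to compute it for both possible test results and confirm that the two values sum to 1. The sketch below does this directly from the prior and likelihood on the cards; the helper name `marginal` is an illustrative choice.

```python
from fractions import Fraction

prior = {0: Fraction(3, 4), 1: Fraction(1, 4)}               # Pr(theta)
pr_positive_given = {0: Fraction(1, 10), 1: Fraction(4, 5)}  # Pr(test positive | theta)

def marginal(result):
    """Pr(result), summing the joint probability over both values of theta."""
    total = Fraction(0)
    for theta, p_theta in prior.items():
        p_pos = pr_positive_given[theta]
        # A negative result has probability 1 - Pr(test positive | theta).
        p_result = p_pos if result == "positive" else 1 - p_pos
        total += p_result * p_theta
    return total

pr_data = {result: marginal(result) for result in ("positive", "negative")}
print(pr_data["positive"], pr_data["negative"], sum(pr_data.values()))  # 11/40 29/40 1
```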

Question 13

Q

Why should we be careful in interpreting this counter-factual?

Answer

A

We need to be careful with interpreting this last result, however, since it did not actually occur; Pr(test negative) is our model-implied probability that the individual will test negatively before we carry out the test and obtain the result.

Question 14

Q

What do we then do to obtain the posterior probability that the individual has the disease, given that they test positive?

Answer

A

Use Bayes’ rule:
Pr(θ = 1|test positive) = Pr(test positive|θ = 1) × Pr(θ = 1) / Pr(test positive)
= (4/5 × 1/4) / (11/40)
= 8/11
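A short sketch of the same calculation in Python, which also shows that the numerator on its own does not sum to 1 until it is divided by the denominator. Variable names are illustrative, and θ = 1 is taken to mean disease positive.

```python
from fractions import Fraction

prior = {0: Fraction(3, 4), 1: Fraction(1, 4)}               # Pr(theta)
pr_positive_given = {0: Fraction(1, 10), 1: Fraction(4, 5)}  # Pr(test positive | theta)

# Numerator of Bayes' rule for each theta: likelihood x prior.
numerator = {theta: pr_positive_given[theta] * prior[theta] for theta in prior}

# Denominator: the numerator summed over all theta.
denominator = sum(numerator.values())

# Posterior: the normalised numerator.
posterior = {theta: numerator[theta] / denominator for theta in prior}

print(sum(numerator.values()))  # 11/40, so the numerator is not a valid distribution
print(posterior[1])             # 8/11
print(sum(posterior.values()))  # 1
```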

Question 15

Q

What is an alternative view of the denominator in regards to it being a distribution?

Answer

A

An alternative view of the denominator is as a probability distribution for the data before we observe it – in other words, the probability distribution for a future data sample given our choice of model.

Question 16

Q

What is meant by a model in the alternative view of the denominator?

Answer

A

Here model encompasses both the likelihood and the prior.

Question 17

Q

What kind of probability distribution is the denominator and how is it obtained?

Answer

A

The denominator is a marginal probability density that is obtained by integrating the joint density p(data, θ) across all θ.

Question 18

Q

What allows this joint density to be a valid probability distribution? How does this contrast with the numerator?

Answer

A

This joint density is a function of both the data and θ, and so is a valid probability distribution. This contrasts with the numerator in Bayesian inference (which is not a valid probability distribution), where we vary θ but hold the data constant.

Question 19

Q

The previous examples illustrate that the denominator of Bayes’ rule is obtained by summing (for discrete variables) or integrating (for continuous variables) the joint density p(data, θ) across the range of θ.

In what way have these examples differed from most real-life applications?

Answer

A

We have seen how this procedure works when there is a single parameter in the model. However, in most real-life applications of statistics, the likelihood is a function of a number of parameters.

Question 20

Q

For the case of a two-parameter discrete model, how is the denominator given?

Answer

A

For the case of a two-parameter discrete model, the denominator is given by a double sum:
p(data) = Σ(All θ1) Σ(All θ2) p(data, θ1, θ2)
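To make the double sum concrete, here is a minimal sketch for a hypothetical model that does not appear on these cards: two coins whose biases θ1 and θ2 can each be 1/4 or 3/4, a uniform prior over the four combinations, and data in which each coin is flipped once and both land heads.

```python
from fractions import Fraction
from itertools import product

# Hypothetical support for each coin's bias (an assumption for this sketch).
theta_values = [Fraction(1, 4), Fraction(3, 4)]

def prior(theta1, theta2):
    """Uniform prior over the four (theta1, theta2) combinations."""
    return Fraction(1, 4)

def likelihood(theta1, theta2):
    """Probability that both coins land heads on a single flip each."""
    return theta1 * theta2

# Double sum over all values of theta1 and theta2 of the joint p(data, theta1, theta2).
p_data = sum(
    likelihood(t1, t2) * prior(t1, t2)
    for t1, t2 in product(theta_values, repeat=2)
)
print(p_data)  # 1/4
```

With discrete parameters the sum can always be enumerated like this, although the number of terms grows multiplicatively with each extra parameter.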

Question 21

Q

For the case of a model with two continuous parameters, how is the denominator given?

Answer

A

p(data) = ∫(All θ1) ∫(All θ2) p(data, θ1, θ2) dθ1 dθ2

Question 22

Q

Does this increase the difficulty of the calculations by much?

Answer

A

While the two-parameter expressions may not look intrinsically more difficult than their single-parameter counterparts, this aesthetic similarity is misleading, particularly for the continuous case.

While in the discrete case it is possible to enumerate all parameter values and hence – by brute force – calculate the exact value of p(data), for continuous parameters the integral may be difficult to calculate. This difficulty is amplified the more parameters our model has, rendering analytic calculation of the denominator practically impossible for all but the simplest models.

Question 23

Q

How effective would an approximate numerical scheme that uses a deterministic method, for example Gaussian quadrature, be at estimating the above integral?

Answer

A

The example integral is 200-dimensional, which is impossible to exactly calculate. Furthermore, any approximate numerical scheme that uses a deterministic method to estimate the above integral, for example Gaussian quadrature, will also fail to work. For relatively complex problems we simply cannot calculate the denominator of Bayes’ rule. This means we cannot normalise the numerator and, in doing so, transform it into a valid probability distribution.
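A back-of-the-envelope sketch of why grid-based deterministic schemes break down: a rule that places the same number of evaluation points along every dimension needs that number raised to the power of the dimension. The choice of 10 points per dimension below is an arbitrary assumption for illustration.

```python
def grid_evaluations(points_per_dimension, n_dimensions):
    """Number of integrand evaluations for a full grid over all dimensions."""
    return points_per_dimension ** n_dimensions

for n_dimensions in (1, 2, 10, 200):
    count = grid_evaluations(10, n_dimensions)
    # Report the count by its number of digits so the 200-dimensional case stays readable.
    print(n_dimensions, "dimensions:", len(str(count)), "digit count of evaluations")
```

For 200 dimensions the count is 10 to the power 200, a 201-digit number, which is far beyond anything that could be evaluated in practice.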

Question 24

Q

In fact, even if we could calculate the denominator of Bayes’ rule, we would still have difficulties.

Explain why this is the case.

Answer

A

A common summary measure of a posterior distribution is the posterior mean. Suppose for our school test example we want to calculate the posterior mean of μ1, which represents the mean test score for school 1. To do so, we would need to integrate the posterior multiplied by μ1 over all of the model’s parameters. Since this integral is also 200-dimensional, we will have the same problems as we did for the denominator. This means that, in most circumstances, we cannot exactly calculate the mean, variance or any other summary measure of the posterior.

Question 25

Q

Where does this difficulty in calculating multidimensional denominators in Bayesian statistics stem from?

Answer

A

So for relatively complex models, it seems that we are in trouble. This issue originates from the inherent complexity of integrating multidimensional probability distributions, not just the difficulty in calculating the denominator term of Bayes’ rule.

Question 26

Q

If a model has more than about three parameters, then it is difficult to calculate any of the integrals necessary to do applied Bayesian inference. What can we do in this scenario?

Answer

A

All is not lost. In these circumstances, we can take a different route. There are two solutions to the difficulty:
• Use priors conjugate to the likelihood.
• Abandon exact calculation, and opt to sample from the posterior instead.

Question 27

Q

What does utilising conjugate priors allow for? How does this simplify the analysis?

Answer

A

Using conjugate priors still allows exact derivation of the posterior distribution (and usually most summary measures) by choosing a mathematically ‘nice’ form for the prior distribution. This simplifies the analysis since we can simply look up tabulated formulae for the posterior, avoiding the need to do any maths.
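As one standard instance of such a tabulated formula, a Beta(a, b) prior combined with a binomial likelihood gives a Beta(a + successes, b + failures) posterior. The cards do not name a specific conjugate pair, so the choice of this pair, the function names and the data (7 successes in 10 trials with a Beta(1, 1) prior) are assumptions made for this sketch.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters for a Beta(a, b) prior and binomial data."""
    return a + successes, b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical data: 7 successes in 10 trials, with a uniform Beta(1, 1) prior.
a_post, b_post = beta_binomial_update(1, 1, successes=7, trials=10)
print(a_post, b_post)             # 8 4
print(beta_mean(a_post, b_post))  # posterior mean, 8/12
```

No integration is carried out here: the denominator never has to be computed explicitly, because the posterior is known to belong to the same family as the prior.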

Question 28

Q

Can you use conjugate priors in real-life applications of Bayesian statistics?

Answer

A

Not always. In real-life applications of Bayesian statistics, we often need to stray outside this realm of mathematical convenience. The price we pay for a wider choice of priors and likelihoods is that we must stop aspiring for exact results.

Question 29

Q

For example, we cannot hope to exactly calculate the posterior mean, standard deviation and any uncertainty intervals for parameters. What can we do in this scenario?

Answer

A

In these circumstances, we can sample from the posterior and then use sample summary statistics to describe the posterior.
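The sketch below shows what describing a posterior by sample summaries might look like. Since no sampler is introduced at this point, the draws come from a Beta(8, 4) distribution used purely as a stand-in for a posterior; in practice they would come from a sampling algorithm. The seed and the sample size are arbitrary choices.

```python
import random
import statistics

random.seed(1)

# Stand-in posterior draws (an assumption: a real analysis would use a sampler's output).
samples = [random.betavariate(8, 4) for _ in range(20_000)]

posterior_mean = statistics.mean(samples)
posterior_sd = statistics.stdev(samples)

# Approximate central 95% interval from the sorted samples.
ordered = sorted(samples)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]

print(posterior_mean, posterior_sd, (lower, upper))
```

The more samples we draw, the closer these summaries get to the exact posterior mean, standard deviation and interval.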

Question 30

Q

What would the posterior density be written as in this scenario?

Answer

A

p(θ|data) = (p(data|θ) × p(θ)) / p(data)
∝ p(data|θ) × p(θ)

Question 31

Q

Why is this second line (∝ p(data|θ) × p(θ)) useful?

Answer

A

We obtained the second line because p(data) is independent of θ – it is a constant that we use to normalise the posterior. Therefore, the numerator of Bayes’ rule tells us everything we need to know about the shape of the posterior distribution, whereas the denominator merely tells us about its height. Fortunately, we only require information on the shape of the posterior to generate samples from it. This forms the basis of all modern computational methods.
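To illustrate sampling from shape alone, here is a minimal sketch of a random-walk Metropolis sampler, one of several algorithms of this kind; the cards themselves do not name a specific method. The target is a hypothetical unnormalised posterior for a coin bias θ (7 heads in 10 flips, uniform prior), and the step size, seed and burn-in length are arbitrary assumptions. Note that p(data) is never computed: only ratios of the numerator are used.

```python
import random

def unnormalised_posterior(theta, heads=7, flips=10):
    """Likelihood x uniform prior, up to a constant; zero outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta**heads * (1 - theta) ** (flips - heads)

def metropolis(n_samples, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis: accept a proposal with probability min(1, ratio)."""
    rng = random.Random(seed)
    current = start
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        # The unknown denominator p(data) would cancel in this ratio.
        ratio = unnormalised_posterior(proposal) / unnormalised_posterior(current)
        if rng.random() < ratio:
            current = proposal
        samples.append(current)
    return samples

draws = metropolis(20_000)[2_000:]  # discard early draws as burn-in
print(sum(draws) / len(draws))      # should be near the exact posterior mean, 2/3
```

For this simple target the exact posterior is Beta(8, 4), so the sample mean can be compared against 2/3 as a sanity check.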