Chapter 4: Likelihoods Flashcards
Again what is Bayes’ rule cousin? Gun to your head rn say it or you die in front of your wife and kids bruh. WHAT IS BAYES RULE?! SAY IT, SAY IT WITH YOUR CHEST
p(θ | data) = p(data | θ) * p(θ) / p(data)
What is p(data | θ) known as?
Bayesians call this term in the numerator, p(data | θ), the likelihood.
What does this mean in simple, everyday language? Give an example to demonstrate why this nomenclature is used
Imagine that we flip a coin and record its outcome. The simplest model to represent this outcome ignores the angle the coin was thrown at, and its height above the surface, along with any other details. Because of our ignorance, our model cannot perfectly predict the behaviour of the coin. This uncertainty means that our model is probabilistic rather than deterministic: p(data | θ) is the probability that this model assigns to the observed outcome for a given value of θ.
Explain the reasoning behind the extra term in the following equation:
Pr(H,H |θ,Model) = Pr(H |θ,Model)× Pr(H |θ,Model)
Model represents the set of assumptions that we make in our analysis. We generally omit this term on the understanding that it is implicit.
Why, specifically in Bayesian inference, do Bayesians insist on calling p(data | θ) a likelihood, not a probability?
This is because in Bayesian inference we do not keep the parameters of our model fixed. In Bayesian analysis, the data are fixed and the parameters vary. In particular, Bayes’ rule tells us how to calculate the posterior probability density for any value of θ.
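To make this concrete, here is a rough sketch (my own, not from the book) of how the posterior could be computed numerically over a grid of θ values, assuming a Bernoulli coin-flip likelihood, a flat prior, and hypothetical data of one head and one tail:

```python
import numpy as np

# Sketch (not from the book): approximate the posterior over a grid of θ values,
# assuming a Bernoulli coin-flip likelihood, a flat prior, and hypothetical data
# of one head followed by one tail.
theta = np.linspace(0, 1, 101)            # candidate values of θ
prior = np.ones_like(theta)               # uniform prior, p(θ) ∝ 1
likelihood = theta * (1 - theta)          # p(data | θ) for one head then one tail
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()   # grid stand-in for dividing by p(data)

print(theta[np.argmax(posterior)])        # posterior peaks at θ = 0.5
```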
Consider flipping a coin whose inherent bias, θ, is unknown beforehand. Suppose that we flip our coin twice and obtain one head and one tail. How can we use our model to calculate the probability of this data?
In Bayesian inference, we use a sample of coin flip outcomes to estimate a posterior belief in any value of θ. To obtain p(θ |data) we must compute p(data|θ) in the numerator of Bayes’ rule for each possible value of θ. We can use our model to calculate the probability of this data for any value of θ:
Pr(H,T |θ) + Pr(T,H |θ) = θ(1−θ)+θ(1−θ)
= 2θ(1−θ)
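As an illustration, a short sketch (mine, not from the book) that evaluates this probability at a few candidate values of θ, with the data held fixed:

```python
# Sketch: evaluate the probability of one head and one tail, 2θ(1−θ),
# at a few candidate values of θ (data held fixed, parameter varying).
def prob_one_head_one_tail(theta):
    return 2 * theta * (1 - theta)

for theta in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(f"θ = {theta:.2f} -> Pr = {prob_one_head_one_tail(theta):.3f}")
```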
What does this (probability of the data) yield?
This result yields the probability for a fixed data sample (one head and one tail) as a function of θ.
Comment on what we obtain if we graph the probability for a fixed data sample (one head and one tail) as a function of θ.
It might appear that the graph is a continuous probability distribution, but looks can deceive. While all the values of the distribution are non-negative, if we calculate the area underneath the curve we obtain:
∫₀¹ 2θ(1 − θ) dθ = 1/3,
which does not equal 1. Thus our distribution is not a valid probability distribution.
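A quick numerical check of this integral (a sketch using scipy, not from the book):

```python
from scipy.integrate import quad

# Sketch: numerically check that the area under 2θ(1−θ) over θ ∈ [0, 1] is 1/3.
area, _ = quad(lambda theta: 2 * theta * (1 - theta), 0, 1)
print(area)   # ≈ 0.333...
```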
What is the implication of p(data | θ) not being a valid probability distribution when we vary θ?
Hence, when we vary θ, p(data|θ) is not a valid probability distribution. We thus introduce the term likelihood to describe p(data|θ) when we vary the parameter, θ.
What notation is generally used to emphasise that a likelihood is a function of the parameter θ with the data held fixed?
L(θ |data)= p(data|θ)
What is L(θ |data)= p(data|θ) referred to as and why?
We call L(θ | data) = p(data | θ) the equivalence relation, since the likelihood of θ for a particular data sample is equivalent to the probability of that data sample for that value of θ.
What reasons are given in the book for explicitly stating our models (be they statistical, biological, sociological, etc.)?
- To predict.
- To explain.
- To guide data collection.
- To discover new questions.
- To bound outcomes to plausible ranges.
- To illuminate uncertainties.
- To challenge the robustness of prevailing theory through perturbations.
- To reveal the apparently simple to be complex, and the apparently complex to be simple.
Whenever we build a model what questions should we ask?
Whenever we build a model, whether it is statistical, biological or sociological, we should ask: What do we hope to gain by building this model, and how can we judge its success? Only when we have answers to these basic questions should we proceed to model building.
Bayesians are acutely aware that their models are wrong. At best, these simple abstractions can explain some aspect of real behaviour; at worst, they can be very misleading. With this in mind, describe the book's framework for building a model (4 steps).
1 Write down the real-life behaviour that the model should be capable of explaining.
2 Write down the assumptions that you believe are reasonable in order to achieve step 1.
3 Search Chapter 8 for probability models that are based on these assumptions. If necessary, combine different models to produce a resultant model that encompasses all assumptions.
4 After fitting the model to data, test its ability to explain the behaviour identified in step 1. If unsuccessful, go back to step 2 and assess which of your assumptions are likely violated. Then choose a new, more general, probability model that encompasses these new assumptions.
The goal of an analysis is to estimate the probability, θ, that a randomly chosen individual has a disease. We now calculate the probability of each outcome, X, for our sample of one individual, where X = 1 indicates that the individual has the disease and X = 0 that they do not:
Pr(X =0|θ)=(1−θ)
Pr(X =1|θ)=θ.
We want to write down a single rule which yields either of these expressions, depending on whether X = 0 or X = 1.
Give a rule which achieves this, name it and describe why it does
Pr(X = α | θ) = θ^α (1 − θ)^(1−α)
where α ∈ {0,1} is the numeric value of the variable X.
This expression is known as a Bernoulli probability density. It reduces to either of the expressions above, depending on whether the individual is disease-negative or disease-positive:
Pr(X = 0 | θ) = θ^0 (1 − θ)^(1−0) = 1 − θ
Pr(X = 1 | θ) = θ^1 (1 − θ)^(1−1) = θ.
How does this expression behave when we hold θ fixed and vary X, compared with when we hold X fixed and vary θ?
For a fixed value of θ, the sum of the two probabilities (over X = 0 and X = 1) always equals 1, and so this expression is a valid discrete probability density. By contrast, when we hold the data X fixed and vary θ, the expression is a continuous function of θ and the area under its curve is not 1, meaning it is a likelihood.
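The following sketch (mine, not from the book) illustrates both views of the Bernoulli expression: summing over X for a fixed θ gives 1, while integrating over θ for a fixed X gives 1/2, which is not 1:

```python
from scipy.integrate import quad

# Sketch: the Bernoulli expression θ^x (1−θ)^(1−x), viewed two ways.
def bernoulli(x, theta):
    return theta**x * (1 - theta)**(1 - x)

theta = 0.3
# Fixed θ, summing over the outcomes x ∈ {0, 1}: a valid probability distribution.
print(bernoulli(0, theta) + bernoulli(1, theta))   # 1.0

# Fixed x, integrating over θ ∈ [0, 1]: the area is 1/2, not 1, so it is a likelihood.
area, _ = quad(lambda t: bernoulli(1, t), 0, 1)
print(area)   # 0.5
```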
Now imagine that instead of a solitary individual, we have a sample of N individuals, and want to develop a model that yields the probability of obtaining Z disease cases in this sample.
What two assumptions would we have to make in this case? State these assumptions in statistical terms.
We assume that one individual’s disease status does not influence the probability that another individual in the sample has the disease (this would not be satisfied if, for example, the disease were contagious and the sampled individuals lived in close proximity). This assumption is called statistical independence.
We also assume that all individuals in our sample are from the same population.
Combining these two assumptions, we say in statistical language that our data sample is composed of independent and identically distributed observations, or alternatively we say that we have a random sample.
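As a sketch (not from the book) of what these assumptions buy us, the probability of a whole hypothetical sample is just the product of each individual's Bernoulli probability, all sharing the same θ:

```python
import numpy as np

# Sketch: under independence and identical distribution, the probability of a
# whole (hypothetical) sample is the product of the individual Bernoulli terms,
# all sharing the same θ.
def sample_probability(data, theta):
    return np.prod([theta**x * (1 - theta)**(1 - x) for x in data])

data = [1, 0, 0, 1, 0]                          # Z = 2 cases among N = 5 individuals
print(sample_probability(data, theta=0.4))      # 0.4**2 * 0.6**3 ≈ 0.0346
```

Multiplying this product by the number of orderings that give the same number of cases would then yield the probability of obtaining Z cases out of N, i.e. the familiar binomial form.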