Chapter 4: Likelihoods Flashcards
Again what is Bayes’ rule cousin? Gun to your head rn say it or you die in front of your wife and kids bruh. WHAT IS BAYES RULE?! SAY IT, SAY IT WITH YOUR CHEST
p(θ | data) = p(data | θ) * p(θ) / p(data)
What is p(data | θ) known as?
Bayesians call this term in the numerator, p(data | θ), the likelihood.
What does this mean in simple, everyday language? Give an example to demonstrate why this nomenclature is used
Imagine that we flip a coin and record its outcome. The simplest model to represent this outcome ignores the angle the coin was thrown at, and its height above the surface, along with any other details. Because of our ignorance, our model cannot perfectly predict the behaviour of the coin. This uncertainty means that our model is probabilistic rather than deterministic: p(data | θ) is the probability that this model assigns to the observed outcome for a given value of θ.
Explain the reasoning behind the extra term in the following equation:
Pr(H,H |θ,Model) = Pr(H |θ,Model)× Pr(H |θ,Model)
Model represents the set of assumptions that we make in our analysis. We generally omit this term on the understanding that it is implicit.
Why, specifically in Bayesian inference, do Bayesians insist on calling p(data | θ) a likelihood, not a probability?
This is because in Bayesian inference we do not keep the parameters of our model fixed. In Bayesian analysis, the data are fixed and the parameters vary. In particular, Bayes’ rule tells us how to calculate the posterior probability density for any value of θ.
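To make this concrete, here is a rough sketch (my own, not from the book) of how the posterior could be computed numerically over a grid of θ values, assuming a Bernoulli coin-flip likelihood, a flat prior, and hypothetical data of one head and one tail:

```python
import numpy as np

# Sketch (not from the book): approximate the posterior over a grid of θ values,
# assuming a Bernoulli coin-flip likelihood, a flat prior, and hypothetical data
# of one head followed by one tail.
theta = np.linspace(0, 1, 101)            # candidate values of θ
prior = np.ones_like(theta)               # uniform prior, p(θ) ∝ 1
likelihood = theta * (1 - theta)          # p(data | θ) for one head then one tail
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()   # grid stand-in for dividing by p(data)

print(theta[np.argmax(posterior)])        # posterior peaks at θ = 0.5
```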
Consider flipping a coin whose inherent bias, θ, is unknown beforehand. Suppose that we flip our coin twice and obtain one head and one tail. How can we use our model to calculate the probability of this data?
In Bayesian inference, we use a sample of coin flip outcomes to estimate a posterior belief in any value of θ. To obtain p(θ |data) we must compute p(data|θ) in the numerator of Bayes’ rule for each possible value of θ. We can use our model to calculate the probability of this data for any value of θ:
Pr(H,T |θ) + Pr(T,H |θ) = θ(1−θ)+θ(1−θ)
= 2θ(1−θ)
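As an illustration, a short sketch (mine, not from the book) that evaluates this probability at a few candidate values of θ, with the data held fixed:

```python
# Sketch: evaluate the probability of one head and one tail, 2θ(1−θ),
# at a few candidate values of θ (data held fixed, parameter varying).
def prob_one_head_one_tail(theta):
    return 2 * theta * (1 - theta)

for theta in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(f"θ = {theta:.2f} -> Pr = {prob_one_head_one_tail(theta):.3f}")
```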
What does this (probability of the data) yield?
This result yields the probability for a fixed data sample (one head and one tail) as a function of θ.
Comment on what we obtain if we graph the probability for a fixed data sample (one head and one tail) as a function of θ.
It might appear that the graph is a continuous probability distribution, but looks can deceive. While all the values of the distribution are non-negative, if we calculate the area underneath the curve we obtain:
∫₀¹ 2θ(1 − θ) dθ = 1/3,
which does not equal 1. Thus our distribution is not a valid probability distribution.
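A quick numerical check of this integral (a sketch using scipy, not from the book):

```python
from scipy.integrate import quad

# Sketch: numerically check that the area under 2θ(1−θ) over θ ∈ [0, 1] is 1/3.
area, _ = quad(lambda theta: 2 * theta * (1 - theta), 0, 1)
print(area)   # ≈ 0.333...
```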
What is the implication of p(data | θ) not being a valid probability distribution when we vary θ?
Hence, when we vary θ, p(data|θ) is not a valid probability distribution. We thus introduce the term likelihood to describe p(data|θ) when we vary the parameter, θ.
What notation is generally used to emphasise that a likelihood is a function of the parameter θ with the data held fixed?
L(θ |data)= p(data|θ)
What is L(θ |data)= p(data|θ) referred to as and why?
We call L(θ | data) = p(data | θ) the equivalence relation, since the likelihood of θ for a particular data sample is equivalent to the probability of that data sample for that value of θ.
What reasons are given in the book for explicitly stating our models (be they statistical, biological, sociological, etc.)?
- To predict.
- To explain.
- To guide data collection.
- To discover new questions.
- To bound outcomes to plausible ranges.
- To illuminate uncertainties.
- To challenge the robustness of prevailing theory through perturbations.
- To reveal the apparently simple to be complex, and the apparently complex to be simple.
Whenever we build a model what questions should we ask?
Whenever we build a model, whether it is statistical, biological or sociological, we should ask: What do we hope to gain by building this model, and how can we judge its success? Only when we have answers to these basic questions should we proceed to model building.
Bayesians are acutely aware that their models are wrong. At best, these simple abstractions can explain some aspect of real behaviour; at worst, they can be very misleading. With this in mind, describe the book's framework for building a model (4 steps).
1 Write down the real-life behaviour that the model should be capable of explaining.
2 Write down the assumptions that you believe are reasonable in order to achieve step 1.
3 Search Chapter 8 for probability models that are based on these assumptions. If necessary, combine different models to produce a resultant model that encompasses all assumptions.
4 After fitting the model to data, test its ability to explain the behaviour identified in step 1. If unsuccessful, go back to step 2 and assess which of your assumptions are likely violated. Then choose a new, more general, probability model that encompasses these new assumptions.
The goal of an analysis is to estimate the probability, θ, that a randomly chosen individual has a disease. We now calculate the probability of each outcome, X, for our sample of one individual, where X = 1 indicates that the individual has the disease and X = 0 that they do not:
Pr(X =0|θ)=(1−θ)
Pr(X =1|θ)=θ.
We want to write down a single rule which yields either of these expressions, depending on whether X = 0 or X = 1.
Give a rule which achieves this, name it and describe why it does
Pr(X = α | θ) = θ^α (1 − θ)^(1−α)
where α ∈ {0,1} is the numeric value of the variable X.
This expression is known as a Bernoulli probability density. It reduces to either of the expressions above, depending on whether the individual is disease-negative or disease-positive:
Pr(X = 0 | θ) = θ^0 (1 − θ)^(1−0) = 1 − θ
Pr(X = 1 | θ) = θ^1 (1 − θ)^(1−1) = θ.
How does this expression behave when we hold θ fixed and vary X, compared with when we hold X fixed and vary θ?
For a fixed value of θ, the sum of the two probabilities (over X = 0 and X = 1) always equals 1, and so this expression is a valid discrete probability density. By contrast, when we hold the data X fixed and vary θ, the expression is a continuous function of θ and the area under its curve is not 1, meaning it is a likelihood.
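The following sketch (mine, not from the book) illustrates both views of the Bernoulli expression: summing over X for a fixed θ gives 1, while integrating over θ for a fixed X gives 1/2, which is not 1:

```python
from scipy.integrate import quad

# Sketch: the Bernoulli expression θ^x (1−θ)^(1−x), viewed two ways.
def bernoulli(x, theta):
    return theta**x * (1 - theta)**(1 - x)

theta = 0.3
# Fixed θ, summing over the outcomes x ∈ {0, 1}: a valid probability distribution.
print(bernoulli(0, theta) + bernoulli(1, theta))   # 1.0

# Fixed x, integrating over θ ∈ [0, 1]: the area is 1/2, not 1, so it is a likelihood.
area, _ = quad(lambda t: bernoulli(1, t), 0, 1)
print(area)   # 0.5
```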
Now imagine that instead of a solitary individual, we have a sample of N individuals, and want to develop a model that yields the probability of obtaining Z disease cases in this sample.
What two assumptions would we have to make in this case? State these assumptions in statistical terms.
We assume that one individual’s disease status does not influence the probability that another individual in the sample has the disease (this would not be satisfied if, for example, the disease were contagious and the sampled individuals lived in close proximity). This assumption is called statistical independence.
We also assume that all individuals in our sample are from the same population.
Combining these two assumptions, we say in statistical language that our data sample is composed of independent and identically distributed observations, or alternatively we say that we have a random sample.
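As a sketch (not from the book) of what these assumptions buy us, the probability of a whole hypothetical sample is just the product of each individual's Bernoulli probability, all sharing the same θ:

```python
import numpy as np

# Sketch: under independence and identical distribution, the probability of a
# whole (hypothetical) sample is the product of the individual Bernoulli terms,
# all sharing the same θ.
def sample_probability(data, theta):
    return np.prod([theta**x * (1 - theta)**(1 - x) for x in data])

data = [1, 0, 0, 1, 0]                          # Z = 2 cases among N = 5 individuals
print(sample_probability(data, theta=0.4))      # 0.4**2 * 0.6**3 ≈ 0.0346
```

Multiplying this product by the number of orderings that give the same number of cases would then yield the probability of obtaining Z cases out of N, i.e. the familiar binomial form.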