Chapter 5: Priors Flashcards
What part of Bayes’ theorem comprises the prior?
The p(θ) in the equation p(θ | data) = p(data | θ) × p(θ) / p(data) is the prior.
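In words (standard Bayesian terminology, not specific to this chapter): posterior = likelihood × prior / evidence. That is, p(θ | data) is the posterior, p(data | θ) the likelihood, p(θ) the prior, and p(data) the evidence (the normalising constant).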
Chapter 4 introduced the concept of a likelihood and how this can be used to derive Frequentist estimates of parameters using the method of maximum likelihood.
What does this presuppose? Comment on this
This presupposes that the parameters in question are immutable, fixed quantities that actually exist and can be estimated by methods that can be repeated, or imagined to be repeated, many times
Is it reasonable to assume the parameters in question are fixed?
Gill (2007) indicates this is unrealistic for the vast majority of social science research: it is simply not possible to rerun elections, repeat surveys under exactly the same conditions, replay the stock market with exactly matching market forces, or re-expose clinical subjects to identical stimuli.
Furthermore, since parameters only exist because we have invented a model, we should be suspicious of any analysis which assumes they have a single ‘true’ value.
Gelman et al. (2013) suggest that there are two different interpretations of parameter probability distributions
What are these and why are they relevant?
The subjective state of knowledge interpretation, where we use a probability distribution to represent our uncertainty over a parameter’s true value; and the more objective population interpretation, where the parameter’s value varies between different samples we take from a population distribution
In both viewpoints, the model parameters are not viewed as static, unwavering constants as in Frequentist theory
If we adopt the state of knowledge viewpoint, what does our prior probability distribution represent?
The prior probability distribution represents our pre-data uncertainty for a parameter’s true value.
For example, imagine that a doctor gives their probability that an individual has a particular disease before the results of a blood test become available. Using their knowledge of the patient’s history, and their expertise on the particular condition, they assign a prior disease probability of 75%
Alternatively, imagine we want to estimate the proportion of the UK population that has this disease. Based on previous analyses, we probably have an idea of the underlying prevalence and of our uncertainty in this value. In this case, the prior is continuous and represents our beliefs about the prevalence
See figures for graphs of these examples
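A minimal numerical sketch of the doctor example, assuming hypothetical test characteristics (the sensitivity and specificity below are invented for illustration; the chapter does not specify them), showing how the 75% prior would be updated once a positive blood test arrives:

```python
prior = 0.75          # doctor's pre-test (prior) probability of disease
sensitivity = 0.95    # Pr(positive test | disease)      -- assumed value
specificity = 0.90    # Pr(negative test | no disease)   -- assumed value

# Bayes' rule: Pr(disease | positive) =
#   Pr(positive | disease) * Pr(disease) / Pr(positive)
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = sensitivity * prior / p_positive

print(f"Posterior probability of disease: {posterior:.3f}")  # ~0.966
```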
If we instead adopt the population viewpoint, what does our prior probability distribution represent?
Adopting the population perspective, we imagine the value of a parameter is drawn from a population distribution, which is represented by our prior.
For the disease prevalence example, we imagine the observed data sample is partly determined by the characteristics of the subpopulations from which the individuals were drawn. The other variability is sampling variation within those subpopulations. Here we can view the individual subpopulation characteristics as drawn from an overall population distribution of parameters, representing the entirety of the UK.
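A rough sketch of this population view, with all numbers invented for illustration: each subpopulation's prevalence is drawn from an overall population distribution (a Beta distribution is an assumed, convenient choice), and the observed counts then add sampling variation on top.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population distribution of subpopulation prevalences:
# Beta(2, 38) has mean 0.05, chosen purely for illustration.
n_subpops = 10
prevalences = rng.beta(2, 38, size=n_subpops)

# Sampling variation within each subpopulation.
sample_size = 500
cases = rng.binomial(sample_size, prevalences)

for theta, k in zip(prevalences, cases):
    print(f"subpopulation prevalence {theta:.3f} -> observed {k}/{sample_size}")
```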
Is the prior always a valid probability distribution?
The prior is always a valid probability distribution and can be used to calculate prior expectations of a parameter’s value.
Why do we even need priors at all?
Bayes’ rule is really only a way to update our initial beliefs in light of data:
initial beliefs ={Bayes' rule + data}=> new beliefs
Another question that can be asked is: Why can’t we simply let the prior weighting be constant across all values of θ?
Firstly how would we achieve this?
Set p(θ) = 1 in the numerator of Bayes' rule, resulting in a posterior that takes the form of a normalised likelihood:
p(θ | data) = p(data | θ) / p(data)
This would surely mean we can avoid choosing a prior and, hence, thwart attempts to denounce Bayesian statistics as more subjective than Frequentist approaches. So why do we not do just that? Give two reasons
1) There is a pedantic, mathematical argument against this: p(θ) must be a valid probability distribution to ensure that the posterior is similarly valid. If our parameter is unbounded and we choose p(θ) = 1 (or in fact any positive constant), then the integral of p(θ) over all parameter values (for a continuous parameter) is infinite, and so p(θ) is not a valid probability distribution (see the short derivation after this list).
2) Another, perhaps more persuasive, argument is that assuming all parameter values are equally probable can result in nonsensical conclusions being drawn.
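The divergence in reason 1 can be stated explicitly. For an unbounded continuous parameter and any constant c > 0, the prior must integrate to 1 to be a valid density, but ∫ p(θ) dθ = ∫ c dθ = ∞ when the integral runs over the whole real line, so a constant p(θ) cannot be normalised.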
What if you use a prior which is not a valid probability distribution? Can you still get a valid probability distribution in the posterior?
Even if the prior is not a valid probability distribution, the resultant posterior can sometimes satisfy the properties of one. However, take care when using these distributions for inference: they are not technically posterior probability distributions, because Bayes' rule requires us to use a valid prior distribution. Such posteriors should be viewed, at best, as limiting cases obtained as the parameter values of a proper prior distribution tend to ±∞.
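A standard illustration (not taken from this chapter): for data x1, ..., xn drawn from N(μ, σ²) with σ² known, the improper flat prior p(μ) ∝ 1 still yields the proper posterior μ | data ~ N(x̄, σ²/n), which integrates to 1. It can equally be viewed as the limit of the posteriors obtained from proper N(μ0, τ²) priors as τ → ∞.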
Name a perhaps more persuasive, more intuitive argument against avoiding the choice of prior by setting p(θ) to a constant (unity) value
Assuming all parameter values are equally probable can result in nonsensical conclusions being drawn.
Using a coin flip example, demonstrate that assuming all parameter values are equally probable can result in nonsensical conclusions being drawn
Suppose we want to determine whether a coin is fair, with an equal chance of heads and tails, or biased, with a very strong weighting towards heads. If the coin is fair, θ = 1; if it is biased, θ = 0. Imagine that the coin is flipped twice, with the result {H, H}. Assuming a uniform prior results in a strong posterior weighting towards the coin being biased. This is because, if we assume that the coin is biased, then the probability of obtaining two heads is high, whereas, if we assume that the coin is fair, the probability of obtaining this result is only 1/4. The maximum likelihood estimate (which coincides with the posterior mode due to the flat prior) is hence that the coin is biased. Since most coins we encounter are fair, concluding bias from only two flips is nonsensical; a prior weighted towards fairness avoids this.
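A minimal numerical sketch of this, assuming for illustration that the biased coin has a 0.9 probability of heads (the chapter only says "a very strong weighting towards heads"):

```python
# Two hypotheses: a fair coin and a coin biased towards heads.
# The 0.9 heads probability for the biased coin is an assumed value.
p_heads = {"fair": 0.5, "biased": 0.9}

data = ["H", "H"]  # observed flips

# Likelihood of {H, H} under each hypothesis.
likelihood = {h: p_heads[h] ** len(data) for h in p_heads}  # fair: 0.25, biased: 0.81

# Flat (uniform) prior over the two hypotheses.
prior = {"fair": 0.5, "biased": 0.5}

# Posterior via Bayes' rule.
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(posterior)  # biased receives ~0.76 of the posterior mass
```

If instead the prior put, say, 0.99 on the coin being fair, the same data would leave a posterior probability of roughly 0.97 that the coin is fair, which matches intuition far better after only two flips.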
Why is the Bayesian approach of choosing a prior seen as more honest in the eyes of some Bayesians?
All analysis involves a degree of subjectivity, particularly the choice of a statistical model. This choice is often viewed as objective, with little justification for the underlying assumptions necessary to arrive there. The choice of prior is at least explicit, leaving this aspect of Bayesian modelling subject to the same academic examination to which any analysis should be subjected. The statement of pre-experimental biases actually forces the analyst to self-examine and perhaps also reduces the temptation to manipulate the analysis to serve one’s own ends.
Describe the structure of a Bayes’ box with the following example:
Imagine a bowl of water covered with a cloth, containing five fish, each of which is either red or white. We want to estimate the total number of red fish in the bowl after we pick out a single fish and find it to be red. Before we pulled the fish out of the bowl, we had no strong belief in there being a particular number of red fish, so suppose that all possibilities (0 to 5) are equally likely, each with probability 1/6 in our discrete prior. Further, suppose that the random variable X ∈ {0, 1} indicates whether the sampled fish is white (X = 0) or red (X = 1). As before, we choose a Bernoulli likelihood:
Pr(X = 1 | Y = a) = a/5
where a ∈ {0, 1, 2, 3, 4, 5} represents the possible number of red fish in the bowl (the value of Y), and X = 1 indicates that the single fish we sampled is red.
We start by listing the possible numbers of red fish in the bowl in the leftmost column. In the second column, we specify our prior probability for each of these numbers of red fish. In the third column, we calculate the likelihood of each outcome using Pr(X = 1 | Y = a) = a/5. In the fourth column, we multiply the prior by the likelihood (the numerator of Bayes' rule); summing this column yields Pr(X = 1) = 1/2, the denominator of Bayes' rule. Normalising the fourth column by this sum yields the posterior distribution, shown in the fifth column. See the table in the text, or the numerical sketch below.
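A minimal sketch reproducing this Bayes' box numerically (the variable names are mine, not the book's):

```python
from fractions import Fraction

# Possible numbers of red fish in the bowl.
outcomes = range(6)

# Uniform discrete prior: each count from 0 to 5 has probability 1/6.
prior = {a: Fraction(1, 6) for a in outcomes}

# Likelihood of sampling a red fish when there are a red fish in the bowl:
# Pr(X = 1 | Y = a) = a/5.
likelihood = {a: Fraction(a, 5) for a in outcomes}

# Numerator of Bayes' rule, and its sum (the denominator, Pr(X = 1) = 1/2).
numerator = {a: prior[a] * likelihood[a] for a in outcomes}
evidence = sum(numerator.values())

posterior = {a: numerator[a] / evidence for a in outcomes}

print("a  prior  likelihood  prior*likelihood  posterior")
for a in outcomes:
    print(a, prior[a], likelihood[a], numerator[a], posterior[a])
```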