10. Bayesian Statistics Flashcards
Bayes’ Theorem
Elementary Version
P(A|B) = P(A∩B)/P(B)
= P(B|A)P(A) / [P(B|A)P(A)+P(B|A^c)P(A^c)]
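As a quick numeric illustration of the elementary version, here is a minimal sketch in Python; the disease-testing numbers (prevalence, sensitivity, false-positive rate) are hypothetical.

```python
# Elementary Bayes' theorem with hypothetical numbers:
# A = "has disease", B = "test positive".
p_A = 0.01           # prior P(A): prevalence
p_B_given_A = 0.95   # sensitivity P(B|A)
p_B_given_Ac = 0.05  # false-positive rate P(B|A^c)

# Law of total probability gives the denominator P(B)
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 4))
```

Even with a fairly accurate test, the posterior probability of disease given a positive result stays low because the prior P(A) is small.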
Bayes’ Theorem
Events as Discrete Random Variables
P(X=x|Y=y) = P(Y=y|X=x)P(X=x) / [Σ_t P(Y=y|X=t)P(X=t)]
Bayes’ Theorem
Events as Continuous Random Variables
f(x|y) = f(y|x)f(x) / [∫f(y|t)f(t) dt]
Frequentist vs. Bayesian Approach
- to the frequentist, probability is long-run relative frequency
- to the Bayesian, probability is a degree of subjective belief
Statistical Inference
- adopt a probability model for data X, distribution of X depends on parameter θ
- use observed value X=x to make decisions about θ
- translate the decision into a statement about the process that generated the data
Parameter Definition
Frequentist vs. Bayesian
- a frequentist defines a parameter as an unknown constant
- a Bayesian defines a parameter as a random variable
Model
Frequentist vs. Bayesian
- frequentist: f(x)
- Bayesian: f(x|θ) OR p(x|θ)
Bayesian Models
- choose a prior distribution to describe the uncertainty in the parameter: π(θ)
- observe data
- use Bayes’ Theorem to obtain a posterior distribution, π(θ|x)
- this posterior distribution could be used as a prior for the next experiment
Influence of the Prior
- the most frequent objection to Bayesian statistics is the subjectivity of the choice of prior
- however, provided the model is reasonable, the influence of the choice of prior should tend to 0 as the sample size increases
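A small simulation can illustrate this washing-out of the prior; the Beta-Bernoulli conjugate update used here appears later in these cards, and the true parameter, the two priors, and the sample size are all illustrative.

```python
import random

random.seed(0)
theta_true = 0.3
n = 5000
data = [1 if random.random() < theta_true else 0 for _ in range(n)]
s = sum(data)  # number of successes

# Two quite different Beta priors: Beta(1, 1) (uniform) and Beta(20, 2)
# (concentrated near 0.9).  For Bernoulli data the conjugate update gives
# a Beta(a + s, b + n - s) posterior, with posterior mean (a+s)/(a+b+n).
mean_1 = (1 + s) / (1 + 1 + n)
mean_2 = (20 + s) / (20 + 2 + n)

# With n this large, the two posterior means are nearly identical
print(abs(mean_1 - mean_2))
```

Rerunning with a small n (say 10) shows the two posterior means disagreeing noticeably, which is exactly the sample-size effect described above.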
How do you determine the posterior distribution?
π(θ|x) = f(x|θ)π(θ) / [∫f(x|t)π(t)dt]
∝ f(x|θ)π(θ)
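The proportionality form can be turned into a concrete posterior by normalising over a grid of parameter values; this sketch assumes a Bernoulli likelihood with a uniform prior, and the data counts are hypothetical.

```python
# Grid approximation of π(θ|x) ∝ f(x|θ)π(θ)
N = 1000
grid = [(i + 0.5) / N for i in range(N)]  # midpoints in (0, 1)

s, n = 7, 10  # hypothetical data: 7 successes in 10 Bernoulli trials

def likelihood(theta):
    # f(x|θ) for s successes out of n (binomial coefficient cancels
    # in the normalisation, so it is omitted)
    return theta**s * (1 - theta)**(n - s)

prior = [1.0] * N  # uniform prior π(θ)
unnorm = [likelihood(t) * p for t, p in zip(grid, prior)]
norm = sum(unnorm)                       # grid analogue of ∫ f(x|t)π(t) dt
posterior = [u / norm for u in unnorm]   # sums to 1 over the grid

post_mean = sum(t * p for t, p in zip(grid, posterior))
print(round(post_mean, 3))  # should be close to (s+1)/(n+2) = 8/12
```

The exact posterior here is Beta(8, 4), so the grid posterior mean should agree with the closed-form mean 8/12 to within the grid resolution.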
What can you do with a posterior distribution?
- give a point estimate of θ
- test hypotheses
Decision Theory
- any time you make a decision, you can lose something
- risk = expected loss
- the goal is to make decisions that minimise risk
d = d(x) ∈ D
- where d(x) is a decision based on the data and D is the decision space
Decision Space
- the set of all possible decisions that might be made based on the data
- for estimation, D = parameter space
- for hypothesis testing, D = two points
Loss Function
L = L(d(x), θ) ≥ 0
- when X and θ are random, L is a real-valued random variable
Expected Loss
E(L) = E(E(L|X))
= ∫ [∫ L(d(x), θ) dπ(θ|x)] dP(x)
Bayes’ Decision
- any decision d(x) that minimises the posterior expected loss for all x
- meaning that it also minimises the overall expected loss (i.e. risk)
- this is the theoretical basis for using the posterior distribution
Prior Distribution
Beta Distribution
π(θ) = Γ(α+β)/[Γ(α)Γ(β)] * θ^(α-1) * (1-θ)^(β-1)
- for 0 < θ < 1, α > 0, β > 0
Properties of the Beta Distribution
- defined on [0, 1]
- E(θ) = α/(α+β)
- Var(θ) = αβ / [(α+β)²(α+β+1)]
- for α = β = 1, the distribution is uniform
- can assume a variety of shapes depending on α and β
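These closed-form moments can be checked against Monte Carlo draws from the standard library's Beta sampler; the α, β values and the number of draws are arbitrary.

```python
import random

random.seed(1)
alpha, beta_ = 2.0, 5.0

# Closed-form mean and variance of Beta(α, β)
mean = alpha / (alpha + beta_)
var = alpha * beta_ / ((alpha + beta_)**2 * (alpha + beta_ + 1))

# Monte Carlo check using random.betavariate
draws = [random.betavariate(alpha, beta_) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean)**2 for d in draws) / len(draws)

print(round(mean, 4), round(var, 4))
```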
Conjugate Priors
- for some model and prior combinations, the prior and posterior distributions will be from the same family
- e.g. the beta distribution is a conjugate prior for a Bernoulli model
- conjugate priors are very convenient and exist for many models
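For the Beta-Bernoulli pair, conjugacy reduces the posterior computation to a simple parameter update, with no integration needed; the function name and data below are illustrative.

```python
def update_beta_bernoulli(a, b, data):
    """Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta posterior.

    The posterior is Beta(a + #successes, b + #failures).
    """
    s = sum(data)
    return a + s, b + len(data) - s

# Start from a uniform Beta(1, 1) prior and observe 3 successes, 1 failure
a_post, b_post = update_beta_bernoulli(1, 1, [1, 1, 0, 1])
print(a_post, b_post)  # -> 4 2, i.e. a Beta(4, 2) posterior
```

As noted above, this Beta(4, 2) posterior could itself serve as the prior for the next batch of observations.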
Loss Function
Squared Error Loss
- many different functions can be used as the loss function, as long as they satisfy the property that more wrong = greater loss
- e.g. the squared error:
L(d,θ) = k (d-θ)²
- we can drop the proportionality constant k
Minimise Expected Loss
- let μ = E(θ|X=x)
- then:
E(L(d,θ) | X=x) = (d-μ)² + Var(θ|X=x)
- this is minimised when d = μ = E(θ|X=x)
- i.e. the Bayes estimate under squared error loss is the posterior mean
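A numeric check of this result: for a hypothetical discrete posterior, the decision minimising posterior expected squared loss coincides with the posterior mean.

```python
# Hypothetical discrete posterior over a small grid of θ values
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
posterior = [0.1, 0.2, 0.4, 0.2, 0.1]

post_mean = sum(t * p for t, p in zip(grid, posterior))

def expected_loss(d):
    # Posterior expected squared-error loss E[(d - θ)² | X = x]
    return sum((d - t)**2 * p for t, p in zip(grid, posterior))

# Scan candidate decisions d on a fine grid; the minimiser should
# match the posterior mean
candidates = [i / 1000 for i in range(1001)]
best = min(candidates, key=expected_loss)
print(post_mean, best)
```

This is exactly the identity on this card: the expected loss decomposes as (d-μ)² + Var(θ|X=x), so the scan bottoms out at d = μ.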