Week 6 Flashcards
In the context of Bayesian asymptotics, what does it mean for a posterior distribution π(θ|xn) to be consistent?
It means that as the sample size n increases, the posterior distribution concentrates its mass arbitrarily close to the true parameter value θ₀. Formally, π(θ|xn) converges weakly to δ_{θ₀}(θ) (a point mass at θ₀) P_{θ₀}-almost surely.
State the intuitive meaning of posterior consistency using neighborhoods.
For any neighborhood U(θ₀) around the true parameter θ₀, the posterior probability of θ lying within that neighborhood converges to 1 as n → ∞ (P_{θ₀}-a.s.). That is, ∫_{U(θ₀)} π(θ|xn)dθ → 1.
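A minimal numerical sketch of this neighborhood statement, assuming an illustrative Beta(1, 1) prior and Bernoulli(θ₀ = 0.3) data (these choices are not from the cards): the conjugate posterior is Beta(1 + s, 1 + n - s), and the mass it places on a small interval around θ₀ climbs toward 1 as n grows.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta0, eps = 0.3, 0.05                 # illustrative true value and neighborhood half-width
x = rng.binomial(1, theta0, size=100_000)

for n in (10, 100, 1_000, 10_000, 100_000):
    s = x[:n].sum()
    post = beta(1 + s, 1 + n - s)       # conjugate posterior under a Beta(1, 1) prior
    mass = post.cdf(theta0 + eps) - post.cdf(theta0 - eps)
    print(f"n = {n:>6}: posterior mass on (theta0 - eps, theta0 + eps) = {mass:.4f}")
```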
What condition does Doob’s Theorem require for posterior consistency?
It requires the statistical model to be identifiable (i.e., P_θ ≠ P_{θ'} if θ ≠ θ').
What does Doob’s Theorem guarantee about posterior consistency?
It guarantees that the posterior distribution π(θ|xn) will be consistent for all θ₀ in a set Θ₀ that has full measure under the prior π₀ (i.e., ∫_{Θ₀} π₀(θ)dθ = 1). Consistency holds except possibly on a set of θ values with prior measure zero.
What is a practical criticism of Doob’s Theorem?
It guarantees consistency only up to a set of prior measure zero, and that exceptional set can still matter in practice. For example, if the prior concentrates its mass at zero, the set of all non-zero parameter values has prior measure zero, so the 'technical' consistency says nothing about them and can be misleading.
How can posterior consistency be checked using the posterior mean and variance?
If E[θ|xn] → θ₀ and Var[θ|xn] → 0 as n → ∞ (P_{θ₀}-a.s.), then the posterior is consistent: by Chebyshev's inequality, the posterior mass outside any fixed neighborhood of θ₀ vanishes.
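A quick check of this criterion in the same illustrative Beta-Bernoulli setup: the posterior mean and variance are available in closed form, so one can watch E[θ|xn] approach θ₀ and Var[θ|xn] shrink to 0.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.3                                         # illustrative true value
x = rng.binomial(1, theta0, size=100_000)

for n in (10, 100, 1_000, 10_000, 100_000):
    s = x[:n].sum()
    a, b = 1 + s, 1 + n - s                          # Beta(1, 1) prior -> Beta(a, b) posterior
    mean = a / (a + b)                               # E[theta | x_n]
    var = a * b / ((a + b) ** 2 * (a + b + 1))       # Var[theta | x_n]
    print(f"n = {n:>6}: posterior mean = {mean:.4f}, posterior variance = {var:.2e}")
```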
What does the convergence in Total Variation (TV) distance between posteriors derived from different priors (π₁ and π₂) imply asymptotically?
It implies that ||π₁(θ|xn) - π₂(θ|xn)||_TV → 0 as n → ∞ (P_{θ₀}-a.s.), provided both priors assign positive mass to neighborhoods of the true θ₀. This shows the influence of the prior diminishes as the sample size increases.
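A rough numerical illustration, again under an assumed Beta-Bernoulli model: the posteriors obtained from two different priors (Beta(1, 1) and Beta(5, 2), chosen arbitrarily) can be compared in total variation on a grid, and the distance shrinks as n grows.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
theta0 = 0.3                                         # illustrative true value
x = rng.binomial(1, theta0, size=100_000)
grid = np.linspace(1e-6, 1 - 1e-6, 20_001)
dx = grid[1] - grid[0]

for n in (10, 100, 1_000, 10_000, 100_000):
    s = x[:n].sum()
    p1 = beta.pdf(grid, 1 + s, 1 + n - s)            # posterior under a Beta(1, 1) prior
    p2 = beta.pdf(grid, 5 + s, 2 + n - s)            # posterior under a Beta(5, 2) prior
    tv = 0.5 * np.sum(np.abs(p1 - p2)) * dx          # TV distance = half the L1 distance
    print(f"n = {n:>6}: TV distance between posteriors = {tv:.4f}")
```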
What is the typical limiting distribution of the Maximum Likelihood Estimator (MLE) θ̂_n^ML in frequentist statistics?
√n (θ̂_n^ML - θ₀) converges in distribution to a Normal distribution N(0, I(θ₀)⁻¹), where I(θ₀) is the Fisher Information matrix at the true value θ₀.
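A small Monte Carlo sanity check of this limit, using a hypothetical Exponential model with rate λ₀, where the MLE is 1/x̄ and I(λ) = 1/λ², so √n(λ̂ - λ₀) should have standard deviation close to λ₀.

```python
import numpy as np

rng = np.random.default_rng(3)
lam0, n, reps = 2.0, 1_000, 2_000                    # illustrative rate, sample size, replications
samples = rng.exponential(scale=1 / lam0, size=(reps, n))   # Exp(rate lam0) <=> scale 1/lam0
mle = 1.0 / samples.mean(axis=1)                     # MLE of the rate in each replication
z = np.sqrt(n) * (mle - lam0)

print(f"empirical std of sqrt(n) * (MLE - lam0): {z.std():.3f}")
print(f"theoretical std sqrt(I(lam0)^-1) = lam0: {lam0:.3f}")
```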
What is the main result of the Bernstein-von Mises (BvM) theorem regarding the asymptotic shape of the posterior distribution?
Under regularity conditions, the posterior distribution π(θ|xn), when properly centered and scaled, converges to a Normal distribution.
According to BvM, what Normal distribution approximates the posterior distribution of θ|xn for large n?
θ|xn ≈ N_p(θ̂_n^ML, [Î_n(θ̂_n^ML)]⁻¹), where θ̂_n^ML is the MLE and Î_n(θ̂_n^ML) is the observed Fisher information matrix evaluated at the MLE.
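A sketch of this approximation for an assumed Bernoulli model with a Beta(2, 2) prior: the exact posterior is Beta(2 + s, 2 + n - s), while the BvM normal approximation is N(θ̂, θ̂(1 - θ̂)/n), since the observed information for a Bernoulli sample is n/(θ̂(1 - θ̂)).

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(4)
theta0, n = 0.3, 5_000                               # illustrative true value and sample size
x = rng.binomial(1, theta0, size=n)
s = x.sum()
mle = s / n

grid = np.linspace(0.25, 0.35, 2_001)
exact = beta.pdf(grid, 2 + s, 2 + n - s)             # exact posterior under a Beta(2, 2) prior
approx = norm.pdf(grid, loc=mle, scale=np.sqrt(mle * (1 - mle) / n))  # BvM normal approximation
print(f"peak posterior density            : {exact.max():.2f}")
print(f"max |exact - normal approximation|: {np.max(np.abs(exact - approx)):.2f}")
```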
How does the BvM theorem relate Bayesian credible sets and frequentist confidence intervals?
It implies that for large n, Bayesian credible sets and frequentist confidence intervals (based on the MLE) tend to coincide, suggesting Bayesian inference is asymptotically calibrated from a frequentist perspective.
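Continuing the same assumed Bernoulli example, one can compare the 95% equal-tailed credible interval from the Beta posterior with the 95% Wald confidence interval built from the MLE; for large n they nearly coincide.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(5)
theta0, n = 0.3, 5_000                               # illustrative true value and sample size
x = rng.binomial(1, theta0, size=n)
s = x.sum()
mle = s / n

cred = beta.ppf([0.025, 0.975], 2 + s, 2 + n - s)    # 95% equal-tailed credible interval
se = np.sqrt(mle * (1 - mle) / n)
wald = mle + norm.ppf([0.025, 0.975]) * se           # 95% Wald confidence interval from the MLE
print("95% credible interval:", np.round(cred, 4))
print("95% Wald interval:    ", np.round(wald, 4))
```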
Does the prior π₀(θ) influence the limiting Normal distribution in the BvM theorem?
No, the prior term vanishes asymptotically, showing the diminishing influence of the prior as data accumulates.
What is the Laplace Approximation used for?
It provides an analytical approximation to the marginal likelihood m(xn) = ∫ L(θ,xn)π₀(θ) dθ, which is often intractable, especially in high dimensions.
Define the ‘energy function’ g(θ) in the context of Laplace Approximation.
g(θ) = -log[L(θ, xn)π₀(θ)], i.e., the negative logarithm of the unnormalized posterior density (kernel).
What is the core idea behind the Laplace Approximation of m(xn)?
Approximate g(θ) by a quadratic function (its second-order Taylor expansion) around its minimum θ̃ (which is the posterior mode), then integrate the resulting Gaussian function analytically.
Let θ̃ be the posterior mode (minimizing g(θ)) and H(θ̃) = ∇²g(θ̃) be the Hessian matrix at the mode. What is the Laplace approximation formula for m(xn)?
m(xn) ≈ exp{-g(θ̃)} * (2π)^{p/2} * det[H(θ̃)]^{-1/2} = L(θ̃, xn)π₀(θ̃) * (2π)^{p/2} * det[H(θ̃)]^{-1/2}
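A minimal sketch of this formula in an assumed Beta(a, b)-Bernoulli model, where the exact marginal likelihood m(xn) = B(a + s, b + n - s)/B(a, b) is available for comparison; the posterior mode and the Hessian of g are found numerically here purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

rng = np.random.default_rng(6)
theta0, n, a, b = 0.3, 200, 2.0, 2.0                 # illustrative true value, sample size, prior
x = rng.binomial(1, theta0, size=n)
s = x.sum()

def g(theta):
    """Energy function g(theta) = -log[ L(theta, x_n) * prior(theta) ]."""
    loglik = s * np.log(theta) + (n - s) * np.log(1 - theta)
    logprior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta) - betaln(a, b)
    return -(loglik + logprior)

# Posterior mode (minimizer of g) and a finite-difference Hessian at the mode.
theta_tilde = minimize_scalar(g, bounds=(1e-6, 1 - 1e-6), method="bounded").x
h = 1e-4
hess = (g(theta_tilde + h) - 2 * g(theta_tilde) + g(theta_tilde - h)) / h**2

# Laplace approximation with p = 1: log m(x_n) ~ -g(mode) + (1/2) log(2*pi) - (1/2) log(Hessian)
log_m_laplace = -g(theta_tilde) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)
log_m_exact = betaln(a + s, b + n - s) - betaln(a, b)
print(f"log m(x_n): Laplace = {log_m_laplace:.4f}, exact = {log_m_exact:.4f}")
```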
What condition makes the Laplace approximation accurate?
The approximation works well when the posterior distribution is unimodal and well-approximated by a Gaussian distribution, typically occurring for large sample sizes (due to BvM).
What quantity does the Hessian H(θ̃) in the Laplace approximation correspond to?
It is the observed information matrix evaluated at the posterior mode θ̃ (also called the generalized observed information matrix if the prior is included).