Topic 1: Bayesian inference Flashcards
Describe Bayes rule within statistical inference
We can define Bayes’ rule as follows:
g(μ | x) = g(μ) f_μ(x) / f(x)
Here, g(μ|x) denotes the posterior density,
g(μ) the prior density,
f(x) the marginal density of x,
f_μ(x) the probability density function of x given μ.
Equivalently: g(μ | x) = c_x L_x(μ) g(μ)
L_x(μ) is the likelihood function: it is f_μ(x) viewed with x held fixed and μ varying.
c_x is a normalising constant that makes the posterior integrate to 1; since it depends only on x (not μ), it can be ignored when comparing values of μ.
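A minimal numerical sketch of this update in Python, assuming a simple discrete prior over three candidate values of μ and a binomial likelihood (all numbers are purely illustrative):

    import numpy as np

    # Hypothetical discrete prior over three candidate values of mu
    mu = np.array([0.2, 0.5, 0.8])          # candidate parameter values
    prior = np.array([1/3, 1/3, 1/3])       # g(mu): uniform prior

    # Observed data: x successes in n Bernoulli(mu) trials
    x, n = 7, 10
    likelihood = mu**x * (1 - mu)**(n - x)  # L_x(mu), up to a constant

    # Bayes' rule: posterior is proportional to prior * likelihood
    unnormalised = prior * likelihood
    posterior = unnormalised / unnormalised.sum()  # dividing by the marginal f(x)
    print(posterior)                               # g(mu | x) for each candidate value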
Describe Frequentist and Bayesian interpretation of probability
Frequentist:
- Often used in real-world scenarios where no prior information is available
- Describes the data, not the parameter (mu is fixed, x varies)
- Assigns probability to data, not parameters
Bayesian:
- Often the most intuitive
- Describes the parameters, not the data (mu varies, x is fixed)
- Assigns probability to parameters, not data
Describe the use of prior probability densities to characterise prior information
We can use prior probability densities, such as g(μ), to encode what we know about μ before seeing the data; combined with the likelihood, this gives the posterior g(μ|x). Such prior information can take many forms, but one common approach is uninformative priors:
Describe uninformative priors
If we have no prior information but still want to apply Bayes’ rule, we use uninformative priors.
Uninformative here means that the prior introduces as little bias as possible into the final results, so that the data has the strongest influence on the conclusions.
Describe the different methods to make uninformative priors
- Laplace’s Prior Has Issues: It doesn’t handle parameter transformations well.
- Jeffreys’ Prior Fixes This: By using Fisher Information, Jeffreys’ prior adjusts for transformations and remains valid across parameter spaces
Laplace’s principle of insufficient reason: treat all parameter values as equally likely, i.e. give them a uniform prior distribution. It can’t always be used: if θ has a uniform prior, the transformed parameter γ = e^θ will NOT be uniform, so the principle is not invariant under transformation (see the small sketch after this list).
Jeffreys’ priors: rely on Fisher information (a somewhat frequentist quantity), taking g^Jeff(µ) ∝ √I(µ), where I(µ) is the Fisher information. Because Fisher information transforms consistently under reparameterisation, the Jeffreys prior gives equivalent answers whether we work with µ or a transformation of it, which is what makes it preferable to Laplace’s uniform prior.
Flat prior: a flat prior assigns equal probability density across the whole parameter space.
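A small Python sketch of the transformation problem with a flat (Laplace) prior, assuming θ is uniform on [0, 1] and using the transformation γ = e^θ mentioned above (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Flat (uniform) prior on theta over [0, 1]
    theta = rng.uniform(0.0, 1.0, size=100_000)

    # Transform the parameter: gamma = exp(theta), which lives on [1, e]
    gamma = np.exp(theta)

    # The histogram of gamma is NOT flat: its implied density is 1/gamma,
    # so a prior that is uninformative for theta is informative for gamma.
    counts, edges = np.histogram(gamma, bins=10, density=True)
    print(np.round(counts, 2))  # decreasing values, i.e. not uniform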
Describe the Frequentist and the Bayesian ways of inference
Frequentist Approach:
- Method Choice:
Must choose specific statistical methods for each problem
Example: t-test for comparing means, chi-square for categorical data
- Balance Between Distributions:
Methods need to work across different probability distributions
Like having a Swiss Army knife that should work okay in many situations
- “Hoping to do well”:
Methods designed to work reasonably well regardless of true parameter θ
Like building a car that performs okay in all weather conditions
- Multiple Estimators:
Different statistical tools for different questions
Example: Using p-values for hypothesis testing but confidence intervals for estimation
- Scientific Objectivity:
Claims to be more objective because it doesn’t use prior beliefs
Relies purely on data and predefined methods
Bayesian Approach:
- Prior Required:
Must specify your beliefs before seeing data
Example: Based on previous studies, you might believe a drug’s effect is normally distributed around 30%
- Complete Solution:
The posterior distribution tells you everything about your parameter
Like having a full map rather than just directions to one place
- “All in” with Prior:
Your results depend on your prior beliefs
If your prior is wrong, results might be misleading
- Unified Framework:
One approach (posterior distribution) answers all questions
Can get point estimates, intervals, and probabilities all from the same posterior
- Subjectivity Acknowledged:
Openly incorporates prior beliefs
Results depend on quality of prior information
More subjective when good prior information isn’t available
Key Difference in Practice:
Frequentist: “Based on this data, if we repeated this experiment many times, we’d expect…”
Bayesian: “Given our prior knowledge and this data, here’s what we believe about…”
Describe empirical Bayes
Insurance company example: an insurance company wants to know how many claims each policyholder will make in the next year. We assume we don’t know the prior, and after rewriting Bayes’ rule in terms of the marginal counts we arrive at Robbins’ formula.
E[θ | x] = (x + 1) f(x + 1) / f(x), where f(x) is the marginal density of the counts; replacing f by the observed proportion of policyholders with x claims gives the empirical Bayes estimate.
Bayesian estimation can be approximated empirically using observed data when dealing with similar observations.
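A minimal Python sketch of the empirical version of Robbins’ formula on hypothetical claim counts (the claims array below is made up purely for illustration):

    import numpy as np

    # Hypothetical number of claims filed by each policyholder last year
    claims = np.array([0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 1, 0, 0, 2, 0, 1])

    # Empirical marginal frequencies f(x): proportion of policyholders with x claims
    max_x = claims.max()
    f = np.array([(claims == x).mean() for x in range(max_x + 2)])

    # Robbins' formula: E[theta | x] is estimated by (x + 1) * f(x + 1) / f(x)
    # (with so few policyholders these estimates are very noisy; real use needs many)
    for x in range(max_x + 1):
        if f[x] > 0:
            print(x, (x + 1) * f[x + 1] / f[x])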
Describe the difference between subjective and objective Bayesian inference
Objective Bayesian:
Frequentists criticize objective Bayes analysis for weak ties to traditional accuracy standards and reliance on large-sample assumptions. Its effectiveness hinges on the choice of prior, raising doubts about ignoring factors like stopping rules or selective inference without real-world grounding.
Still, objective Bayes methods remain popular for their simplicity in tackling complex problems. However, better theoretical justification and comparisons with other methods are needed to ensure reliability.
Subjective Bayesian:
Subjective Bayesianism is particularly appropriate for individual decision making, say for the business executive trying to choose the best investment in the face of uncertain information.
From the subjectivist point of view, objective Bayes is only partially Bayesian:
- It employs Bayes’ theorem, but without doing the hard work of determining a convincing prior distribution
- This brings in frequentist ideas, for example through Jeffreys’ priors (Jeffreys’ priors use Fisher information to construct the prior, and Fisher information is a frequentist quantity)
Describe what conjugate priors are
It’s a prior distribution that, when combined with the likelihood function L_x(µ), gives a posterior distribution in the same family as the prior.
Its main advantage is computational efficiency: updating the posterior reduces to updating the prior’s parameters.
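A standard illustration is the Beta prior with a binomial likelihood; a minimal Python sketch (the prior parameters and data below are only illustrative):

    # The Beta prior is conjugate to the binomial likelihood:
    # prior Beta(a, b) + data (x successes in n trials) -> posterior Beta(a + x, b + n - x)
    a, b = 2.0, 2.0      # hypothetical prior parameters
    x, n = 7, 10         # observed data

    a_post, b_post = a + x, b + (n - x)            # conjugate update: just add counts
    posterior_mean = a_post / (a_post + b_post)
    print(a_post, b_post, posterior_mean)          # Beta(9, 5), mean ~ 0.64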
Describe Bayes factor and its relation to BIC
The Bayes factor B(x) is the ratio of the marginal densities of the data under the two models; it tells us how much the data favour M₁ over M₀.
BIC (the Bayesian Information Criterion) provides a computationally simple way to approximate the Bayes factor when we do not want to specify priors.
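One common form of this approximation (conventions vary slightly between texts) uses the maximised log-likelihoods, the difference in the number of free parameters, and the sample size n:
log B(x) ≈ log[ f_1(x; MLE_1) / f_0(x; MLE_0) ] − ((p_1 − p_0)/2) · log n
where p_i is the number of free parameters under model M_i and each density is evaluated at that model’s MLE.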
Describe BIC for model selection
We have two models, M_0 and M_1.
M_0 is the null hypothesis and M_1 is the general two-sided alternative.
Frequentist: run a hypothesis test of H_0: µ = 0 (interpreting the result in terms of repeated sampling) to choose between M_0 and M_1
Bayesian: We want an evaluation of the posterior probabilities of M_0 and M_1 given x. We require prior probabilities for the two models. We use Bayes’ theorem (in its odds ratio form, to easily compare the two models).
P(M_1 | x) / P(M_0 | x) = [ P(M_1) / P(M_0) ] × [ f_1(x) / f_0(x) ]   (posterior odds = prior odds × Bayes factor)
This will then result in the Bayes factor
B(x) = f_1(x) / f_0(x)
This works if we have the prior probabilities. When we don’t have informative choices of priors, we use BIC (the Bayesian Information Criterion) instead, which only requires the MLE, the degrees of freedom (number of free parameters), and the sample size.
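A minimal Python sketch of a BIC comparison for M_0: µ = 0 versus M_1: µ free, assuming normally distributed data with known variance 1 (the data are simulated purely for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.4, scale=1.0, size=50)     # hypothetical observations
    n = x.size

    def loglik(mu):
        # log-likelihood of a Normal(mu, 1) model for the data x
        return np.sum(-0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi))

    # BIC = -2 * (maximised log-likelihood) + (free parameters) * log(n)
    bic_0 = -2 * loglik(0.0) + 0 * np.log(n)        # M_0: mu fixed at 0
    bic_1 = -2 * loglik(x.mean()) + 1 * np.log(n)   # M_1: mu at its MLE

    # BIC approximation: 2 * log B(x) is roughly BIC_0 - BIC_1 (positive favours M_1)
    print(bic_0 - bic_1)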
Why are Gibbs sampling and Metropolis-Hastings sampling useful?
Bayesian computations can become very expensive because of the marginal density f(x) in the denominator of Bayes’ rule.
- Instead of computing the exact posterior (which requires difficult integrals)
- You get a collection of samples that represent the posterior distribution
Two methods that do this are Gibbs sampling and Metropolis-Hastings sampling.
Describe Gibbs sampling
Advantage of Gibbs sampling:
- Simplifies high-dimensional sampling by breaking it into conditionals
- Effective for correlated variables, leveraging dependencies
Challenges of Gibbs sampling:
- Can mix slowly if variables are highly dependent
- Requires knowledge of all conditional distributions
How it works:
Iteratively sample each parameter from its conditional distribution while holding all other parameters fixed.
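A minimal Python sketch of a Gibbs sampler for a bivariate normal target with correlation rho (the target, starting point, and number of iterations are all illustrative choices):

    import numpy as np

    rng = np.random.default_rng(2)
    rho = 0.8                       # correlation of the illustrative bivariate normal target
    n_iter = 5_000
    samples = np.zeros((n_iter, 2))
    x1, x2 = 0.0, 0.0               # arbitrary starting point

    for i in range(n_iter):
        # Sample each coordinate from its conditional, holding the other fixed
        x1 = rng.normal(loc=rho * x2, scale=np.sqrt(1 - rho ** 2))
        x2 = rng.normal(loc=rho * x1, scale=np.sqrt(1 - rho ** 2))
        samples[i] = x1, x2

    # Sample means near (0, 0) and sample correlation near rho
    print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])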
Describe Metropolis-Hastings sampling
Advantages:
- Effective for sampling from high dimensional and complex distributions
- Flexible in choice of proposal distribution
Challenges:
- Requires tuning of the proposal distribution for efficient sampling.
- Slow convergence if the target distribution and proposal distribution are poorly matched.
How it works:
Propose new values for all parameters, then accept or reject based on the ratio of posterior probabilities.
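A minimal Python sketch of random-walk Metropolis-Hastings targeting a standard normal (the target and the proposal step size are illustrative; in practice the target would be an unnormalised posterior):

    import numpy as np

    def log_target(theta):
        # Unnormalised log density of the target; a standard normal for illustration
        return -0.5 * theta ** 2

    rng = np.random.default_rng(3)
    n_iter, step = 10_000, 1.0      # step is a tuning choice for the proposal
    theta = 0.0
    samples = np.empty(n_iter)

    for i in range(n_iter):
        proposal = theta + rng.normal(scale=step)   # symmetric random-walk proposal
        log_ratio = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) < log_ratio:       # accept with probability min(1, ratio)
            theta = proposal
        samples[i] = theta

    print(samples.mean(), samples.std())            # close to 0 and 1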