Topic 5: Confidence intervals and cross-validation Flashcards
What are confidence intervals
Confidence intervals are the frequentist solution to quantifying uncertainty, and by far the most popular in applied practice. A 95% confidence interval means that if we were to repeat the sampling process many times, 95% of the resulting intervals would contain the true parameter. It does not assign a probability to the parameter lying in any single observed interval.
The standard confidence interval is described as:
$\hat\theta \pm z^{(\alpha)}\,\hat{se}$ (e.g. $\hat\theta \pm 1.96\,\hat{se}$ for 95% coverage), where $\hat{se}$ is the estimated standard error of $\hat\theta$.
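As a minimal sketch (assuming NumPy is available and taking the sample mean as the estimator; the simulated sample is purely illustrative), the standard 95% interval can be computed like this:

```python
import numpy as np

def standard_interval(x, z=1.96):
    """Standard (normal-theory) interval: theta_hat +/- z * se_hat."""
    theta_hat = np.mean(x)                          # point estimate (here: the sample mean)
    se_hat = np.std(x, ddof=1) / np.sqrt(len(x))    # estimated standard error of the mean
    return theta_hat - z * se_hat, theta_hat + z * se_hat

x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=100)  # simulated sample
print(standard_interval(x))  # lower and upper 95% limits
```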
What is Neyman’s construction
If we have a parameter θ and we want a confidence interval for its estimate, we can use Neyman’s construction, given a parametric model for the estimator (e.g. a normal distribution).
To construct the interval we use the density $f_\theta(\hat\theta)$ of the estimate under candidate parameter values. The confidence interval bounds are found by solving two integral equations:
$\displaystyle\int_{\hat\theta}^{\infty} f_{\theta_{lo}}(t)\, dt = 0.025, \qquad \int_{-\infty}^{\hat\theta} f_{\theta_{up}}(t)\, dt = 0.025$
These integrals set the area in the right tail (beyond $\hat\theta$, under $\theta_{lo}$) and in the left tail (below $\hat\theta$, under $\theta_{up}$) equal to 0.025 each. This gives the lower and upper bounds of the 95% confidence interval for our estimate.
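A minimal numerical sketch of the construction, assuming a normal model with a known standard error (the values of $\hat\theta$ and the standard error are hypothetical), solving the two tail equations with `scipy.optimize.brentq`:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

theta_hat, se = 10.0, 2.0   # assumed observed estimate and its (known) standard error

# Right-tail equation: P_{theta_lo}(T >= theta_hat) = 0.025
theta_lo = brentq(lambda th: norm.sf(theta_hat, loc=th, scale=se) - 0.025,
                  theta_hat - 10 * se, theta_hat)
# Left-tail equation: P_{theta_up}(T <= theta_hat) = 0.025
theta_up = brentq(lambda th: norm.cdf(theta_hat, loc=th, scale=se) - 0.025,
                  theta_hat, theta_hat + 10 * se)

print(theta_lo, theta_up)   # for the normal case this reproduces theta_hat +/- 1.96 * se
```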
What is the percentile method
The goal is to automate the calculation of the confidence intervals, so given the bootstrap distribution of a statistical estimator $\hat θ$, we want to automatically produce an appropriate confidence interval for the UNSEEN parameter θ (the true one that is fixed, but unknown). One of four methods to do this is by using the percentile method (bootstrap confidence interval).
It uses the shape of the bootstrap distribution to improve upon the standard intervals $\hat\theta \pm z^{(\alpha)}\,\hat{se}$.
So, having generated B replications $\hat\theta^{*1}, \hat\theta^{*2}, \dots, \hat\theta^{*B}$, either nonparametrically or parametrically, we use the percentiles of their distribution to define the percentile confidence limits, e.g. the 0.025 and 0.975 percentiles for a 95% interval.
The percentile interval is also transformation invariant.
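A minimal sketch of the nonparametric percentile interval for a sample mean (the skewed sample and the number of replications B are assumptions of the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=80)     # an assumed skewed sample

B = 2000
theta_star = np.array([np.mean(rng.choice(x, size=len(x), replace=True))
                       for _ in range(B)])  # nonparametric bootstrap replications

# Percentile interval: read off the 2.5% and 97.5% points of the bootstrap distribution
lo, up = np.quantile(theta_star, [0.025, 0.975])
print(lo, up)
```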
Explain the BC and BCa methods
BC Method:
Fixes the Bias Problem:
- Tracks how often bootstrap estimates are below original estimate, with p_0
- Shifts the percentiles to account for this bias; the limits could shift to 1% and 99% instead of 2.5% and 97.5%
- Like saying “if 60% of bootstrap estimates are below original, we need to adjust our interval”
BCa Method:
- Does everything BC does PLUS:
Adds “acceleration” (a) to handle:
- Changing variance (heteroskedasticity)
- Skewness in the distribution
- Why Jackknife for Estimating ‘a’:
The jackknife is particularly good for this because:
- It provides a way to estimate the rate of change in the statistic when systematically removing one observation at a time
Even when we do not have direct access to unbiased, constant-variance bootstrap statistics, we can still remove the bias and take the changing variance into account.
BC and BCa properties:
Imagine we have B bootstrap replications: $\hat\theta^{*1}, \hat\theta^{*2}, \dots, \hat\theta^{*B}$. Each one is a replication of our statistic $\hat\theta$.
$p_0$ is the proportion of bootstrap replications less than or equal to $\hat\theta$, also defined as:
$p_0 = \dfrac{\#\{\hat\theta^{*b} \le \hat\theta\}}{B}$
The bias corrected value will be:
$z_0 = \Phi^{-1}(p_0)$, where $\Phi^{-1}$ is the inverse of the standard normal CDF
BC properties:
- Assumption of constant standard error
- Correct probabilities in the Fisherian sense.
- Note that there is no additional computational burden.
BCa properties:
BC is not second-order accurate because it assumes a constant standard error; to address this, BCa uses another factor: the acceleration constant.
$a$ (the acceleration constant) can be estimated with the jackknife: recomputing the statistic with one observation removed at a time measures the rate of change in the statistic, which is exactly what the acceleration captures (see the sketch below).
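A minimal sketch of both intervals, assuming NumPy/SciPy and a simulated skewed sample; setting the acceleration to zero gives BC, and the jackknife supplies the acceleration for BCa:

```python
import numpy as np
from scipy.stats import norm

def bc_bca_interval(x, stat=np.mean, B=2000, alpha=0.05, accelerated=True, seed=0):
    """Percentile interval with bias correction (BC) and, optionally, acceleration (BCa)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    theta_hat = stat(x)
    theta_star = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])

    # Bias correction: z0 = Phi^{-1}(p0), p0 = proportion of replications <= theta_hat
    p0 = np.mean(theta_star <= theta_hat)
    z0 = norm.ppf(p0)

    # Acceleration constant a estimated by the jackknife (a = 0 recovers plain BC)
    a = 0.0
    if accelerated:
        theta_jack = np.array([stat(np.delete(x, i)) for i in range(n)])
        d = theta_jack.mean() - theta_jack
        a = np.sum(d**3) / (6.0 * np.sum(d**2) ** 1.5)

    # Shift the nominal percentiles according to z0 and a, then read off the quantiles
    z_lo, z_hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
    a_lo = norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
    a_hi = norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
    return np.quantile(theta_star, [a_lo, a_hi])

x = np.random.default_rng(2).exponential(scale=3.0, size=60)  # a skewed sample
print(bc_bca_interval(x, accelerated=False))  # BC interval
print(bc_bca_interval(x))                     # BCa interval
```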
Explain the differences between Neyman’s construction, the percentile method, and the BC and BCa methods
Neyman’s Construction: When you have a well-specified parametric model (e.g., normal distribution).
Percentile Method: When the population distribution is unknown but you need a quick, simple solution.
BC Method: When bias is suspected, but variance seems stable.
BCa Method: When both bias and variance issues exist, and you want the most accurate confidence intervals.
Key Difference:
Neyman is parametric and theoretical.
Percentile is non-parametric and empirical.
BC adjusts bias; BCa adjusts bias and variance.
Application of confidence intervals
- Estimate population parameters (e.g., mean, proportion).
- Test hypotheses (e.g., rejecting null hypothesis).
- Inform decisions in business and policy.
- Evaluate and compare statistical models.
- Assess risk in finance and forecasting.
- Communicate uncertainty in scientific and industrial studies.
Key Idea: Confidence intervals provide a range of plausible values for an unknown parameter, quantifying uncertainty based on sampling behavior.
Describe Bayes’ credibility intervals
A Bayes credible interval is the Bayesian version of a confidence interval. It tells you the range in which a parameter (like the true mean or effect size) is likely to be, given the data and your prior beliefs.
Instead of saying “if we repeated this experiment 100 times, the true value would be in this range 95% of the time” (frequentist), Bayes credible intervals say: “Given the data I have and my prior knowledge, there’s a 95% chance the true value is in this range.”
A Bayesian interval that reflects the probability the parameter lies in a certain range, based on both data and prior beliefs.
Old:
The Bayes’ credibility intervals represent the central region of the posterior distribution where the parameter lies with a given probability (e.g. 95%) based on both the data and the prior distribution.
Given a one-parameter family of densities $f_θ(\hat θ)$ and a prior density $g(θ)$, Bayes’ rule produces this posterior density of $\theta$:
$g(\theta \mid \hat\theta) = \dfrac{g(\theta)\, f_\theta(\hat\theta)}{\int g(\theta')\, f_{\theta'}(\hat\theta)\, d\theta'}$
The Bayes’ 0.95 credible interval $C(θ|\hat θ)$ spans the central 0.95 region of the posterior density, $g(θ|\hat θ)$:
$\displaystyle\int_{C(\theta \mid \hat\theta)} g(\theta \mid \hat\theta)\, d\theta = 0.95$
This will make sure the posterior probability is divided evenly to each tail region, e.g. 0.025 in each tail region.
We are interested in matching priors: Bayesian priors chosen so that the credible intervals approximate the Neyman confidence intervals and thus behave like frequentist intervals in terms of coverage probability (we want the frequentist statement “95% of intervals constructed this way contain the true value when the experiment is repeated many times”, not “θ lies between these limits with probability 95%”).
An example of this is Jeffreys’ prior, as it provides a generally accurate matching prior for one-parameter problems.
Matching is difficult when dealing with multiparameter families, where we must first remove the nuisance parameters.
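A minimal sketch for a binomial proportion under the Jeffreys Beta(1/2, 1/2) prior (the trial and success counts are assumed for illustration); the central 95% credible interval leaves 0.025 posterior probability in each tail:

```python
from scipy.stats import beta

n, s = 50, 17                             # assumed number of trials and successes
a_post, b_post = 0.5 + s, 0.5 + (n - s)   # Beta posterior under the Jeffreys Beta(1/2, 1/2) prior

# Central 95% credible interval: 0.025 posterior probability in each tail
lo, up = beta.ppf([0.025, 0.975], a_post, b_post)
print(lo, up)
```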
Describe confidence density
What It Is:
- Think of it as a continuous version of confidence intervals
- Instead of just one interval (like 95%), it shows ALL possible intervals
- The height at any point represents how much confidence we place on that value of the parameter (areas under the curve give the confidence assigned to intervals)
Old:
Posterior densities with matching prior have a frequentist correspondent: Confidence density.
Confidence density is a statistical concept that provides a continuous representation of confidence intervals for a parameter. Think of it like this - instead of just having one specific interval (like a 95% confidence interval), the confidence density shows you ALL possible confidence intervals at once through a smooth curve.
Confidence density describes how our confidence changes across different parameter values
Describe prediction accuracy estimation
Key Point: The true error rate measures how well your prediction rule generalizes to new, unseen data from the same distribution.
Basic Setup:
You have N training pairs: (x₁,y₁), (x₂,y₂),…,(xₙ,yₙ)
x_i represents features/predictors
y_i represents the true outcome/response
Together they form your training set ‘d’
Prediction Rule:
Based on your training data, you create a prediction rule r_d(x)
This rule takes any new x and produces a prediction ŷ (y-hat)
Example: In linear regression, r_d(x) might be β₀ + β₁x
Error Measurement:
Two main types of discrepancy D(y,ŷ):
Regression: D(y,ŷ) = (y - ŷ)² (squared error)
Classification: D(y,ŷ) = I(y ≠ ŷ) (0/1 loss)
True Error Rate (Err_d):
Assumes data pairs come from probability distribution F
Takes a new independent pair (x₀,y₀) from F
Predicts ŷ₀ = r_d(x₀)
Calculates expected discrepancy: E[D(y₀,r_d(x₀))]
old:
A prediction problem typically begins with a training set d consisting of N pairs (x_i, y_i):
$d = \{(x_i, y_i),\ i = 1, 2, \dots, N\}$
Based on the training set, we make a prediction rule, r_d(x), such that the prediction $\hat y$ is produced for any point x in the predictor’s sample space X.
The inferential task is to assess the accuracy of the rule’s predictions.
To quantify the prediction error of a prediction rule $r_d(x)$, we must specify a discrepancy $D(y, \hat y)$ between a prediction $\hat y$ and the actual response $y$. The two most common choices are squared error for regression, $D(y, \hat y) = (y - \hat y)^2$, and classification error for classification, $D(y, \hat y) = I(y \ne \hat y)$.
For the purpose of error estimation, suppose that pairs $(x_i, y_i)$ in the training set d have been obtained by random sampling from some probability distribution F
The true error rate, $Err_d$ of rule r_d(x) is the expected discrepancy of $\hat y_0 = r_d(x_0)$ from $y_0$ given a new pair $(x_0, y_0)$ drawn from F independently of d.
$\mathrm{Err}_d = E_F\left\{ D\big(y_0, r_d(x_0)\big) \right\}$
The training set $d$ (and hence the rule $r_d$) is held fixed in this expectation; only $(x_0, y_0)$ varies.
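A minimal simulation sketch of this setup (the distribution F, the linear rule and the sample sizes are all assumptions of the illustration): the true error rate $\mathrm{Err}_d$ is approximated by averaging the squared-error discrepancy over a large fresh sample from F, with the training set held fixed:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_F(n):
    """Assumed data-generating distribution F: y = 2x + Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    return x, 2 * x + rng.normal(scale=0.5, size=n)

# Training set d of N pairs, and the prediction rule r_d(x): a least-squares line
x_train, y_train = sample_F(30)
b1, b0 = np.polyfit(x_train, y_train, deg=1)
r_d = lambda x: b0 + b1 * x

# Approximate Err_d = E_F[D(y0, r_d(x0))] with a large fresh sample from F,
# holding d (and hence r_d) fixed; compare with the apparent error on d itself
x0, y0 = sample_F(100_000)
print(np.mean((y0 - r_d(x0)) ** 2))            # true error rate (squared-error discrepancy)
print(np.mean((y_train - r_d(x_train)) ** 2))  # apparent error (usually smaller)
```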
Describe cross-validation and the model-based estimators for prediction error
Differences:
- Cross-Validation: Actually splits and retests on unseen data
- Model-Based: Uses training data once + mathematical adjustment (this would be Mallows C_p and Akaike information criterion (AIC))
How cross validation works:
- Cross-Validation: A non-parametric method to estimate prediction error by partitioning the data into subsets.
- Leave-One-Out Cross-Validation (LOOCV): Remove one data point at a time, train the model on the rest, and test it on the excluded point. This is repeated for all points.
- K-Fold Cross-Validation: Partition the data into $K$ groups, train the model on $K-1$ folds, and test it on the remaining fold. This is repeated $K$ times.
Besides reducing the number of rule constructions needed (fewer rounds of computation), K-fold induces larger changes among the different training sets, which helps when assessing the predictive performance of the rules $r_d(x)$.
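A minimal sketch of K-fold cross-validation for a least-squares line, using only NumPy (the data and the rule are simulated/assumed for illustration; setting K equal to the sample size gives leave-one-out CV):

```python
import numpy as np

def kfold_cv_error(x, y, K=5, seed=0):
    """K-fold CV estimate of squared-error prediction error for a least-squares line."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        b1, b0 = np.polyfit(x[train], y[train], deg=1)   # refit the rule without fold k
        errors.append(np.mean((y[test] - (b0 + b1 * x[test])) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 60)
y = 2 * x + rng.normal(scale=0.5, size=60)
print(kfold_cv_error(x, y, K=5))    # K = len(x) would give leave-one-out CV
```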
Old:
The ideal remedy would be to have an independent validation set (or test set).
However, a large, independent validation set is not always available.
Cross-validation attempts to mimic $Err_{val}$ without the need for a large validation set.
Model-based
A narrower (but more efficient) model-based approach emerged in the form of Mallows’ Cp estimate and the Akaike information criterion (AIC).
Describe: apparent error, overall prediction error, and covariance penalty
Apparent error:
The apparent error is the average discrepancy between the true response $y_i$ and the predicted response $\hat y_i$ on the training data, and is given by:
$\mathrm{err} = \dfrac{1}{N} \sum_{i=1}^{N} D(y_i, \hat y_i)$
On average, the apparent error $\mathrm{err}_i$ underestimates the true prediction error $\mathrm{Err}_i$ by the covariance penalty (which makes sense, since the covariance $\mathrm{cov}(\hat\mu_i, y_i)$ measures the amount by which $y_i$ influences its own prediction $\hat\mu_i$).
Overall prediction error:
Overall prediction error measures the average discrepancy between the predicted values, $\hat y_i$, and the true values, $y_i$, across the entire population or distribution of data points, not just the training data.
The overall prediction error is the average:
$\mathrm{Err} = \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{Err}_i$
where $\mathrm{Err}_i$ is the true prediction error for case $i$, i.e. the expected discrepancy between a new response at $x_i$ and its prediction $\hat\mu_i$.
Covariance penalty:
The covariance penalty penalises model complexity and the error variance (how similar the errors we make are). We can use the covariance for prediction error estimation.
Covariance penalties are parametric: they require probability models, and they are less noisy than cross-validation.
We can evaluate the prediction accuracy using the apparent error or the overall prediction error
The covariance penalty approach treats the prediction error estimation in a regression framework.
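As an illustration of the covariance-penalty idea, here is a minimal sketch that estimates each $\mathrm{cov}(\hat\mu_i, y_i)$ by a parametric bootstrap around an OLS fit (the design, noise level and bootstrap size are all assumptions of the illustration); the penalised estimate adds $\tfrac{2}{N}\sum_i \mathrm{cov}(\hat\mu_i, y_i)$ to the apparent error:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed linear-model setup: y = X beta + Gaussian noise
N, p, sigma = 50, 4, 1.0
X = rng.normal(size=(N, p))
beta = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta + rng.normal(scale=sigma, size=N)

def fit_predict(X, y):
    """OLS fitted values mu_hat = X (X'X)^{-1} X' y."""
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ bhat

mu_hat = fit_predict(X, y)
apparent_err = np.mean((y - mu_hat) ** 2)

# Parametric-bootstrap estimate of cov(mu_hat_i, y_i): simulate new responses
# around the fitted values and track how each y_i* moves its own prediction.
B = 2000
Y_star = mu_hat[:, None] + rng.normal(scale=sigma, size=(N, B))
Mu_star = np.column_stack([fit_predict(X, Y_star[:, b]) for b in range(B)])
cov_i = np.mean((Y_star - Y_star.mean(axis=1, keepdims=True)) *
                (Mu_star - Mu_star.mean(axis=1, keepdims=True)), axis=1)

err_estimate = apparent_err + 2 * cov_i.sum() / N
print(apparent_err, err_estimate)   # for OLS the penalty is roughly 2 * sigma^2 * p / N
```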
Describe: Mallows’ Cp, SURE, and Akaike information criterion
Common Foundation
All three criteria (Mallows Cp, SURE, and AIC) are methods to help us select the best model while balancing model complexity against fit. They’re all trying to estimate prediction error.
Mallows Cp:
- Specifically designed for linear regression
- Interpretation: Lower Cp values indicate better models
- Key insight: It’s an estimate of the standardized prediction error
SURE (Stein’s Unbiased Risk Estimate):
- More general than Mallows Cp
- Can be applied to any estimator, not just linear regression
- Key insight: Provides unbiased estimate of mean squared error without knowing the true parameter
AIC (Akaike Information Criterion):
- Most general of the three
- Works for any likelihood-based model
- Key insight: Approximates the Kullback-Leibler divergence between the model and the true distribution
When to Use Each:
- Mallows’ Cp: Linear regression with normal errors
- SURE: Non-linear models where you can compute derivatives
- AIC: Any model where you can compute the likelihood
Describe model-based estimators for prediction error (Mallows’ Cp)
Mallows’ Cp:
The Cp statistic is defined as a criterion for assessing fits when models with different numbers of parameters are being compared.
If the model is correct, then the Cp will tend to be close to or smaller than p. Therefore, a simple plot of Cp versus p can be used to decide amongst models.
Strengths:
✅ Simple to calculate for linear regression models.
✅ Intuitively balances fit and complexity.
Weaknesses:
❌ Requires a good estimate of $\sigma^2$.
❌ Limited to linear regression; not easily generalized.
(AIC, by contrast, is used for model selection across a wider variety of statistical models, especially likelihood-based ones: it balances model fit via the likelihood against a penalty for model complexity.)
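For Cp itself, a minimal sketch comparing polynomial fits with the classical form $C_p = \mathrm{RSS}/\hat\sigma^2 - N + 2p$ (the data-generating model and the use of the largest candidate model to estimate $\sigma^2$ are assumptions of this illustration):

```python
import numpy as np

def mallows_cp(X, y, sigma2_full):
    """Classical Mallows' Cp = RSS/sigma2 - N + 2p; close to p when the model is adequate."""
    N, p = X.shape
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ bhat) ** 2)
    return rss / sigma2_full - N + 2 * p

rng = np.random.default_rng(6)
N = 100
x = rng.uniform(-2, 2, N)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=N)   # true model: quadratic

# sigma^2 estimated from the largest candidate model (here a cubic fit)
X_full = np.column_stack([x**k for k in range(4)])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sigma2 = np.sum((y - X_full @ b_full) ** 2) / (N - X_full.shape[1])

for degree in range(4):
    X_d = np.column_stack([x**k for k in range(degree + 1)])
    print(degree, mallows_cp(X_d, y, sigma2))   # Cp near p (= degree + 1) once degree >= 2
```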
Describe model-based estimators for prediction error (SURE)
SURE:
It estimates the prediction error of models under the assumption of Gaussian noise. It provides an unbiased estimate of the true prediction error by correcting the apparent error.
You adjust the squared error by adding a penalty proportional to the sensitivity of the model’s predictions to the data.
Strengths:
✅ Works well for Gaussian noise models.
✅ Provides an unbiased estimate of prediction error.
Weaknesses:
❌ Assumes Gaussian noise, limiting its generalizability.
❌ Trace calculation can be computationally expensive in complex models
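A minimal sketch of SURE for a linear smoother $\hat y = Sy$ with known Gaussian noise level (the ridge smoothers and the simulated data are assumptions of the illustration); for a linear smoother the sensitivity/divergence term reduces to $\mathrm{tr}(S)$:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, sigma = 80, 10, 1.0
X = rng.normal(size=(N, p))
beta = rng.normal(size=p) * 0.5
mu = X @ beta                                   # true means (unknown in practice)
y = mu + rng.normal(scale=sigma, size=N)

def sure_risk(y, S, sigma):
    """SURE for a linear smoother y_hat = S y with Gaussian noise of known sigma:
       ||y - y_hat||^2 - N*sigma^2 + 2*sigma^2*tr(S), unbiased for E||y_hat - mu||^2."""
    y_hat = S @ y
    return np.sum((y - y_hat) ** 2) - len(y) * sigma**2 + 2 * sigma**2 * np.trace(S)

# Compare ridge smoothers with different penalties lambda
for lam in [0.0, 1.0, 10.0, 100.0]:
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    print(lam, sure_risk(y, S, sigma), np.sum((S @ y - mu) ** 2))  # SURE vs. realised loss
```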
Describe model-based estimators for prediction error (AIC)
AIC:
A lower AIC value indicates a better balance between model fit and complexity. AIC penalises overfitting by adding $2p$ to $-2 \log(\text{likelihood})$, i.e. $\mathrm{AIC} = -2 \log \hat{L} + 2p$.
Strengths:
✅ Applicable to a wide range of models (not just linear or Gaussian).
✅ Suitable for model comparison and selection.
Weaknesses:
❌ Assumes the correct model is among the candidates.
❌ Sensitive to sample size; may favor overly complex models in small datasets.
Describe application of Akaike Information criterion for model selection
- Purpose: Choose the model that minimises AIC, balancing goodness-of-fit and model complexity.
- Interpretation: Lower AIC indicates a better model, though in small samples the AIC-minimising model can still be overly complex.
- Trade-Off: AIC avoids the pitfalls of overfitting by penalising unnecessary complexity, making it useful for selecting models that generalise well to new data.
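As a sketch of AIC-based selection (assuming Gaussian linear models fitted by least squares, with simulated data; the candidate set of polynomial degrees is an assumption of the illustration), we compute AIC for each candidate and pick the minimiser:

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC = 2k - 2 * max log-likelihood for a Gaussian linear model.
       k counts the regression coefficients plus the noise variance."""
    N, p = X.shape
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ bhat) ** 2)
    sigma2_mle = rss / N
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2_mle) + 1)
    return 2 * (p + 1) - 2 * loglik

rng = np.random.default_rng(8)
N = 120
x = rng.uniform(-2, 2, N)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=N)   # true model: quadratic

for degree in range(1, 6):
    X_d = np.column_stack([x**k for k in range(degree + 1)])
    print(degree, gaussian_aic(X_d, y))   # the minimum should typically land near degree 2
```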