Topic 5: Confidence intervals and cross-validation Flashcards
What are confidence intervals
Confidence intervals are the frequentist approach to interval estimation and are by far the most popular in applied practice. A 95% confidence interval means that if we were to repeat the sampling process many times, 95% of the resulting intervals would contain the true parameter. It does not assign a probability to the parameter itself, which is fixed.
The standard confidence interval is described as:
$\hat θ \pm 1.96\,\hat{se}$, where $\hat θ$ is the estimate and $\hat{se}$ is its estimated standard error.
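The repeated-sampling interpretation can be checked by simulation. The sketch below is only an illustration under assumed settings (normal data, true mean 5, sample size 50, all hypothetical): it builds the standard interval $\hat θ \pm 1.96\,\hat{se}$ for a sample mean many times and counts how often the interval contains the true mean.

```python
import math
import random
import statistics

def standard_interval(sample, z=1.96):
    """Return (lower, upper) for the mean: theta_hat +/- z * se_hat."""
    n = len(sample)
    theta_hat = statistics.fmean(sample)
    se_hat = statistics.stdev(sample) / math.sqrt(n)
    return theta_hat - z * se_hat, theta_hat + z * se_hat

def coverage(true_mean=5.0, sigma=2.0, n=50, repeats=2000, seed=0):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = standard_interval(sample)
        hits += lo <= true_mean <= hi  # True counts as 1
    return hits / repeats

print(coverage())  # should land near 0.95
```

The fraction is a property of the procedure over many samples; any single interval either contains the true mean or it does not.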
What is Neyman’s construction
If we have a parameter θ and want a confidence interval for it based on an estimate $\hat θ$, we can use Neyman’s construction. It requires a known one-parameter family of densities $f_θ(\hat θ)$ for the estimator (for example, a normal model).
The construction uses the PDF of the estimator $\hat θ$, not of the parameter. The confidence interval bounds $θ_{lo}$ and $θ_{up}$ are found by solving two integral equations:
$\int_{\hat θ}^{\infty} f_{θ_{lo}}(r)\,dr = 0.025$ and $\int_{-\infty}^{\hat θ} f_{θ_{up}}(r)\,dr = 0.025$
The first equation sets the right-tail area beyond the observed $\hat θ$ to 0.025 when the parameter equals the lower bound; the second sets the left-tail area to 0.025 when the parameter equals the upper bound. Solving them gives the lower and upper bounds of the 95% confidence interval.
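As a minimal sketch of solving the two tail-area equations numerically, the example below assumes a hypothetical model in which the estimator follows an exponential distribution with mean θ. That family is chosen only because its tail area has a closed form; the helper names are illustrative.

```python
import math

def upper_tail(theta, x):
    """P_theta(X >= x) for X ~ Exponential with mean theta (assumed model)."""
    return math.exp(-x / theta)

def solve_increasing(g, target, lo, hi, iters=200):
    """Bisection for g(theta) = target, where g is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def neyman_interval(x_obs, alpha=0.05):
    # Lower bound: right-tail area beyond x_obs equals alpha / 2.
    theta_lo = solve_increasing(lambda t: upper_tail(t, x_obs), alpha / 2, 1e-9, 1e9)
    # Upper bound: left-tail area below x_obs equals alpha / 2,
    # which is the same as a right-tail area of 1 - alpha / 2.
    theta_up = solve_increasing(lambda t: upper_tail(t, x_obs), 1 - alpha / 2, 1e-9, 1e9)
    return theta_lo, theta_up

print(neyman_interval(2.0))
```

For this family the bounds can also be written directly as `x_obs / -log(0.025)` and `x_obs / -log(0.975)`, which gives a check on the bisection.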
What is the percentile method
The goal is to automate the calculation of confidence intervals: given the bootstrap distribution of a statistical estimator $\hat θ$, we want to automatically produce an appropriate confidence interval for the unseen parameter θ (the true value, which is fixed but unknown). One of four methods for doing this is the percentile method (a bootstrap confidence interval).
It uses the shape of the bootstrap distribution to improve upon the standard interval $\hat θ \pm 1.96\,\hat{se}$.
Having generated B bootstrap replications $\hat θ^{*1}, \hat θ^{*2}, …, \hat θ^{*B}$, either nonparametrically or parametrically, we use the percentiles of their distribution to define the percentile confidence limits. For a 95% interval these are the 0.025 and 0.975 percentiles.
The method is also transformation invariant: the percentile interval for a monotone transformation of θ is the same transformation applied to the percentile interval for θ.
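A minimal sketch of the nonparametric percentile method follows. The data values and the nearest-rank quantile rule are assumptions made for illustration. The second interval reuses the same resamples with a monotone transformation (`exp`) to show the transformation invariance.

```python
import math
import random
import statistics

def bootstrap_replications(data, statistic, B=2000, seed=0):
    """Nonparametric bootstrap: resample the data with replacement B times."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(B)]

def percentile_interval(replications, alpha=0.05):
    """Percentile limits: the alpha/2 and 1 - alpha/2 quantiles (nearest rank)."""
    reps = sorted(replications)
    last = len(reps) - 1
    lower = reps[round((alpha / 2) * last)]
    upper = reps[round((1 - alpha / 2) * last)]
    return lower, upper

# Hypothetical sample, used only for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9, 2.5, 3.9]

mean_interval = percentile_interval(bootstrap_replications(data, statistics.fmean))
exp_interval = percentile_interval(
    bootstrap_replications(data, lambda s: math.exp(statistics.fmean(s)))
)
print(mean_interval)
print(exp_interval)  # exp() of the limits above, since the seed is the same
```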
Explain BC and BCa methods
If the bootstrap replications are not unbiased with constant variance, we can still remove the bias and take the changing variance into account.
BC and BCa properties:
Imagine we have B bootstrap replications $\hat θ^{*1}, \hat θ^{*2}, …, \hat θ^{*B}$. Each one is a replication of our statistic $\hat θ$.
$p_0$ is the proportion of bootstrap replications less than or equal to $\hat θ$:
$p_0 = \#\{\hat θ^{*b} \le \hat θ\} / B$
The bias-correction value is:
$z_0 = \Phi^{-1}(p_0)$, where $\Phi$ is the standard normal CDF.
BC properties:
- Assumption of constant standard error
- Correct probabilities in the Fisherian sense.
- Note that there is no additional computational burden.
BCa properties:
BC is not second-order accurate because it assumes a constant standard error. To combat that, BCa uses another factor: the acceleration constant $a$.
$a$ (the acceleration) can be estimated with the jackknife:
- Why Jackknife for Estimating ‘a’:
The jackknife is particularly good for this because:
- It provides a way to estimate the rate of change in the statistic when systematically removing one observation at a time
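The following is a sketch of how BC and BCa limits could be computed, not a definitive implementation. It assumes the commonly used endpoint formula, in which the percentile level α is replaced by $\Phi\left(z_0 + \frac{z_0 + z^{(\alpha)}}{1 - a(z_0 + z^{(\alpha)})}\right)$, and the jackknife estimate of $a$ based on cubed and squared deviations of the leave-one-out values. The sample and function names are hypothetical. Setting $a = 0$ gives the BC interval.

```python
import random
import statistics
from statistics import NormalDist

STD_NORMAL = NormalDist()

def bca_interval(data, statistic, B=2000, alpha=0.05, seed=0, accelerate=True):
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    reps = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))

    # Bias correction: z0 = Phi^{-1}(p0), p0 = share of replications <= theta_hat.
    p0 = sum(r <= theta_hat for r in reps) / B
    p0 = min(max(p0, 1 / (B + 1)), B / (B + 1))  # keep inv_cdf finite
    z0 = STD_NORMAL.inv_cdf(p0)

    # Acceleration from leave-one-out (jackknife) values; a = 0 gives BC.
    a = 0.0
    if accelerate:
        jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
        center = statistics.fmean(jack)
        num = sum((center - j) ** 3 for j in jack)
        den = 6.0 * sum((center - j) ** 2 for j in jack) ** 1.5
        a = num / den if den > 0 else 0.0

    def endpoint(level):
        z = STD_NORMAL.inv_cdf(level)
        adjusted = STD_NORMAL.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        idx = min(B - 1, max(0, round(adjusted * (B - 1))))
        return reps[idx]

    return endpoint(alpha / 2), endpoint(1 - alpha / 2)

# Hypothetical right-skewed sample, used only for illustration.
data = [0.3, 0.5, 0.8, 0.9, 1.1, 1.2, 1.4, 1.6, 2.0, 2.2, 2.4, 3.1, 3.9, 5.2, 7.5]
print(bca_interval(data, statistics.fmean))                    # BCa
print(bca_interval(data, statistics.fmean, accelerate=False))  # BC
```

The only extra cost of BCa over BC in this sketch is the n leave-one-out evaluations of the statistic.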
Explain the differences between Neyman’s construction, percentile method, as well as BC and BCa methods
Neyman’s Construction: When you have a well-specified parametric model (e.g., normal distribution).
Percentile Method: When the population distribution is unknown but you need a quick, simple solution.
BC Method: When bias is suspected, but variance seems stable.
BCa Method: When both bias and variance issues exist, and you want the most accurate confidence intervals.
Key Difference:
Neyman is parametric and theoretical.
Percentile is non-parametric and empirical.
BC adjusts bias; BCa adjusts bias and variance.
Applications of confidence intervals
- Estimate population parameters (e.g., mean, proportion).
- Test hypotheses (e.g., rejecting null hypothesis).
- Inform decisions in business and policy.
- Evaluate and compare statistical models.
- Assess risk in finance and forecasting.
- Communicate uncertainty in scientific and industrial studies.
Key Idea: Confidence intervals provide a range of plausible values for an unknown parameter, quantifying uncertainty based on sampling behavior.
Describe Bayes’ credibility intervals
The Bayes’ credibility intervals represent the central region of the posterior distribution where the parameter lies with a given probability (e.g. 95%) based on both the data and the prior distribution.
Given a one-parameter family of densities $f_θ(\hat θ)$ and a prior density $g(θ)$, Bayes’ rule produces this posterior density of $\theta$:
$g(θ|\hat θ) = g(θ) f_θ(\hat θ) / f(\hat θ)$, where $f(\hat θ) = \int g(θ) f_θ(\hat θ)\,dθ$ is the marginal density of $\hat θ$.
The Bayes’ 0.95 credible interval $C(θ|\hat θ)$ spans the central 0.95 region of the posterior density, $g(θ|\hat θ)$:
$\int_{a}^{b} g(θ|\hat θ)\,dθ = 0.95$, with $C(θ|\hat θ) = (a, b)$
The endpoints $a$ and $b$ are chosen so that the remaining posterior probability is divided evenly between the tails, e.g. 0.025 in each tail region.
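A grid-based sketch of the central credible interval is shown below. It assumes, purely for illustration, a normal model $\hat θ \sim N(θ, 1)$ with a flat prior; the grid range and step are arbitrary choices. In this particular setup the flat prior is a matching prior, so the result should agree with the standard interval $\hat θ \pm 1.96$.

```python
import math

def central_credible_interval(theta_grid, prior, likelihood, theta_hat, mass=0.95):
    """Central credible interval from a posterior evaluated on an even grid."""
    weights = [prior(t) * likelihood(theta_hat, t) for t in theta_grid]
    total = sum(weights)  # plays the role of the marginal density f(theta_hat)
    tail = (1 - mass) / 2
    cumulative = 0.0
    lower = upper = None
    for t, w in zip(theta_grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = t
        if upper is None and cumulative >= 1 - tail:
            upper = t
    return lower, upper

def normal_likelihood(theta_hat, theta):
    """f_theta(theta_hat) for theta_hat ~ N(theta, 1) (assumed model)."""
    return math.exp(-0.5 * (theta_hat - theta) ** 2) / math.sqrt(2 * math.pi)

def flat_prior(theta):
    return 1.0

grid = [-9.0 + 0.001 * i for i in range(20001)]  # theta from -9 to 11
lower, upper = central_credible_interval(grid, flat_prior, normal_likelihood, theta_hat=1.0)
print(lower, upper)  # close to 1 - 1.96 and 1 + 1.96
```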
We are interested in matching priors: Bayesian priors chosen so that the credible intervals approximate the Neyman confidence intervals and therefore have the frequentist coverage probability. We want the frequentist statement “if the experiment were repeated many times, 95% of the intervals would contain θ”, not only the Bayesian statement “θ lies in this interval with probability 95%”.
An example is the Jeffreys prior, which provides a generally accurate matching prior for one-parameter problems.
Multiparameter families are more difficult, because the nuisance parameters must be removed.
Describe confidence density
Posterior densities based on a matching prior have a frequentist counterpart: the confidence density.
Confidence density is a statistical concept that provides a continuous representation of confidence intervals for a parameter. Think of it like this: instead of just having one specific interval (like a 95% confidence interval), the confidence density shows you all possible confidence intervals at once through a smooth curve.
Confidence density describes how our confidence changes across different parameter values.
Describe prediction accuracy estimation
A prediction problem typically begins with a training set $d$ consisting of N pairs $(x_i, y_i)$:
$d = \{(x_i, y_i),\ i = 1, 2, …, N\}$
Based on the training set, we construct a prediction rule $r_d(x)$ that produces a prediction $\hat y = r_d(x)$ for any point $x$ in the predictor’s sample space $X$.
The inferential task is to assess the accuracy of the rule’s predictions.
To quantify the prediction error of a prediction rule $r_d(x)$, we must specify the discrepancy $D(y, \hat y)$ between a prediction $\hat y$ and the actual response $y$. The two most common choices are squared error for regression, $D(y, \hat y) = (y - \hat y)^2$, and classification error for classification, $D(y, \hat y) = 1$ if $y \ne \hat y$ and $0$ otherwise.
For the purpose of error estimation, suppose that the pairs $(x_i, y_i)$ in the training set $d$ have been obtained by random sampling from some probability distribution F.
The true error rate $Err_d$ of the rule $r_d(x)$ is the expected discrepancy of $\hat y_0 = r_d(x_0)$ from $y_0$, given a new pair $(x_0, y_0)$ drawn from F independently of $d$:
$Err_d = E_F\{D(y_0, \hat y_0)\}$
The training set $d$ is held fixed in the expectation; only $(x_0, y_0)$ varies.
Describe cross-validation and the model-based estimators for prediction error
Error measured on the training data itself tends to be too optimistic. The ideal remedy would be to have an independent validation set (or test set).
However, a large, independent validation set is not always available.
Cross-validation attempts to mimic $Err_{val}$, the error measured on a validation set, without the need for a large validation set.
Model-based
A narrower (but more efficient) model-based approach is the second option; it emerged in the form of Mallows’ Cp estimate and the Akaike information criterion (AIC).
Differences:
- Cross-Validation: Actually splits and retests on unseen data
- Model-Based: Uses training data once + mathematical adjustment
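The "splits and retests" idea can be sketched with K-fold cross-validation. The example assumes a simple least-squares line as the prediction rule, squared error as the discrepancy, and simulated data; all of these are illustrative choices.

```python
import random

def fit_line(xs, ys):
    """Least-squares line; returns the prediction rule r_d(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def apparent_error(xs, ys, fit=fit_line, D=squared_error):
    """Average discrepancy on the same data that was used for fitting."""
    rule = fit(xs, ys)
    return sum(D(y, rule(x)) for x, y in zip(xs, ys)) / len(xs)

def cv_error(xs, ys, K=5, fit=fit_line, D=squared_error, seed=0):
    """K-fold CV: each point is predicted by a rule fitted without its fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        rule = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum(D(ys[i], rule(xs[i])) for i in held_out)
    return total / len(xs)

# Simulated data, for illustration only: y = 1 + 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1.5) for x in xs]
print(apparent_error(xs, ys), cv_error(xs, ys))
```

For a least-squares fit the cross-validated error is never smaller than the apparent error, which illustrates the optimism of training-set error.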
Describe: apparent error, overall prediction error, and covariance penalty
Apparent error:
The apparent error is the average discrepancy between the true response $y_i$ and the predicted response $\hat y_i$ on the training data:
$err = \frac{1}{N} \sum_{i=1}^{N} D(y_i, \hat y_i)$
On average, and for squared error, the apparent error $err_i$ at point $i$ underestimates the true prediction error $Err_i$ by the covariance penalty $2\,cov(\hat \mu_i, y_i)$. This makes sense, since the covariance $cov(\hat \mu_i, y_i)$ measures the amount by which $y_i$ influences its own prediction $\hat \mu_i$.
Overall prediction error:
Overall prediction error measures the average discrepancy between the predicted values, $\hat y_i$, and the true values, $y_i$, across the entire population or distribution of data points, not just the training data.
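The overall prediction error is the average of the pointwise errors $Err_i$:
$Err_\cdot = \frac{1}{N} \sum_{i=1}^{N} Err_i$
It summarizes the prediction error of the rule in a single number.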
Covariance penalty:
Covariance penalties can be used for prediction error estimation. They are parametric: they require a probability model, and in return they are less noisy than cross-validation.
We can evaluate the prediction accuracy using the apparent error or the overall prediction error
The covariance penalty approach treats the prediction error estimation in a regression framework.
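As a worked form of the covariance penalty idea, the block below writes out the squared-error identity and its special case for a linear rule. The symbols $\sigma^2$ (noise variance) and $M$ (the matrix of a linear rule $\hat\mu = My$) are assumptions introduced here for illustration, as is the assumption of uncorrelated noise.

```latex
\begin{align*}
E\{Err_i\} &= E\{err_i\} + 2\,\mathrm{cov}(\hat\mu_i, y_i),
  \qquad err_i = (y_i - \hat\mu_i)^2 \\
\widehat{Err}_i &= err_i + 2\,\widehat{\mathrm{cov}}(\hat\mu_i, y_i) \\
\hat\mu = M y \;&\Rightarrow\; \mathrm{cov}(\hat\mu_i, y_i) = \sigma^2 M_{ii} \\
\frac{1}{N}\sum_{i=1}^{N} \widehat{Err}_i
  &= \frac{1}{N}\sum_{i=1}^{N} err_i + \frac{2\sigma^2}{N}\,\mathrm{tr}(M)
\end{align*}
```

For ordinary least squares with $p$ regression parameters, $\mathrm{tr}(M) = p$, so the penalty grows linearly with the number of fitted parameters.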
Describe: Mallows’ Cp, SURE, and Akaike information criterion
Common Foundation
All three criteria (Mallows Cp, SURE, and AIC) are methods to help us select the best model while balancing model complexity against fit. They’re all trying to estimate prediction error.
Mallows Cp:
- Specifically designed for linear regression
- Interpretation: Lower Cp values indicate better models
- Key insight: It’s an estimate of the standardized prediction error
SURE (Stein’s Unbiased Risk Estimate):
- More general than Mallows Cp
- Can be applied to any estimator, not just linear regression
- Key insight: Provides unbiased estimate of mean squared error without knowing the true parameter
AIC (Akaike Information Criterion):
- Most general of the three
- Works for any likelihood-based model
- Key insight: Approximates the Kullback-Leibler divergence between the model and the true distribution
When to Use Each:
- Mallows’ Cp: Linear regression with normal errors
- SURE: Non-linear models where you can compute derivatives
- AIC: Any model where you can compute the likelihood
Describe model-based estimators for prediction error (Mallows)
Mallows’ Cp:
The $C_p$ statistic is defined as a criterion to assess fits when models with different numbers of parameters are being compared.
If the model is correct, then the $C_p$ will tend to be close to or smaller than $p$. Therefore, a simple plot of $C_p$ versus $p$ can be used to decide amongst models.
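For reference, the classical standardized form of the statistic, which is the form for which $C_p \approx p$ under a correct model, can be written as below. The notation is an assumption made here: $SSE_p$ is the residual sum of squares of the candidate model with $p$ parameters, $\hat\sigma^2$ is an estimate of the error variance (commonly taken from the largest model considered), and $N$ is the number of observations.

```latex
\begin{align*}
C_p &= \frac{SSE_p}{\hat\sigma^2} - N + 2p \\
E\{SSE_p\} \approx (N - p)\,\sigma^2
  \;&\Rightarrow\; C_p \approx (N - p) - N + 2p = p
\end{align*}
```

A model that omits important predictors inflates $SSE_p$ and pushes $C_p$ well above $p$.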
Strengths:
✅ Simple to calculate for linear regression models.
✅ Intuitively balances fit and complexity.
Weaknesses:
❌ Requires a good estimate of $\sigma^2$.
❌ Limited to linear regression; not easily generalized.
By contrast, AIC is used for model selection across a variety of statistical models, especially likelihood-based ones. It balances the model fit, measured by the likelihood, against a penalty for model complexity.
Describe model-based estimators for prediction error (SURE)
SURE:
SURE estimates the prediction error for models under the assumption of Gaussian noise. It provides an unbiased estimate of the true prediction error by correcting the apparent error.
The correction adds to the squared error a penalty proportional to the sensitivity of the model’s predictions to the data.
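The correction can be sketched as the apparent error plus $2\sigma^2$ times the average of the derivatives $\partial\hat\mu_i/\partial y_i$. The code below is an illustration only: it assumes a known noise variance `sigma2`, approximates the derivatives by finite differences, and uses a hypothetical shrinkage estimator $\hat\mu = 0.5\,y$.

```python
def sure_prediction_error(y, estimator, sigma2, eps=1e-6):
    """Average of err_i + 2 * sigma2 * d mu_hat_i / d y_i.

    The derivatives are approximated by finite differences, so the
    estimator should be (almost everywhere) differentiable in y.
    """
    mu_hat = estimator(y)
    n = len(y)
    apparent = sum((yi - mi) ** 2 for yi, mi in zip(y, mu_hat)) / n
    divergence = 0.0
    for i in range(n):
        bumped = list(y)
        bumped[i] += eps
        divergence += (estimator(bumped)[i] - mu_hat[i]) / eps
    return apparent + 2.0 * sigma2 * divergence / n

def shrink_half(values):
    """Hypothetical estimator: shrink every observation halfway to zero."""
    return [0.5 * v for v in values]

y = [1.0, 2.0, 3.0, 4.0]
print(sure_prediction_error(y, shrink_half, sigma2=1.0))
```

Here each derivative is 0.5, so the penalty is $2 \cdot 1 \cdot 0.5 = 1$ on top of the apparent error of 1.875.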
Strengths:
✅ Works well for Gaussian noise models.
✅ Provides an unbiased estimate of prediction error.
Weaknesses:
❌ Assumes Gaussian noise, limiting its generalizability.
❌ Trace calculation can be computationally expensive in complex models
Describe model-based estimators for prediction error (AIC)
AIC:
A lower AIC value indicates a better balance between model fit and complexity. With $AIC = -2 \log L + 2p$, where $L$ is the maximized likelihood and $p$ is the number of parameters, the $2p$ term penalises overfitting.
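A small sketch of an AIC comparison follows, assuming Gaussian errors with the noise variance estimated by maximum likelihood and the usual form $AIC = -2 \log L + 2p$. The data are synthetic and the two candidate models (constant versus straight line) are illustrative choices.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given the residuals of a fit."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # MLE of the noise variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_likelihood, p):
    """AIC = -2 log L + 2p; lower is better."""
    return -2.0 * log_likelihood + 2.0 * p

def least_squares_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Synthetic data, for illustration: a linear trend plus a small wiggle.
xs = list(range(20))
ys = [1.0 + 2.0 * x + math.sin(x) for x in xs]

mean_y = sum(ys) / len(ys)
resid_const = [y - mean_y for y in ys]
slope, intercept = least_squares_line(xs, ys)
resid_line = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Parameter counts include the noise variance: 2 for constant, 3 for line.
aic_const = aic(gaussian_log_likelihood(resid_const), p=2)
aic_line = aic(gaussian_log_likelihood(resid_line), p=3)
print(aic_const, aic_line)
```

The line model pays a larger penalty (one extra parameter) but fits far better, so its AIC is lower.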
Strengths:
✅ Applicable to a wide range of models (not just linear or Gaussian).
✅ Suitable for model comparison and selection.
Weaknesses:
❌ Assumes the correct model is among the candidates.
❌ Sensitive to sample size; may favor overly complex models in small datasets.