14. Bootstrapping Flashcards
What is bootstrapping?
Process of resampling with replacement from the original data to generate multiple resamples, each of the same size n as the original data
What are assumption violations?
Violations of a linear model's assumptions can distort its estimates or inferences, which makes it difficult to draw conclusions from the model
What are two possible explanations behind assumption violations?
Model misspecification - assumptions are violated because the model is not correct
- Failed to include interaction
- Failed to include a non-linear (higher-order) effect
Non-linear transformations of outcome and/or predictors
- Often related to non-normal residuals and non-linearity
- Can be helped by transformation of predictors and outcomes
What is the generalised linear model used for?
When outcomes are not continuous or normally distributed, not because of an error in measurement, but because they would not be expected to be
e.g. binary variables
What solves the issue of poor inferences that violated assumptions can lead to?
Bootstrapped inference - creates a more reliable building block for inferences
What is a ‘good sample’?
If a sample of n is drawn at random, it will be unbiased and representative of the whole population
Point estimates from these samples will be good estimates of the population parameter (a value that describes the entire population)
What is a sampling distribution?
Take a sample of size n from the population and calculate an estimate of the population parameter
Doing this repeatedly creates sampling distribution
Mean of sampling distribution = Good approximation of population parameter
To quantify sampling variation, refer to the SD of the sampling distribution (which is the SE)
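A minimal sketch of the idea above, assuming a made-up normal population (mean 50, SD 10) and a sample size of 30; all names and numbers here are illustrative, not from the course material. It draws many samples, records each sample mean, and compares the mean and SD of those means with the population.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values from a normal distribution (mean 50, SD 10).
population = [random.gauss(50, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

n = 30              # size of each sample
num_samples = 2000  # how many samples we draw

# Draw many random samples from the population and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, n))
    for _ in range(num_samples)
]

# The collection of sample means approximates the sampling distribution of the mean.
sampling_dist_mean = statistics.mean(sample_means)
standard_error = statistics.stdev(sample_means)  # SD of the sampling distribution

print(f"population mean:               {population_mean:.2f}")
print(f"mean of sampling distribution: {sampling_dist_mean:.2f}")
print(f"SD of sampling distribution:   {standard_error:.2f}")
```

Under these assumptions the SD of the sample means should land near 10 / sqrt(30), which is the standard error of the mean for this population.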
What are the two possible solutions to getting enough sense of the variability in sample estimates when collecting samples from a population? (explain both processes)
Theoretical solution
- Collect one sample
- Estimate the standard error using the formula
Bootstrap solution
- Collect one sample
- Mimic the act of repeated sampling from the population by repeated resampling with replacement from the original sample
- Estimate the standard error using the standard deviation of the distribution of resample statistics
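The two solutions can be compared side by side. This is a sketch only: it assumes the statistic is the sample mean, for which the usual formula is SE = s / sqrt(n), and it uses one made-up sample of 50 values. The variable names are hypothetical.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Collect one (hypothetical) sample of n = 50 observations.
sample = [random.gauss(100, 15) for _ in range(50)]
n = len(sample)

# Theoretical solution: standard error of the mean from the formula s / sqrt(n).
formula_se = statistics.stdev(sample) / math.sqrt(n)

# Bootstrap solution: resample with replacement from the original sample,
# keeping each resample the same size n, and compute the mean each time.
num_resamples = 5000
resample_means = [
    statistics.mean(random.choices(sample, k=n))
    for _ in range(num_resamples)
]

# The bootstrap SE is the SD of the resample statistics.
bootstrap_se = statistics.stdev(resample_means)

print(f"formula SE:   {formula_se:.3f}")
print(f"bootstrap SE: {bootstrap_se:.3f}")
```

For the mean the two values should be close. The bootstrap route is mainly useful when no simple formula exists or when its assumptions are in doubt.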
What is a bootstrap distribution?
The distribution of the statistic calculated on each bootstrap resample
How do you get a bootstrap distribution?
Start with an initial sample of size n.
Take k resamples (sampling with replacement) of size n, and calculate your statistic on each one.
As k→∞, the distribution of the k resample statistics begins to approximate the sampling distribution.
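The three steps above can be sketched as a small helper. The function name `bootstrap_distribution`, its parameters, and the ten data values are all made up for illustration; the median is used only to show that any statistic can be plugged in.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, num_resamples, rng):
    """Return the statistic calculated on each of num_resamples resamples."""
    n = len(sample)
    return [
        # rng.choices samples WITH replacement; k=n keeps the resample size at n.
        statistic(rng.choices(sample, k=n))
        for _ in range(num_resamples)
    ]

rng = random.Random(3)  # seeded generator so the run is repeatable

# Step 1: an initial sample of size n (made-up data).
sample = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.5, 9.1, 12.3]

# Step 2: take k = 2000 resamples and calculate the median on each one.
boot_medians = bootstrap_distribution(sample, statistics.median, 2000, rng)

# The SD of the bootstrap distribution is the bootstrap standard error.
bootstrap_se = statistics.stdev(boot_medians)

print(f"number of resample statistics: {len(boot_medians)}")
print(f"bootstrap SE of the median:    {bootstrap_se:.3f}")
```

Increasing the number of resamples makes the bootstrap distribution smoother, but it cannot add information beyond what the original sample contains.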
What size should each bootstrap sample be?
The same as the original n
What is bootstrap standard error?
Bootstrap SE = SD of bootstrap distribution
What is a confidence interval?
Defines a plausible range for the population parameter
To estimate it, we need…
- A confidence level
- A measure of sampling variability (e.g. SE/bootstrap SE)
What is an [x]% confidence interval?
Across repeated samples, [x]% confidence intervals would be expected to contain the true population parameter value.
So out of 100 samples, about 95 of the resulting 95% intervals would be expected to contain the true population mean
This is subtly different from saying that we are 95% confident that the true mean is inside our interval. The 95% probability is related to the long-run frequencies of our intervals.
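The long-run reading can be checked with a small simulation. This is a sketch under assumed values: a normal population with a known mean of 10 and SD of 2, samples of 100 observations, and intervals built as mean ± 1.96 × SE. None of these numbers come from the course material.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable

true_mean, true_sd = 10.0, 2.0  # hypothetical population parameters
n = 100                         # observations per sample
num_samples = 1000              # number of repeated samples

covered = 0
for _ in range(num_samples):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)

    # 95% confidence interval for this particular sample.
    lower, upper = mean - 1.96 * se, mean + 1.96 * se

    # Count the interval if it contains the true population mean.
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / num_samples
print(f"proportion of intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not. The 95% describes how often the procedure succeeds across the repeated samples.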
How do you calculate confidence intervals?
68/95/99.7 Rule:
Sampling distributions become approximately normal as sample size increases, so we can use the fixed properties of normal distributions
- 68% of density falls within 1 SD of mean
- 95% of density falls within 1.96 SD of mean
- 99.7% of density falls within 3 SD of mean
For 95%:
- Lower bound = mean - 1.96*SE
- Upper bound = mean + 1.96*SE
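The normal-distribution figures and the 95% bounds can be verified with Python's standard library. In this sketch the estimate of 25.0 and the SE of 2.0 are made-up numbers, and `density_within` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

def density_within(z):
    """Proportion of a normal distribution within z SDs of its mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

within_1 = density_within(1)       # about 0.68
within_196 = density_within(1.96)  # about 0.95
within_3 = density_within(3)       # about 0.997

# The 1.96 multiplier is the point that leaves 2.5% in the upper tail.
z_95 = std_normal.inv_cdf(0.975)

# 95% confidence interval for a hypothetical estimate and standard error.
estimate, se = 25.0, 2.0
lower = estimate - 1.96 * se
upper = estimate + 1.96 * se

print(f"within 1 SD:    {within_1:.4f}")
print(f"within 1.96 SD: {within_196:.4f}")
print(f"within 3 SD:    {within_3:.4f}")
print(f"z for 95%:      {z_95:.2f}")
print(f"95% CI:         [{lower:.2f}, {upper:.2f}]")
```

The same bounds apply if the SE is replaced by a bootstrap SE, provided the bootstrap distribution looks roughly normal.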