14. Bootstrapping Flashcards

1
Q

What is bootstrapping?

A

The process of resampling with replacement from the original data to generate multiple resamples, each of the same size n as the original data
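
A minimal sketch of drawing one resample with replacement, using only the Python standard library. The data values, the seed and the helper name `resample` are made up for illustration.

```python
import random

def resample(data, rng):
    """Draw one bootstrap resample: len(data) values, sampled with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(1)                  # fixed seed so the sketch is reproducible
original = [2.1, 3.4, 1.9, 5.0, 4.2]    # made-up data
one_resample = resample(original, rng)

# Same size as the original; because sampling is with replacement,
# some values may appear more than once and others not at all.
print(one_resample)
```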

2
Q

What are assumption violations?

A

Cases where the data do not meet the assumptions of a linear model. They make it difficult to draw conclusions from the model because they can affect its estimates or inferences

3
Q

What are two possible explanations behind assumption violations?

A

Model misspecification: assumptions are violated because the model is not correct

  • Failed to include an interaction
  • Failed to include a non-linear (higher-order) effect

Non-linear transformations of outcome and/or predictors

  • Often related to non-normal residuals and non-linearity
  • Can be helped by transformation of predictors and outcomes
4
Q

What is the generalised linear model used for?

A

For outcomes that are not continuous or normally distributed, not because of measurement error, but because they would not be expected to be

e.g. binary variables

5
Q

What solves the issue of poor inferences that violated assumptions can lead to?

A

Bootstrapped inference - creates a more reliable building block for inferences

6
Q

What is a ‘good sample’?

A

If a sample of size n is drawn at random, it should be unbiased and representative of the whole population

Point estimates from such samples will be good estimates of the population parameter (a value that describes the entire population)

7
Q

What is a sampling distribution?

A

Take a sample of size n from the population and calculate an estimate of the population parameter

Doing this repeatedly creates the sampling distribution of that estimate

Mean of sampling distribution = good approximation of the population parameter

Sampling variation is quantified by the SD of the sampling distribution, which is the standard error (SE)
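
The idea can be sketched as a small simulation. The population here is synthetic (10,000 values drawn from a normal distribution with mean 50 and SD 10), and the sample size and number of repeats are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population, invented for this sketch
population = [rng.gauss(50, 10) for _ in range(10_000)]

n = 25                 # size of each sample
repeats = 2_000        # how many samples we draw

# Draw many samples (without replacement) and store each sample mean
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

pop_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)   # centre of the sampling distribution
se = statistics.stdev(sample_means)             # its SD, i.e. the standard error

print(round(pop_mean, 2), round(mean_of_means, 2), round(se, 2))
```

With these settings the SE should land close to 10 / √25 = 2.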

8
Q

What are the two possible solutions to getting enough sense of the variability in sample estimates when collecting samples from a population? (explain both processes)

A

Theoretical solution

  • Collect one sample
  • Estimate the standard error using the formula

Bootstrap solution

  • Collect one sample
  • Mimic the act of repeated sampling from the population by repeated resampling with replacement from the original sample
  • Estimate the standard error using the standard deviation of the distribution of resample statistics
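
A rough comparison of the two solutions for the sample mean, assuming "the formula" is SD / √n. The sample is simulated and the number of resamples (2,000) is an arbitrary choice.

```python
import math
import random
import statistics

rng = random.Random(7)
sample = [rng.gauss(100, 15) for _ in range(50)]   # one made-up sample
n = len(sample)

# Theoretical solution: plug the sample SD into SD / sqrt(n)
theoretical_se = statistics.stdev(sample) / math.sqrt(n)

# Bootstrap solution: resample with replacement from the one sample,
# then take the SD of the resample means
boot_means = [statistics.mean(rng.choices(sample, k=n)) for _ in range(2_000)]
bootstrap_se = statistics.stdev(boot_means)

print(round(theoretical_se, 3), round(bootstrap_se, 3))
```

For a well-behaved statistic such as the mean, the two values should be similar; the bootstrap route is most useful when no simple formula is available or its assumptions are in doubt.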
9
Q

What is a bootstrap distribution?

A

The distribution of a statistic across bootstrap resamples, made up of the value of that statistic calculated on each resample

10
Q

How do you get a bootstrap distribution?

A

Start with an initial sample of size n.

Take k resamples (sampling with replacement) of size n, and calculate your statistic on each one.

As k→∞, the distribution of the k resample statistics begins to approximate the sampling distribution.
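
These steps could be wrapped in a small helper that works for any statistic. The function name `bootstrap_distribution`, its arguments and the data are hypothetical, chosen only for this sketch; the median is used to show that the statistic need not be the mean.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, k, seed=0):
    """Return k values of `statistic`, one per resample of size len(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(k)]

sample = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]   # initial sample, made up
boot_medians = bootstrap_distribution(sample, statistics.median, k=1_000)

print(len(boot_medians), min(boot_medians), max(boot_medians))
```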

11
Q

What size should each bootstrap sample be?

A

The same as the original n

12
Q

What is bootstrap standard error?

A

Bootstrap SE = SD of bootstrap distribution

13
Q

What is a confidence interval?

A

Defines a plausible range of values for the population parameter

To estimate one, we need:

  • A confidence level
  • A measure of sampling variability (e.g. SE/bootstrap SE)
14
Q

What is an [x]% confidence interval?

A

Across repeated samples, [x]% of the confidence intervals would be expected to contain the true population parameter value.

So with 95% confidence intervals from 100 samples, about 95 of the intervals would be expected to contain the true population mean

This is subtly different from saying that we are 95% confident that the true mean is inside our interval. The 95% probability is related to the long-run frequencies of our intervals.
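
The long-run reading can be checked with a simulation in which the true mean is known. Everything here (true mean 10, SD 2, n = 40, 1,000 repeated samples) is an assumed setup for illustration.

```python
import math
import random
import statistics

rng = random.Random(123)
true_mean, true_sd, n = 10.0, 2.0, 40   # assumed population values
n_samples = 1_000

covered = 0
for _ in range(n_samples):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    # Does this sample's 95% interval contain the true mean?
    if m - 1.96 * se <= true_mean <= m + 1.96 * se:
        covered += 1

coverage = covered / n_samples   # proportion of intervals containing the true mean
print(coverage)
```

The proportion should come out near 0.95. Any single interval either contains the true mean or does not; the 95% describes the procedure over many samples.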

15
Q

How do you calculate confidence intervals?

A

68/95/99.7 rule:

Sampling distributions become approximately normal as n increases, so the fixed properties of the normal distribution apply

  • 68% of density falls within 1 SD of the mean
  • 95% of density falls within 1.96 SD of the mean
  • 99.7% of density falls within 3 SD of the mean

For 95%:

  • Lower bound = mean - 1.96*SE
  • Upper bound = mean + 1.96*SE
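
A short sketch of the calculation, using `statistics.NormalDist` from the Python standard library to recover the normal-distribution figures. The data are made up, and the SE is taken to be SD / √n.

```python
import math
import statistics
from statistics import NormalDist

sample = [4.8, 5.6, 5.1, 6.2, 4.9, 5.5, 5.8, 5.0, 5.3, 6.0]   # made-up data
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))

std_normal = NormalDist()                 # mean 0, SD 1
z = std_normal.inv_cdf(0.975)             # multiplier for 95%, about 1.96
lower = mean - z * se
upper = mean + z * se

# Density within 1 SD and within 3 SD of the mean
within_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_3sd = std_normal.cdf(3) - std_normal.cdf(-3)

print(round(z, 2), round(lower, 2), round(upper, 2))
```

Swapping `se` for a bootstrap SE gives the bootstrap version of the same interval.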
16
Q

How do we compute a bootstrap distribution of any statistic (other than the mean)?

A

We can calculate β coefficients, R², F-statistics etc.

In each case:

  1. Generate a resample
  2. Run the linear model
  3. Save the statistic of interest
  4. Repeat this K times
  5. Generate the distribution of the K statistics of interest
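
As an illustration, the same loop for the slope of a simple linear regression. The data are simulated, the helper `ols_slope` is hypothetical, and whole rows (cases) are resampled so that each x stays paired with its y.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(2024)
# Made-up data: y = 2 + 0.5 * x + noise
xs = [float(i) for i in range(1, 31)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
rows = list(zip(xs, ys))

K = 1_000
boot_slopes = []
for _ in range(K):
    resample = rng.choices(rows, k=len(rows))   # 1. generate a resample of rows
    rx, ry = zip(*resample)
    boot_slopes.append(ols_slope(rx, ry))       # 2-3. fit the model, save the statistic

slope_se = statistics.stdev(boot_slopes)        # SD of the K bootstrap slopes
print(round(ols_slope(xs, ys), 3), round(slope_se, 3))
```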

17
Q

How is bootstrapping done in R?

A

With the Boot() function from the car package

Steps:

  1. Run the model
  2. Load car
  3. Run Boot() on the model
  4. See the summary of results
  5. Calculate confidence intervals
18
Q

What does each argument in the function Boot() mean?

A

f = the function that calculates the statistic(s) of interest from the model
R = number of bootstrap resamples
ncores = number of cores to use; a value greater than 1 means calculations are performed in parallel

19
Q

How do you compute standard error of a bootstrap?

A

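Theoretical SE of the mean: SE = σ / √n, where σ is the SD and n is the sample size

Bootstrap SE: the SD of the bootstrap distribution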