GLM and confidence intervals Flashcards

1
Q

linear models

A

a statistical model is an equation
- summarizes, represents, and predicts the values of a variable or variables
- a linear model is a simple line that summarizes the relationship between x and y

2
Q

linear model equation

A

y = b0 + b1x

b0 -> intercept
b1 -> slope
- b0 and b1 are the regression coefficients

3
Q

linear model predict

A

for any value of xi, we can use the linear model to predict the value of yi (y-hat i)

y-hat i = b0 + b1xi

4
Q

residual

A

difference between a score and the value predicted by the model

errori = yi - y-hat i

5
Q

GLM components

A

model and error

yi = model + errori
yi = b0 + b1xi + errori

the model predicts the value of y-hat i

the error tells us the difference between each score and the value predicted by the model

errori = yi - model
errori = yi - y-hat i
errori = yi - (b0 + b1xi)
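
a minimal sketch of this decomposition in R (R is assumed here because the deck's final card uses epi.conf(); the data are invented for illustration):

  set.seed(1)                   # reproducible example
  x <- 1:20                     # predictor scores (hypothetical)
  y <- 3 + 0.5 * x + rnorm(20)  # outcome = model + error
  fit <- lm(y ~ x)              # least-squares fit of yi = b0 + b1xi + errori
  coef(fit)                     # b0 (intercept) and b1 (slope)
  y_hat <- fitted(fit)          # model predictions (y-hat i)
  errors <- y - y_hat           # residuals: errori = yi - y-hat i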

6
Q

general linear model equation

A

equation:
yi = (b0 + b1xi) + errori

writing the model without the error term gives the part of the equation used to predict values of yi

errori = yi - y-hat i

yi = model + errori
yi = y-hat i + errori

7
Q

GLM fitted to a dataset of scores from a single variable

A

with a single variable (y) there is no predictor variable (x), so the model is just a constant

yi = b0 + errori

to fit the model that best predicts the values of y, use the sample mean (b0 = y-bar)

8
Q

different statistical methods for defining the best model for a data set

A
  • robust regression (M-estimation, Huber regression, Theil-Sen regression)
  • quantile regression
  • least absolute shrinkage and selection operator (LASSO)

most common = least squares error method (LSE)
- also called the ordinary least squares method (OLS)

9
Q

LSE

A

LSE defines the best model as the one that generates the smallest total squared error

total squared error = Sum(yi - y-bar)^2 (for the single-variable model of the previous card, where y-hat i = y-bar)

the total squared error is also called the sum of squared residuals (SSr)
SSr = Sum(yi - y-bar)^2
- the model that best fits the data is the one that generates the smallest value of SSr
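
a quick R illustration (invented data) that the sample mean minimizes SSr for the single-variable model:

  set.seed(1)
  y <- rnorm(50, mean = 10, sd = 2)            # hypothetical sample of scores
  ssr <- function(b0) sum((y - b0)^2)          # SSr for the constant model yi = b0 + errori
  ssr(mean(y))                                 # SSr at b0 = y-bar
  optimize(ssr, interval = range(y))$minimum   # numerical minimum lands on y-bar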

10
Q

why can’t we just calculate the sum of the errors?

A

if we sum all the deviation scores, we always get a value of zero (deviations above the mean exactly cancel deviations below it)

we avoid this problem by squaring the deviation scores, then summing the squared deviation scores (as in the calculation of the standard deviation)

11
Q

estimating the population mean

A

a sample should be drawn at random from a population
- 2 samples from the same population will probably contain different individuals, with different scores
- 2 samples from the same population are unlikely to have identical means
- it is unlikely that the mean of a single sample will be identical to the mean of the underlying population
- this variation between sample statistics is called sampling variation

12
Q

confidence intervals

A

a 95% CI is a range of values that will contain the population parameter 95% of the time (across repeated samples)

13
Q

distribution of sample mean (DSM)

A

shows the distribution of the means of all possible samples drawn from a population
- assumes the sample is representative of the population

14
Q

bootstrapping

A

if the sample is representative of the population, we can use the sample data to create a hypothetical population that should approximate the real population
- the hypothetical population has the same composition as the sample but is infinitely large
- every score in the sample is equally represented in the hypothetical population
- for data with n = 50, each score in the sample represents 2% of all scores in the hypothetical population

then randomly sample n = 50 scores from the hypothetical population and calculate the sample mean

repeat the sampling many times to generate many sample means (see the sketch below)
- plotting a histogram of the sample means generates a DSM
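
a minimal R sketch of this procedure (sampling with replacement from the sample itself is equivalent to sampling from the infinite hypothetical population; the data are invented):

  set.seed(1)
  y <- rexp(50, rate = 0.2)                  # example sample, n = 50
  boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
  hist(boot_means)                           # histogram of sample means = the DSM
  quantile(boot_means, c(0.025, 0.975))      # boundaries of the 2.5% tails (next card)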

15
Q

bootstrapped 95% CI

A

identified by the values of y-bar at the boundaries of the 2.5% tails of the bootstrapped DSM (the quantile() line in the sketch above)

16
Q

reproducibility of simulation-based results

A

bootstrapping involves analysis of large numbers of randomly selected samples

if you repeat the analysis, you will obtain a different set of randomly selected samples, so you may obtain different results
- bootstrapping is not perfectly reproducible: you get slightly different results each time you run the analysis

to maximize reproducibility, use a large number of iterations of the simulation (see the note below)
- at least 10,000
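
in R, a run can also be made exactly repeatable by fixing the random seed before the simulation (a base-R facility, not something this card requires):

  set.seed(42)                               # fix the random number stream
  y <- rexp(50, rate = 0.2)                  # invented data, as in the sketch above
  boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
  # rerunning this whole chunk reproduces boot_means exactly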

17
Q

what does bootstrapping do?

A

generates a DSM based entirely on the data from your sample
- no additional assumptions are required
- the CLT, by contrast, requires additional assumptions
- bootstrapping is independent and self-contained

18
Q

standard deviation of the sample of y values

A

Sy

19
Q

standard deviation of the distribution of sample means

A

Sy-bar

20
Q

shape of the bootstrapped DSM

A

the bootstrapped DSM approximates the normal distribution, which is defined by two parameters
- mean and standard deviation

21
Q

impact of Sy on DSM

A

smaller Sy = narrower (more concentrated) normal distribution for the DSM

22
Q

impact of Sy on Sy-bar

A

larger Sy = larger Sy-bar

23
Q

impact of n on DSM

A

the smaller the sample size, the less closely the DSM approximates the normal distribution

24
Q

impact of n on Sy-bar

A

larger n = smaller Sy-bar

25
Q

impact of Sy and n on Sy-bar

A

Sy-bar is proportional to Sy/sqrt(n)

when samples are drawn from a normally-distributed population, the bootstrapped DSM approximates the normal distribution

for larger samples, the standard deviation of the DSM can be estimated from sample data using:

Sy-bar = Sy/sqrt(n)
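
a small R check of this relationship by brute force (population parameters invented for the demonstration):

  set.seed(1)
  pop <- rnorm(1e6, mean = 50, sd = 10)             # hypothetical normal population
  means <- replicate(10000, mean(sample(pop, 25)))  # sample means for n = 25
  sd(means)                                         # empirical Sy-bar...
  10 / sqrt(25)                                     # ...close to sigma/sqrt(n) = 2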

26
Q

standard error

A

for larger samples, the standard deviation of the bootstrapped DSM is equal to the standard error

the standard error is the typical deviation between a sample mean (y-bar) and the mean of the DSM

27
Q

variance sum law

A
  • objective of the variance sum law: determine Var(Sum(yi)), the variance of the sum of the scores

variance is the average squared deviation score
- because variance is an average, we can say that the average squared deviation of each score (yi) from the mean is Sy^2
- although Sy^2 is calculated for the whole sample, variance can be treated as a property of each individual score
- summing these per-score variances over n independent scores gives Var(Sum(yi)) = n x Sy^2

28
Q

variance of a ‘constant times a random variable’

A

if you multiply a variable by a constant, the variance of the variable increases by the square of the constant: Var(c x y) = c^2 x Var(y)

29
Q

deriving standard error

A

the standard error is the standard deviation of the distribution of sample means
- squaring the standard error gives the variance of the sample means

1. calculate the sample mean from the scores using y-bar = Sum(yi)/n
2. use the variance sum law to calculate the variance of Sum(yi)
3. use the 'variance of a constant times a random variable' rule to calculate the variance of Sum(yi)/n, by multiplying the variance of Sum(yi) by (1/n)^2
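
putting the three steps together (each line uses only the rules stated on the cards above):

  Var(Sum(yi)) = n x Sy^2                        (variance sum law)
  Var(y-bar) = (1/n)^2 x n x Sy^2 = Sy^2/n       (constant-times-variable rule, constant = 1/n)
  Sy-bar = sqrt(Sy^2/n) = Sy/sqrt(n)             (the standard error)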

30
Q

DSM from a skewed population

A

estimate the population mean by generating a bootstrapped DSM

procedure:
1. the sample is expanded to infinite size (the hypothetical population)
2. one sample is selected at random, and its mean calculated
3. this is repeated 1,000,000 times to generate a million sample means
4. the distribution of sample means is visualized using a histogram

31
Q

bootstrapped DSM

A

for larger n, the DSM approximates to the normal distribution even if the underlying population does not follow the normal distribution
- (its standard deviation is the standard error: Sy-bar = Sy/sqrt(n))

for smaller n, the DSM still approximates reasonably closely to the normal distribution, but only if the underlying population is normal
- (the standard deviation is slightly overestimated by Sy/sqrt(n))

for smaller n, if the underlying population is not normally distributed, the DSM can deviate quite markedly from the normal distribution

32
Q

central limit theorem

A

an alternative approach used to estimate the DSM from sample data

when samples are large (n > 30), the distribution of sample means will be normal regardless of the distribution of the scores in the underlying population, and the standard deviation of the distribution of sample means will equal the standard error

the CLT allows estimation of the DSM without performing thousands of simulations, so it was a far more efficient approach prior to modern computing

33
Q

using CLT

A

uses the normal distribution to estimate the shape of the DSM
- the normal distribution is defined by two parameters (mu and sigma)

the CLT uses the sample mean (y-bar) to estimate the mean of the DSM
the CLT uses the sample standard deviation (Sy) and the sample size (n) to estimate the standard deviation of the DSM

the CLT therefore defines the DSM by assuming its shape and estimating its two parameters from sample data

34
Q

CLT with smaller samples

A

with normally distributed samples of fewer than 30 scores, the CLT approach assumes that the DSM approximates to the t distribution with df = n - 1
- smaller samples make Sy-bar a less accurate estimate of the standard deviation of the DSM
- using the t distribution compensates for this reduced accuracy

if n < 30 and the scores are not normally distributed:
- don't use the CLT
- the DSM can deviate too dramatically from the t distribution
- conclusions based on the assumption that the DSM approximates to the t distribution can be erroneous

35
Q

CLT and CI

A

estimate the DSM using the CLT, then use the DSM to determine the 95% CI

procedure:
1. determine which distribution is appropriate (normal, or t with the appropriate number of df)
2. use y-bar, Sy, and n to define the mean (y-bar) and standard deviation (Sy-bar) of the DSM
3. the 95% CI is given by the boundaries of the central 95% of means from the DSM

instead of visualizing the DSM, we can find t-crit (the critical value of t at the boundary of the 2.5% tails); see the sketch below

95% CI = y-bar +/- (t-crit x Sy-bar)
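
a minimal sketch of this formula in base R (invented data; qt() returns the critical t value):

  set.seed(1)
  y <- rnorm(25, mean = 10, sd = 2)     # example sample
  n <- length(y)
  se <- sd(y) / sqrt(n)                 # Sy-bar, the standard error
  tcrit <- qt(0.975, df = n - 1)        # critical t at the 2.5% tails
  mean(y) + c(-1, 1) * tcrit * se       # 95% CI: y-bar +/- (t-crit x Sy-bar)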

36
Q

sample size and accuracy of mu

A

as sample size increases, the accuracy of the estimate of mu increases
- due to the reduction in Sy-bar and the change in t-crit

37
Q

comparison between bootstrapping and CLT

A

with n = 100
- both methods of DSM generation produce almost identical results

with large n
- both approaches produce similar results even if the population of scores is skewed

the bootstrapped 95% CI is narrower, especially for smaller n

a bootstrapped 95% CI generated from a skewed population is more likely to be asymmetrical, especially for lower n
- CLT 95% CIs are always symmetrical, as they are generated by assuming a normal distribution

38
Q

advantages of bootstrapping

A

bootstrapping requires fewer assumptions
- if n < 30, the CLT relies on the assumption that the scores are normally distributed
- this assumption may be incorrect, resulting in a DSM that is incorrect

bootstrapping makes fuller use of the data
- every score in the sample is represented within the hypothetical population used for generating the DSM
- the CLT uses all the scores to estimate Sy, then uses this value to define the standard deviation of the DSM
- reducing all the scores to a single summary statistic results in a loss of information, potentially reducing the accuracy of the DSM
- the potential error associated with the estimate of Sy-bar is compensated for by making the 95% CI wider through use of the t distribution

bootstrapping is easier to understand
- the CLT and the normal distribution involve complex mathematics
- bootstrapping only requires calculating the mean, repeated many times

39
Q

advantages of CLT

A

estimating the DSM using the CLT requires only one calculation
- bootstrapping requires thousands of computations

the CLT is central to classical statistics, while bootstrapping is more recent
- bootstrapped methods haven't been developed for all statistical analyses or implemented in all software (e.g., not in Excel)

CLT results are consistent
- bootstrap analysis involves randomly sampling scores, so results will differ slightly every time the bootstrapping is repeated (though the difference will not be large)

40
Q

bootstrapping vs. CLT

A

the advantages of bootstrapping outweigh its disadvantages

unlike the CLT, bootstrapping doesn't rely on the assumption of normality where n < 30, so it is less likely to generate incorrect results if that assumption is incorrect

41
Q

skewed population distribution

A

if the population is heavily skewed, it is better to use bootstrapping than the CLT if n < 30

if the population is heavily skewed, the mean is not an appropriate choice of statistic (the median is a better measure of central tendency for skewed data)

it is generally preferable to transform the data to reduce/remove the skew, then calculate the DSM based on the transformed data

42
Q

epi.conf() 95% CI CLT

A

epi.conf(y, conf.level = 0.95)

generates:
- the sample mean
- the standard error
- the lower 95% CI boundary
- the upper 95% CI boundary
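
epi.conf() comes from the epiR package, which must be loaded first; a minimal usage sketch (invented data):

  # install.packages("epiR")           # one-off installation, if needed
  library(epiR)
  set.seed(1)
  y <- rnorm(40, mean = 10, sd = 2)    # example scores
  epi.conf(y, conf.level = 0.95)       # CLT-based mean, SE, and 95% CI boundaries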