GLM and confidence intervals Flashcards
linear models
a statistical model is an equation
- summarizes, represents and predicts values of a variable or variables
a linear model is a simple line that summarizes the relationship between x and y
linear model equation
y = b0 + b1x
b0 -> intercept
b1 -> slope
- regression coefficients
linear model predict
for any value of xi, we can use the linear model to predict the value of yi (y-hat i)
y-hat i = b0 + b1xi
residual
difference between a score and the value predicted by the model
errori = yi - y-hat i
GLM components
model and error
yi = model + errori
yi = b0 + b1xi + errori
the model predicts the value of y-hat i
the error tells us the difference between each score and the value predicted by the model
errori = yi - model
errori = yi - y-hat i
errori = yi - (b0 + b1xi)
general linear model equation
yi = (b0 + b1xi) + errori
writing the model without the error term gives the equation used to predict values of yi (y-hat i)
errori = yi - y-hat i
yi = model + errori
yi = y-hat i + errori
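As a minimal sketch in R (made-up data; the variable names are illustrative), the model and error components can be inspected after fitting with lm():
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)   # scores built as b0 + b1xi + errori
fit <- lm(y ~ x)               # fit the linear model by least squares
coef(fit)                      # b0 (intercept) and b1 (slope)
fitted(fit)                    # predicted values (y-hat i)
residuals(fit)                 # errors: yi - y-hat i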
GLM fitted to a dataset of scores from a single variable
with a single variable (y) there is no predictor variable (x), so the model is just a constant
yi = b0 + errori
to fit the model that best predicts values of y, use the sample mean (b0 = y-bar)
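A quick check in R (made-up scores): fitting an intercept-only model with lm() recovers the sample mean as b0:
y <- c(4, 7, 5, 9, 6)          # hypothetical scores
fit0 <- lm(y ~ 1)              # model with no predictor: yi = b0 + errori
coef(fit0)                     # b0
mean(y)                        # identical to b0 (y-bar)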
different statistical methods for defining the best model for a data set
- robust regression (M estimation, Huber regression, Theil-Sen regression)
- quantile Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
most common = least squares error method (LSE)
- also called ordinary least squares method (OLS)
LSE
LSE defines the best model as the one that generates the smallest total squared error
total squared error = Sum(yi - y-hat i)^2
sum of squared residuals = SSr
SSr = Sum(yi - y-hat i)^2
- for the constant model (yi = b0 + errori), y-hat i = y-bar, so SSr = Sum(yi - y-bar)^2
- the model that best fits the data is the model that generates the smallest value of SSr
why can’t we just calculate sum of the error?
if we sum all deviation scores, we always get a value of zero
avoid this problem by squaring the deviation scores, then summing the squared deviation scores (like standard deviation)
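A short R sketch (made-up scores) illustrating both points: raw deviations cancel to zero, and the mean gives the smallest SSr:
y <- c(4, 7, 5, 9, 6)                # hypothetical scores
sum(y - mean(y))                     # deviation scores sum to (effectively) zero
SSr <- function(b0) sum((y - b0)^2)  # sum of squared residuals for a constant model
SSr(mean(y))                         # SSr when b0 = y-bar
SSr(mean(y) + 1)                     # any other b0 gives a larger SSr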
estimating the population mean
a sample should be drawn at random from a population
- two samples from the same population will probably contain different individuals, with different scores
- two samples from the same population are unlikely to have identical means
- it is unlikely that the mean of a single sample will be identical to the mean of the underlying population
- variation between sample statistics is sampling variation
confidence intervals
a 95% CI is a range of values calculated so that, across repeated samples, 95% of the intervals will contain the population parameter
distribution of sample mean (DSM)
shows the distribution of the means of all possible samples (of size n) drawn from a population
- assume the sample is representative of the population
bootstrapping
if the sample is representative of the pop, we can use the sample data to create a hypothetical pop which should approximate the real pop
- hypothetical pop= same composition of sample but is infinitely large
- every score in sample is equally represented in pop
- for data with n= 50, each score in sample represents 2% of all scores in the hypothetical population
then randomly sample n = 50 scores from the hypothetical pop and calculate the sample mean
repeat the sampling many times to generate many sample means
- plotting histogram of sample means generates a DSM
bootstrapped 95% CI
the boundaries of the central 95% of the DSM, i.e. the values of y-bar at the 2.5% tails
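A minimal bootstrap sketch in R (made-up data; resampling with replacement from the sample stands in for drawing from the infinite hypothetical pop):
set.seed(42)                          # fix the seed so the simulation is reproducible
y <- rnorm(50, mean = 10, sd = 2)     # hypothetical sample, n = 50
boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
hist(boot_means)                      # the bootstrapped DSM
quantile(boot_means, c(0.025, 0.975)) # bootstrapped 95% CI at the 2.5% tails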
reproducibility of simulation -based results
bootstrapping involves analysis of large numbers of randomly-selected samples
if you repeat the analysis, you will obtain a different set of randomly-selected samples, so you may obtain different results
- bootstrapping is not perfectly reproducible - you get different results each time you do the analysis
to maximize reproducibility, you can use large numbers of iterations of the simulation
- at least 10,000
what does bootstrapping do?
generates a DSM based entirely on the data from your sample
- no additional assumptions are required
- CLT requires additional assumptions
- bootstrapping is independent and self-contained
standard deviation of the sample of y values
Sy
standard deviation of the distribution of sample means
Sy-bar
shape of the bootstrapped DSM
normal distribution is defined by two parameters
- mean , standard deviation
impact of Sy on DSM
smaller Sy = narrower (skinnier) normal distribution for the DSM
impact of Sy on Sy-bar
larger Sy = larger Sy-bar
impact of n on DSM
smaller sample size = the DSM approximates the normal distribution less closely
impact of n on Sy-bar
smaller n = larger Sy-bar
impact of Sy and n on Sy-bar
Sy-bar proportional to Sy/sqrt(n)
when samples are drawn from normally-distributed population, the bootstrapped DSM approximates the normal distribution
for larger samples, the standard deviation of the DSM can be estimated from sample data using:
Sy-bar = Sy/sqrt(n)
standard error
for larger samples, the standard deviation of the bootstrapped DSM is equal to the standard error
the standard error quantifies the typical deviation of a sample mean (y-bar) from the mean of the DSM
variance sum law
- objective of the variance sum law here: determine Var(Sum(yi)), the variance of the sum of the scores
variance is the average squared deviation score
- because variance is an average, we can say that the average squared deviation of each score (yi) from the mean is Sy^2
- although Sy^2 is calculated for the whole sample, this variance can be treated as a property of each individual score
- for n independent scores, the variance sum law then gives Var(Sum(yi)) = n x Sy^2
variance of a ‘constant times a random variable’
if you multiply a variable by a constant, the variance of the variable is multiplied by the constant squared: Var(c x y) = c^2 x Var(y)
deriving standard error
standard deviation of the distribution of sample means
- squaring standard error gives variance of sample means
1. calculate the sample mean from the scores using y-bar = Sum(yi)/n
2. use the variance sum law to calculate the variance of Sum(yi): Var(Sum(yi)) = n x Sy^2
3. use the 'constant times a random variable' rule to calculate the variance of Sum(yi)/n: multiply Var(Sum(yi)) by (1/n)^2, giving Var(y-bar) = Sy^2/n
4. take the square root to obtain the standard error: Sy-bar = Sy/sqrt(n)
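A simulation sketch in R (made-up population values) confirming the derivation:
set.seed(1)
n <- 50
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = 2)))
sd(sample_means)                      # standard deviation of the DSM
2 / sqrt(n)                           # sigma/sqrt(n): closely matches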
DSM from a skewed population
estimate the population mean by generating a bootstrapped DSM
procedure
1. sample expanded to an infinitely large hypothetical population
2. one sample selected at random, and its mean calculated
3. repeated 1,000,000 times to generate a million sample means
4. distribution of sample means visualized using a histogram
bootstrapped DSM
For larger n, the DSM approximates to the normal distribution even if the underlying population does not follow the normal distribution
- (standard deviation is standard error: sy-bar= Sy/sqrt(n))
for smaller n, the DSM still approximates the normal distribution reasonably closely, but only if the underlying population is normal
- (standard deviation is slightly overestimated by: Sy/sqrt(n))
for smaller n, if the underlying pop is not normally-distributed the DSM can deviate quite markedly from the normal distribution
central limit theorem
alternative approach used to estimate the DSM from sample data
when samples are large, n>30, the distribution of sample means will be normal regardless of the distribution of the scores in the underlying population. the standard deviation of the distribution of sample means will equal the standard error
CLT allows estimation of the DSM without performing thousands of simulations, so was a far more efficient approach prior to modern computing
using CLT
uses normal distribution to estimate the shape of the DSM
- normal distribution is defined by two parameters (mu and sigma)
CLT uses sample mean (y-bar) to estimate the mean of DSM
CLT uses sample standard deviation (Sy) and sample size (n) to estimate the standard deviation of DSM
CLT therefore defines the DSM by assuming the shape, and estimating two parameters from sample data
CLT with smaller samples
with samples of fewer than 30 scores drawn from a normally-distributed population, the CLT approach assumes that the DSM approximates to the t distribution with df = n - 1
- smaller samples causes Sy-bar to be less accurate as an estimate of the standard deviation of the DSM
- using the t distribution compensates for this reduced accuracy
if n < 30 and the scores are not normally distributed
- don’t use CLT
- DSM can deviate too dramatically from t distribution
- conclusions based on assumption that the DSM does approx. to the t distribution can be erroneous
CLT and CI
estimate the DSM using the CLT, then use the DSM to determine 95% CIs
procedure:
1. determine which distribution is appropriate (normal, t, number of df)
2. use y-bar, Sy, and n to define the mean (y-bar) and standard deviation of the DSM (Sy-bar)
3. the 95% CI is given by the boundaries of the central 95% of means from the DSM
instead of visualizing the DSM, we can find the t-crit (critical value of t at the boundary of the 2.5% tails)
95% CI = y-bar +/- (tcrit x Sy-bar)
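The same calculation as a short R sketch (made-up data):
set.seed(7)
y <- rnorm(25, mean = 10, sd = 2)     # hypothetical sample, n = 25
n <- length(y)
se <- sd(y) / sqrt(n)                 # Sy-bar, the standard error
t_crit <- qt(0.975, df = n - 1)       # critical t at the 2.5% tail boundary
mean(y) + c(-1, 1) * t_crit * se      # 95% CI = y-bar +/- (t-crit x Sy-bar)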
sample size and accuracy of mu
as sample size increases, the accuracy of the estimate of mu increases
- due to the reduction in Sy-bar and change in t-crit
comparison between bootstrapping and CLT
with n = 100
- both methods of DSM generation produce almost identical results
with large n
- both approaches have similar results even if population of scores is skewed
bootstrapped 95% CI is narrower, especially for smaller n
bootstrapped 95% CI generated from skewed population is more likely to be asymmetrical, especially for lower n
- CLT 95% CIs are always symmetrical, as they are generated by assuming a normal (or t) distribution
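A sketch of the comparison in R (made-up skewed data):
set.seed(3)
y <- rexp(20, rate = 1)               # small skewed sample, n = 20
# CLT/t-based 95% CI: always symmetrical around y-bar
mean(y) + c(-1, 1) * qt(0.975, df = 19) * sd(y) / sqrt(20)
# bootstrapped 95% CI: may be asymmetrical around y-bar
boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))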
advantages of bootstrapping
bootstrapping requires fewer assumptions
- if n< 30, CLT relies on the assumption that the scores are normally-distributed
- assumption may be incorrect, resulting in DSM that is incorrect
bootstrapping makes fuller use of the data
- every score in the sample is represented within the hypothetical pop used for generating the DSM
- CLT uses all scores to estimate Sy, then uses this value to define the standard deviation of DSM
- reducing all scores to a single summary statistic results in loss of information, potentially reducing the accuracy of the DSM
- potential error associated with the estimate of Sy-bar is compensated for by making the 95% CI wider through use of the t distribution
bootstrapping is easier to understand
- the CLT and the normal distribution involve complex math
- bootstrapping only requires calculating the mean, many times over
advantages of CLT
estimating the DSM using the CLT requires a single calculation
- bootstrapping requires thousands of computations
CLT is central to classical stats while bootstrapping is more recent
- bootstrapped methods haven’t been developed for all statistical analyses or software (not in Excel)
CLT results are consistent
- bootstrap analysis involves randomly sampling scores, so results will be different every time bootstrapping is repeated (difference will not be large)
bootstrapping vs. CLT
advantages of bootstrapping outweigh disadvantages
unlike the CLT, bootstrapping doesn't rely on the assumption of normality when n < 30, so it is less likely to generate incorrect results if that assumption is violated
skewed population distribution
if the population is heavily skewed, it is better to use bootstrapping than the CLT when n < 30
if the population is heavily skewed, the mean is not an appropriate choice of statistic (the median is a better measure of central tendency for skewed data)
it is generally preferred to transform the data to reduce/remove the skew, then calculate the DSM based on the transformed data
epi.conf(): 95% CI via the CLT
epi.conf(y, conf.level=0.95)
generates:
- sample mean
- standard error
- lower 95% CI boundary
- upper 95% CI boundary
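A usage sketch, assuming the epiR package (where epi.conf() is defined); ctype = "mean.single" is the documented option for a single-sample mean, spelled out here for clarity:
library(epiR)
set.seed(9)
y <- rnorm(100, mean = 10, sd = 2)    # made-up sample
epi.conf(y, ctype = "mean.single", conf.level = 0.95)
# returns the sample mean, standard error, and lower/upper 95% CI boundaries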