GLM and confidence intervals Flashcards
linear models
a statistical model is an equation
- summarizes, represents and predicts values of a variable or variables
a linear model is a simple line that summarizes the relationship between x and y
linear model equation
y = b0 + b1x
b0 -> intercept
b1 -> slope
- regression coefficients
linear model predict
for any value of xi, we can use the linear model to predict the value of yi (y-hat i)
y-hat i = b0 + b1xi
residual
difference between a score and the value predicted by the model
errori = yi - y-hat i
GLM components
model and error
yi = model + errori
yi = b0 + b1xi + errori
the model predicts the value of y-hat i
the error tells us the difference between each score and the value predicted by the model
errori = yi - model
errori = yi - y-hat i
errori = yi - (b0 + b1xi)
general linear model equation
yi = (b0 + b1xi) + errori
writing the model without the error term gives the equation used to predict values of yi (y-hat i)
errori = yi - y-hat i
yi = model + errori
yi = y-hat i + errori
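As a minimal sketch in R (made-up data; the variable names are illustrative), the model and error components can be inspected after fitting with lm():
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)   # scores built as b0 + b1xi + errori
fit <- lm(y ~ x)               # fit the linear model by least squares
coef(fit)                      # b0 (intercept) and b1 (slope)
fitted(fit)                    # predicted values (y-hat i)
residuals(fit)                 # errors: yi - y-hat i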
GLM fitted to a dataset of scores from a single variable
with a single variable (y) there is no predictor variable (x), so the model is just a constant
yi = b0 + errori
to fit the model that best predicts values of y, use the sample mean (b0 = y-bar)
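A quick check in R (made-up scores): fitting an intercept-only model with lm() recovers the sample mean as b0:
y <- c(4, 7, 5, 9, 6)          # hypothetical scores
fit0 <- lm(y ~ 1)              # model with no predictor: yi = b0 + errori
coef(fit0)                     # b0
mean(y)                        # identical to b0 (y-bar)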
different statistical methods for defining the best model for a data set
- robust regression (M estimation, Huber regression, Theil-Sen regression)
- quantile Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
most common = least squares error method (LSE)
- also called ordinary least squares method (OLS)
LSE
LSE defines the best model as the one that generates the smallest total squared error
total squared error = Sum(yi - y-hat i)^2
sum of squared residuals = SSr
SSr = Sum(yi - y-hat i)^2
- for the constant model (yi = b0 + errori), y-hat i = y-bar, so SSr = Sum(yi - y-bar)^2
- the model that best fits the data is the model that generates the smallest value of SSr
why can’t we just calculate sum of the error?
if we sum all deviation scores, we always get a value of zero
avoid this problem by squaring the deviation scores, then summing the squared deviation scores (like standard deviation)
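A short R sketch (made-up scores) illustrating both points: raw deviations cancel to zero, and the mean gives the smallest SSr:
y <- c(4, 7, 5, 9, 6)                # hypothetical scores
sum(y - mean(y))                     # deviation scores sum to (effectively) zero
SSr <- function(b0) sum((y - b0)^2)  # sum of squared residuals for a constant model
SSr(mean(y))                         # SSr when b0 = y-bar
SSr(mean(y) + 1)                     # any other b0 gives a larger SSr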
estimating the population mean
a sample should be drawn at random from a population
- two samples from the same population will probably contain different individuals, with different scores
- two samples from the same population are unlikely to have identical means
- it is unlikely that the mean of a single sample will be identical to the mean of the underlying population
- variation between sample statistics is sampling variation
confidence intervals
a 95% CI is a range of values calculated so that, across repeated samples, 95% of the intervals will contain the population parameter
distribution of sample mean (DSM)
shows the distribution of the means of all possible samples (of size n) drawn from a population
- assume the sample is representative of the population
bootstrapping
if the sample is representative of the pop, we can use the sample data to create a hypothetical pop which should approximate the real pop
- hypothetical pop= same composition of sample but is infinitely large
- every score in sample is equally represented in pop
- for data with n= 50, each score in sample represents 2% of all scores in the hypothetical population
then randomly sample n = 50 scores from the hypothetical pop and calculate the sample mean
repeat the sampling many times to generate many sample means
- plotting histogram of sample means generates a DSM
bootstrapped 95% CI
the boundaries of the central 95% of the DSM, i.e. the values of y-bar at the 2.5% tails
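A minimal bootstrap sketch in R (made-up data; resampling with replacement from the sample stands in for drawing from the infinite hypothetical pop):
set.seed(42)                          # fix the seed so the simulation is reproducible
y <- rnorm(50, mean = 10, sd = 2)     # hypothetical sample, n = 50
boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
hist(boot_means)                      # the bootstrapped DSM
quantile(boot_means, c(0.025, 0.975)) # bootstrapped 95% CI at the 2.5% tails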
reproducibility of simulation -based results
bootstrapping involves analysis of large numbers of randomly-selected samples
if you repeat the analysis, you will obtain a different set of randomly-selected samples, so you may obtain different results
- bootstrapping is not perfectly reproducible - you get different results each time you do the analysis
to maximize reproducibility, you can use large numbers of iterations of the simulation
- at least 10,000
what does bootstrapping do?
generates a DSM based entirely on the data from your sample
- no additional assumptions are required
- CLT requires additional assumptions
- bootstrapping is independent and self-contained
standard deviation of the sample of y values
Sy
standard deviation of the distribution of sample means
Sy-bar
shape of the bootstrapped DSM
normal distribution is defined by two parameters
- mean , standard deviation
impact of Sy on DSM
smaller Sy = narrower (skinnier) normal distribution for the DSM
impact of Sy on Sy-bar
larger Sy = larger Sy-bar
impact of n on DSM
smaller sample size = the DSM approximates the normal distribution less closely
impact of n on Sy-bar
smaller n = larger Sy-bar
impact of Sy and n on Sy-bar
Sy-bar proportional to Sy/sqrt(n)
when samples are drawn from normally-distributed population, the bootstrapped DSM approximates the normal distribution
for larger samples, the standard deviation of the DSM can be estimated from sample data using:
Sy-bar = Sy/sqrt(n)
standard error
for larger samples, the standard deviation of the bootstrapped DSM is equal to the standard error
the standard error quantifies the typical deviation of a sample mean (y-bar) from the mean of the DSM
variance sum law
- objective of the variance sum law here: determine Var(Sum(yi)), the variance of the sum of the scores
variance is the average squared deviation score
- because variance is an average, we can say that the average squared deviation of each score (yi) from the mean is Sy^2
- although Sy^2 is calculated for the whole sample, this variance can be treated as a property of each individual score
- for n independent scores, the variance sum law then gives Var(Sum(yi)) = n x Sy^2
variance of a ‘constant times a random variable’
if you multiply a variable by a constant, the variance of the variable is multiplied by the constant squared: Var(c x y) = c^2 x Var(y)
deriving standard error
standard deviation of the distribution of sample means
- squaring standard error gives variance of sample means
1. calculate the sample mean from the scores using y-bar = Sum(yi)/n
2. use the variance sum law to calculate the variance of Sum(yi): Var(Sum(yi)) = n x Sy^2
3. use the 'constant times a random variable' rule to calculate the variance of Sum(yi)/n: multiply Var(Sum(yi)) by (1/n)^2, giving Var(y-bar) = Sy^2/n
4. take the square root to obtain the standard error: Sy-bar = Sy/sqrt(n)
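A simulation sketch in R (made-up population values) confirming the derivation:
set.seed(1)
n <- 50
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = 2)))
sd(sample_means)                      # standard deviation of the DSM
2 / sqrt(n)                           # sigma/sqrt(n): closely matches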
DSM from a skewed population
estimate the population mean by generating a bootstrapped DSM
procedure
1. sample expanded to an infinitely large hypothetical population
2. one sample selected at random, and its mean calculated
3. repeated 1,000,000 times to generate a million sample means
4. distribution of sample means visualized using a histogram
bootstrapped DSM
For larger n, the DSM approximates to the normal distribution even if the underlying population does not follow the normal distribution
- (standard deviation is standard error: sy-bar= Sy/sqrt(n))
for smaller n, the DSM still approximates the normal distribution reasonably closely, but only if the underlying population is normal
- (standard deviation is slightly overestimated by: Sy/sqrt(n))
for smaller n, if the underlying pop is not normally-distributed the DSM can deviate quite markedly from the normal distribution
central limit theorem
alternative approach used to estimate the DSM from sample data
when samples are large, n>30, the distribution of sample means will be normal regardless of the distribution of the scores in the underlying population. the standard deviation of the distribution of sample means will equal the standard error
CLT allows estimation of the DSM without performing thousands of simulations, so was a far more efficient approach prior to modern computing
using CLT
uses normal distribution to estimate the shape of the DSM
- normal distribution is defined by two parameters (mu and sigma)
CLT uses sample mean (y-bar) to estimate the mean of DSM
CLT uses sample standard deviation (Sy) and sample size (n) to estimate the standard deviation of DSM
CLT therefore defines the DSM by assuming the shape, and estimating two parameters from sample data
CLT with smaller samples
with samples of fewer than 30 scores drawn from a normally-distributed population, the CLT approach assumes that the DSM approximates to the t distribution with df = n - 1
- smaller samples causes Sy-bar to be less accurate as an estimate of the standard deviation of the DSM
- using the t distribution compensates for this reduced accuracy
if n < 30 and the scores are not normally distributed
- don’t use CLT
- DSM can deviate too dramatically from t distribution
- conclusions based on assumption that the DSM does approx. to the t distribution can be erroneous
CLT and CI
estimate the DSM using the CLT, then use the DSM to determine 95% CIs
procedure:
1. determine which distribution is appropriate (normal, t, number of df)
2. use y-bar, Sy, and n to define the mean (y-bar) and standard deviation of the DSM (Sy-bar)
3. the 95% CI is given by the boundaries of the central 95% of means from the DSM
instead of visualizing the DSM, we can find the t-crit (critical value of t at the boundary of the 2.5% tails)
95% CI = y-bar +/- (tcrit x Sy-bar)
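The same calculation as a short R sketch (made-up data):
set.seed(7)
y <- rnorm(25, mean = 10, sd = 2)     # hypothetical sample, n = 25
n <- length(y)
se <- sd(y) / sqrt(n)                 # Sy-bar, the standard error
t_crit <- qt(0.975, df = n - 1)       # critical t at the 2.5% tail boundary
mean(y) + c(-1, 1) * t_crit * se      # 95% CI = y-bar +/- (t-crit x Sy-bar)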
sample size and accuracy of mu
as sample size increases, the accuracy of the estimate of mu increases
- due to the reduction in Sy-bar and change in t-crit
comparison between bootstrapping and CLT
with n = 100
- both methods of DSM generation produce almost identical results
with large n
- both approaches have similar results even if population of scores is skewed
bootstrapped 95% CI is narrower, especially for smaller n
bootstrapped 95% CI generated from skewed population is more likely to be asymmetrical, especially for lower n
- CLT 95% CIs are always symmetrical, as they are generated by assuming a normal (or t) distribution
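A sketch of the comparison in R (made-up skewed data):
set.seed(3)
y <- rexp(20, rate = 1)               # small skewed sample, n = 20
# CLT/t-based 95% CI: always symmetrical around y-bar
mean(y) + c(-1, 1) * qt(0.975, df = 19) * sd(y) / sqrt(20)
# bootstrapped 95% CI: may be asymmetrical around y-bar
boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))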
advantages of bootstrapping
bootstrapping requires fewer assumptions
- if n< 30, CLT relies on the assumption that the scores are normally-distributed
- assumption may be incorrect, resulting in DSM that is incorrect
bootstrapping makes fuller use of the data
- every score in the sample is represented within the hypothetical pop used for generating the DSM
- CLT uses all scores to estimate Sy, then uses this value to define the standard deviation of DSM
- reducing all scores to a single summary statistic results in loss of information, potentially reducing the accuracy of the DSM
- potential error associated with the estimate of Sy-bar is compensated for by making the 95% CI wider through use of the t distribution
bootstrapping is easier to understand
- the CLT and the normal distribution involve complex math
- bootstrapping only requires calculating the mean, many times over
advantages of CLT
estimating the DSM using the CLT requires a single calculation
- bootstrapping requires thousands of computations
CLT is central to classical stats while bootstrapping is more recent
- bootstrapped methods haven’t been developed for all statistical analyses or software (not in Excel)
CLT results are consistent
- bootstrap analysis involves randomly sampling scores, so results will be different every time bootstrapping is repeated (difference will not be large)
bootstrapping vs. CLT
advantages of bootstrapping outweigh disadvantages
unlike the CLT, bootstrapping doesn't rely on the assumption of normality when n < 30, so it is less likely to generate incorrect results if that assumption is violated
skewed population distribution
if the population is heavily skewed, it is better to use bootstrapping than the CLT when n < 30
if the population is heavily skewed, the mean is not an appropriate choice of statistic (the median is a better measure of central tendency for skewed data)
it is generally preferred to transform the data to reduce/remove the skew, then calculate the DSM based on the transformed data
epi.conf(): 95% CI via the CLT
epi.conf(y, conf.level=0.95)
generates:
- sample mean
- standard error
- lower 95% CI boundary
- upper 95% CI boundary
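A usage sketch, assuming the epiR package (where epi.conf() is defined); ctype = "mean.single" is the documented option for a single-sample mean, spelled out here for clarity:
library(epiR)
set.seed(9)
y <- rnorm(100, mean = 10, sd = 2)    # made-up sample
epi.conf(y, ctype = "mean.single", conf.level = 0.95)
# returns the sample mean, standard error, and lower/upper 95% CI boundaries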