assumptions of the GLM Flashcards
CLT assumptions
when samples are large (n > 30), the distribution of sample means (DSM) will be approximately normal regardless of the distribution of scores in the underlying population
if n < 30 and the distribution of scores in the population does not fit the normal distribution, the DSM might not be normal
applying the CLT where n < 30 therefore requires the assumption of normality (a fundamental assumption of the GLM)
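A minimal simulation sketch of this (hypothetical data; rexp() gives a clearly skewed population):

```r
set.seed(42)

# 5000 sample means, each from a sample of n = 40 (> 30) drawn from a
# skewed (exponential) population
sample_means <- replicate(5000, mean(rexp(n = 40, rate = 1)))

# the DSM is approximately normal even though the population is skewed
hist(sample_means, breaks = 40)
```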
5 main assumptions of the GLM
- normality
- homogeneity of variance/ homoscedasticity
- linearity
- additivity
- independence
violating assumptions of the GLM (model)
linearity: fit a linear model where the relationship is not actually linear
additivity: fit an additive model where relationship is not actually additive
normality: the mean is not an appropriate measure of central tendency (given the distribution of residuals in the model) leading to statistical bias
- statistical bias: where the sample statistic systematically over- or underestimates the population parameter
violating assumptions of the GLM (error)
- if the assumptions are violated, the errors may not fit the assumed distribution of errors, making the method of significance testing inappropriate for the data and the p values unreliable
- deviation from normality, homogeneity of variance and homoscedasticity can all result in a mismatch between the actual sampling distribution (equivalent to the DSM) and the theoretical distribution
ex. the distribution of t statistics that would be calculated if H0 were true does not fit the theoretical t distribution
so the conversion of the test statistic to a p value may be incorrect
- assumption of independence is important for a slightly different reason
- violating the assumption of independence results in a sample that is not representative of the underlying population
- unreliable conclusions: inferential statistics is formalized guesswork, where we use a defined framework to make inferences about the population from the sample
- this defined framework contains defined assumptions that allow us to make these inferences
how to test the assumptions of the GLM
lm() fits a GLM to data (lm = linear model)
- useful in assessing assumptions of the GLM
- first use lm() to fit a linear model to two different datasets, as in the sketch below
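A minimal sketch, assuming simulated stand-ins for the two datasets (the real data are not shown in these notes); the object names lm_two_group and lm_continuous match those used later:

```r
set.seed(1)

# simulated stand-in datasets (assumed, for illustration only)
two_group <- data.frame(x = rep(c(0, 1), each = 25))
two_group$y <- 10 + 2 * two_group$x + rnorm(50)

continuous <- data.frame(x = runif(50, 0, 10))
continuous$y <- 2 + 0.5 * continuous$x + rnorm(50)

lm_two_group  <- lm(y ~ x, data = two_group)   # two-group predictor
lm_continuous <- lm(y ~ x, data = continuous)  # continuous predictor
```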
assessment of GLM assumptions generates results that we typically don’t show
why are the tests not perfect?
- several rely on subjective judgements based on visualizations of data
- several are underpowered (small samples, false negatives) or overpowered (large samples)
- in large samples, small and unimportant deviations from an assumption may achieve statistical significance
- these are true positives, but they don't represent problematic deviations
as.factor()
use to overwrite the two_group$x variable by converting the categorical variable to a factor (grouping variable)
ex. two_group$x <- as.factor(two_group$x)
plot()
quick visualization of data
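A small sketch combining the two functions, using the assumed two_group data from the sketch above:

```r
# x currently holds numeric group codes (0/1); convert it to a factor so
# R treats it as a grouping variable
two_group$x <- as.factor(two_group$x)

# with a factor on the x-axis, plot() draws a boxplot of y per group
plot(two_group$x, two_group$y)
```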
fitting linear model in R
using LSE:
lm(outcome ~ predictor, data)
outputs:
regression coefficients (can build GLM)
write results to an object; the output contains:
- coefficients: regression coefficients (b0, b1)
- fitted.values: predicted values of y (y-hat)
- residuals: values of error for each yi (yi - y-hat)
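Continuing the sketch from above, these components are accessed with $:

```r
lm_two_group$coefficients   # regression coefficients b0 (intercept) and b1 (slope)
lm_two_group$fitted.values  # predicted values (y-hat) for each observation
lm_two_group$residuals      # error for each observation: yi - y-hat
```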
inferential statistics
using the output from the lm() function, can run summary() and anova() to obtain results
can see that the results for the two-group analysis using anova() match those of a two-group t-test
ANOVA: F is the test statistic, where F = t^2 for the two-group case (see the sketch below)
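A brief sketch on the fitted object from earlier, including a cross-check of the F = t^2 relationship:

```r
summary(lm_two_group)  # coefficients, t statistics, p values, R-squared
anova(lm_two_group)    # ANOVA table with the F statistic

# cross-check that F = t^2 in the two-group case
t_val <- summary(lm_two_group)$coefficients[2, "t value"]
f_val <- anova(lm_two_group)$`F value`[1]
all.equal(f_val, t_val^2)  # TRUE
```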
lm() outputs in assumptions of the GLM
tests:
normality
linearity
homogeneity of variance and homoscedasticity
normality of residuals
GLM assumes residuals have normal distribution
residual distribution is generally different from distribution of scores
- can view the distribution of scores to understand why residuals deviate from normality
- ex. if the residuals have a positive skew and the y variable has a positive skew, it might be possible to correct the skew of the residuals by transforming the y variable, as in the sketch below
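A hypothetical sketch of that transformation idea (simulated, strictly positive y with positive skew; names are illustrative):

```r
set.seed(2)
skewed <- data.frame(x = runif(100, 0, 10))
skewed$y <- exp(0.2 * skewed$x + rnorm(100))  # y > 0, positively skewed

lm_raw <- lm(y ~ x, data = skewed)
lm_log <- lm(log(y) ~ x, data = skewed)       # transform the y variable

hist(lm_raw$residuals, breaks = 20)  # positively skewed residuals
hist(lm_log$residuals, breaks = 20)  # much closer to normal
```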
why do we assume normality of residuals?
least squares error method (LSE) assumes normality
- if residuals are normal, the mean will be an appropriate model
- if residuals are skewed, the mean is not an appropriate measure of central tendency
sig testing assumes normality of residuals
- residuals are used to build the sampling distribution of test statistics, based on the assumption of normality
visualizing residuals in R
- extract residuals from the lm output into a vector
ex. lm_two_group$residuals
- use hist() to visualize the distribution of residuals
ex. hist(lm_two_group$residuals, breaks = 20)
- to better see how it fits with normality, can replot using ggplot to add the normal curve (see the sketch below)
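A minimal ggplot2 sketch of that replot (assumes the ggplot2 package is installed; uses lm_two_group from earlier):

```r
library(ggplot2)

res <- lm_two_group$residuals
ggplot(data.frame(res = res), aes(x = res)) +
  # histogram on the density scale so the normal curve is comparable
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "grey80", colour = "black") +
  # overlay a normal curve using the residuals' mean and sd
  stat_function(fun = dnorm,
                args = list(mean = mean(res), sd = sd(res)),
                colour = "red")
```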
quantile-quantile (QQ) plots
allow us to use a straight line to judge the fit between two distributions
theoretical normal distribution is divided into quantiles
- quantile: subset of defined size
- values at the boundaries of each quantile are then extracted
observed distribution of residuals is divided into quantiles
- values at boundaries of each quantile are extracted
boundaries of the quantiles for the normal distribution (x) are plotted against the boundaries of quantiles for the observed distribution of residuals (y) forming a quantile-quantile (Q-Q) plot
- if the two distributions are identical, all points fall on the straight line where x = y (see the sketch below)
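A minimal sketch of that construction by hand, followed by R's built-in shortcut (residuals standardized so a perfect fit lies on x = y):

```r
res_z <- as.numeric(scale(lm_two_group$residuals))  # standardized residuals
probs <- ppoints(length(res_z))                     # quantile boundaries

plot(qnorm(probs),            # x: quantile boundaries of the theoretical normal
     quantile(res_z, probs),  # y: quantile boundaries of the observed residuals
     xlab = "theoretical quantiles", ylab = "observed quantiles")
abline(0, 1)                  # reference line where x = y

# built-in equivalent:
qqnorm(res_z)
qqline(res_z)
```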
qq plots from lm outputs
plot(lm_two_group, 2)
- the second argument (2) selects the Q-Q plot
significance testing for normality
H0: the distribution is normal
- a significant result means the observed distribution is significantly different from normality
should not rely on significance testing alone
- non-sig results may be false negatives
- sig results may not be important if large sample size
H0: the observed distribution is the same as the normal distribution
- if the test for normality is significant, the data do not fit the normal distribution
- if the test for normality is not significant, the data are consistent with the normal distribution
shapiro-wilk test for normality
shapiro.test(vector_of_residuals)
ex. shapiro.test(lm_continuous$residuals)
output: a W statistic and a p value
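A quick usage sketch on the continuous model from earlier:

```r
sw <- shapiro.test(lm_continuous$residuals)
sw           # prints the W statistic and the p value
sw$p.value   # p < .05 would indicate significant deviation from normality
```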