assumptions of the GLM Flashcards
CLT assumptions
when samples are large (n > 30), the distribution of sample means (DSM) will be approximately normal regardless of the distribution of scores in the underlying population
if n < 30 and the distribution of scores in the population does not fit the normal distribution, the DSM might not be normal
application of the CLT where n < 30 therefore requires the assumption of normality (a fundamental assumption of the GLM)
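ex. a minimal simulation sketch (not from the notes; the exponential population and the sample sizes are illustrative) showing the DSM becoming normal as n grows:
set.seed(1)
means_n5  <- replicate(10000, mean(rexp(5,  rate = 1)))   # small samples from a skewed population
means_n50 <- replicate(10000, mean(rexp(50, rate = 1)))   # larger samples from the same population
hist(means_n5,  breaks = 30)   # DSM still right-skewed when n < 30
hist(means_n50, breaks = 30)   # DSM approximately normal when n > 30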
5 main assumptions of the GLM
- normality
- homogeneity of variance/ homoscedasticity
- linearity
- additivity
- independence
violating assumption of the GLM (model)
linearity: fit a linear model where the relationship is not actually linear
additivity: fit an additive model where relationship is not actually additive
normality: the mean is not an appropriate measure of central tendency (given the distribution of residuals in the model) leading to statistical bias
- statistical bias: where the sample statistic systematically over or under estimates the population parameter
violating assumption of the GLM (error)
- if other assumptions are violated, the errors may not fit the assumed distribution of errors, making the method of significance testing inappropriate to the data and causing p values to be unreliable
- deviation from normality, homogeneity of variance and homoscedasticity can all result in a mismatch between the actual sampling distribution (equivalent to the DSM) and the theoretical distribution
ex. distribution of t statistics that would be calculated if H0 is true does not fit the theoretical t distribution
conversion of test statistic to p value may be incorrect
- assumption of independence is important for a slightly different reason
- violating the assumption of independence results in a sample that is not representative of the underlying pop
- unreliable conclusions: inferential statistics is formalized guesswork, where we use a defined framework to make inferences about the population from the sample
- the defined framework contains defined assumptions that allow us to make these inferences
how to test the assumptions of the GLM
lm() fits a GLM to data (lm = linear model)
- useful for assessing the assumptions of the GLM
- first use lm() to fit a linear model to two different datasets
assessment of GLM assumptions generates results that we typically don’t show
why are the tests not perfect?
- several rely on subjective judgements based on visualizations of the data
- several are underpowered (small samples, false negatives) or overpowered
- with large samples, small and unimportant deviations from an assumption may reach statistical significance
- these are true positives, but they don't represent problematic deviations
as.factor()
used to overwrite the two_group$x variable by converting the categorical variable to a factor (grouping variable)
ex. two_group$x <- as.factor(two_group$x)
plot()
quick visualization of data
fitting linear model in R
using LSE:
lm(outcome~predictor, data)
outputs:
regression coefficients (can build GLM)
write the results to an object; the output contains:
- coefficients: regression coefficients (b0, b1)
- fitted.values: predicted values of y (y-hat)
- residuals: error values for each yi (yi - y-hat)
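ex. a minimal sketch of fitting and inspecting a model (two_group, x and y are illustrative names consistent with the notes):
lm_two_group <- lm(y ~ x, data = two_group)   # fit the linear model by least squares
lm_two_group$coefficients    # regression coefficients b0 and b1
lm_two_group$fitted.values   # predicted values, y-hat
lm_two_group$residuals       # error values for each observation, yi - y-hat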
inferential statistics
using the output from the lm() function, we can run summary() and anova() to obtain results (see the sketch below)
the results for a two-group analysis using ANOVA match those from a two-group t test
ANOVA: F is the test statistic, where F = t^2
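ex. a short sketch (assuming the lm_two_group object from the sketch above):
summary(lm_two_group)   # regression coefficients, t statistics, p values
anova(lm_two_group)     # ANOVA table; for a two-group predictor, F = t^2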
lm() outputs in testing the assumptions of the GLM
tests:
normality
linearity
homogeneity of variance and homoscedasticity
normality of residuals
GLM assumes residuals have normal distribution
residual distribution is generally different from distribution of scores
- can view distribution of scores to understand why residuals deviate from normality
- ex. if residuals have a positive skew, and y variable has a positive skew, might be possible to correct the skew of residuals by transforming the y variable
why do we assume normality of residuals?
least squares error method (LSE) assumes normality
- if residuals are normal, the mean will be an appropriate model
- if residuals are skewed, mean is not appropriate measure of central tendency
sig testing assumes normality of residuals
- residuals are used to build the sampling distribution of test statistics, based on the assumption of normality
visualizing residuals in R
- extract residuals from lm output into a vector
- ex. lm_two_group$residuals
- use hist() to visualize the distribution of residuals
- ex. hist(lm_two_group$residuals, breaks=20)
- to better see how the distribution fits with normality, can replot using ggplot and add the normal curve (see the sketch below)
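ex. a ggplot2 sketch overlaying a normal curve on the residual histogram (a minimal example; the lm_two_group object is assumed to exist):
library(ggplot2)
resid_df <- data.frame(resid = lm_two_group$residuals)
ggplot(resid_df, aes(x = resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +   # density scale so the curve is comparable
  stat_function(fun = dnorm,
                args = list(mean = mean(resid_df$resid), sd = sd(resid_df$resid)))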
quantile-quantile (QQ) plots
allow us to use a straight line to judge how well two distributions match
theoretical normal distribution is divided into quantiles
- quantile: subset of defined size
- values at the boundaries of each quantile are then extracted
observed distribution of residuals is divided into quantiles
- values at boundaries of each quantile are extracted
boundaries of the quantiles for the normal distribution (x) are plotted against the boundaries of quantiles for the observed distribution of residuals (y) forming a quantile-quantile (Q-Q) plot
- if the two distributions are identical, all points will fall on a straight line where x = y
qq plots from lm outputs
plot(lm_two_group, 2)
- the 2 (the which argument) selects the Q-Q plot
significance testing for normality
H0: distribution is normal
- sig result means observed distribution is sig different from normality
not only rely on sig testing
- non-sig results may be false negatives
- sig results may not be important if large sample size
H0: observed distribution is the same as the normal distribution
- if the test for normality is significant, the data deviate significantly from the normal distribution
- if the test for normality is not significant, we have no evidence that the data deviate from normality
shapiro-wilk test for normality
shapiro.test(vector_of_residuals)
ex. shapiro.test(lm_continuous$residuals)
output: get a p value
homogeneity of variance
if x is categorical (forms groups), the assumption is called homogeneity of variance
violation of this assumption means there is heterogeneity of variance
homoscedasticity
if x is continuous, the assumption is homoscedasticity
violation of this means there's heteroscedasticity (a cone shape in the residuals, which get larger at higher predicted values)
violating homogeneity of variance and homoscedasticity
heterogeneity of variance/heteroscedasticity is characterized by having larger residuals for larger (or smaller) values of y-hat
- residuals may still be symmetrical, in which case the regression coefficients will remain unbiased
- the model may still be valid
estimates of population variance may be inaccurate, as estimates generated from sample data will vary depending on value of y-hat
- if estimates of population variance are inaccurate, sampling distributions may be inaccurate, creating error in the estimation of the p value
assumption of homogeneity of variance and independent t test
if variance estimates are similar for the two groups, generate a single estimate of the population variance: the pooled variance
if variance estimates are different between the two groups, the pooled variance is a poor estimate of the variance in either population
- use Welch's t test, which doesn't assume homogeneity of variance
for two-sample independent-groups data, deviation from homogeneity of variance can be tolerated by simply selecting Welch's t test (see the sketch below)
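ex. a sketch of running Welch's t test in R (two_group is an illustrative dataset name; t.test() uses var.equal = FALSE, i.e. Welch's test, by default):
t.test(y ~ x, data = two_group)                     # Welch's t test (does not assume homogeneity of variance)
t.test(y ~ x, data = two_group, var.equal = TRUE)   # Student's t test (assumes homogeneity of variance)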
Levene’s test for homogeneity of variance
H0: there is homogeneity of variance between groups (no difference in residuals between groups)
if significant: there is a difference in the variance/size of residuals between groups, i.e. heterogeneity of variance
- fit a linear model to calc residuals
- convert residuals to absolute residuals
- uses ANOVA (grouped data >2)
can be underpowered (small n) and overpowered (large n)
- important to consider the results from sig testing, and visualize data when making a judgement
- can visualize heterogeneity of variance using pred-resid plots
levene’s test in R
leveneTest(y ~ x, two_group, center = "mean")
default center is median
output: p value
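ex. a fuller sketch (leveneTest() comes from the car package; x is assumed to be a factor/grouping variable):
library(car)
leveneTest(y ~ x, data = two_group, center = "mean")   # Levene's test centred on the group means
leveneTest(y ~ x, data = two_group)                    # default: centred on the group medians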
if we encounter significant heterogeneity of variance
if using two group independent data: use Welch’s t
may be able to correct heterogeneity using data transformation
unless heterogeneity is dramatic, we can often ignore it — ANOVA is reasonably tolerant to deviations from homogeneity of variance
homoscedasticity
assumption that the spread of residuals does not change as a function of y-hat
- refers to datasets where x is continuous
to determine whether residuals vary as a function of y-hat, plot the predicted values of y (x axis) against the residuals (y axis)
- pred-resid plot
vertical distance from 0 shows the magnitude of each residual
- under homoscedasticity, there is no clear change in the residuals as a function of y-hat
heteroscedasticity
the spread of points around the line of best fit increases as y-hat increases
grouped data
pred-resid plot
useful when there are multiple groups, as the groups are ordered by increasing group mean
- easier to identify whether variance increases as a function of y-hat
generating pred-resid plots in R
plot(lm_continuous, 1)
the 1 (the which argument) selects the pred-resid plot
loess method
fits the line of best fit locally: takes subsets of neighbouring data points and draws a line of best fit through each local subset
- the resulting line often looks wavy
zpred-zresid plots
common variant of pred-resid plot converts pred and resid value to z scores, standardizing the scales
z= 0 represents the mean
z= 1 represents the score at one standard deviation above the mean
useful for identifying outliers
- any score with a residual more than 3 standard deviations from the mean
- with zpred-zresid plots, these scores would have residuals <-3 or >3
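ex. a minimal sketch of a zpred-zresid plot using scale() to convert to z scores (the lm_continuous object is assumed to exist):
zpred  <- scale(lm_continuous$fitted.values)   # standardized predicted values
zresid <- scale(lm_continuous$residuals)       # standardized residuals
plot(zpred, zresid, xlab = "zpred", ylab = "zresid")
abline(h = c(-3, 0, 3), lty = 2)               # residuals beyond +/- 3 flag potential outliers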
assumption of linearity
assess linearity by visualizing the data
can add a line of best fit using the geom_smooth() function (see the sketch below)
- to specify a linear model, use the argument method = "lm"
- to fit a curve, use the argument method = "loess"
- geom_smooth() adds a 95% CI around the line of best fit by default; remove it with:
- se = FALSE
- adjust the confidence level with level = 0.95
- colour the CI band with fill
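ex. a ggplot2 sketch of these options (continuous, x and y are illustrative names):
library(ggplot2)
ggplot(continuous, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, level = 0.95, fill = "grey70")   # straight line with a 95% CI band
ggplot(continuous, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)                              # fitted curve, CI removed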
linearity and pred-resid plots
deviation from linearity may already be clear from the scatterplot
- on a pred-resid plot, nonlinear relationships are even more obvious
- if the relationship were linear, the line of best fit on the pred-resid plot would approximate a horizontal line
correcting deviations from linearity
possible to linearize the relationship between variables by data transformation
if relationship can’t be linearized, GLM can’t be applied
assumption of additivity
only applies to models with multiple predictor variables
assumes that the effects caused by one predictor variable are simply added to the effects caused by a second predictor variable
ex. measuring depression in humans (outcome); some participants experienced early life stress and some didn't (predictor 1); some participants have a genotype that offers protection against stress, others have a susceptible genotype (predictor 2); we examine the combined effects of stress and genotype on depression
spotting additivity in a graph
if the lines are parallel, the effect of genotype is the same at every level of stress and doesn't depend on stress (additive)
if the lines are not parallel, the model is not additive and there is an interaction: the effect of genotype changes depending on stress (see the sketch below)
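ex. a sketch contrasting an additive model with an interaction model (depression, stress, genotype and dep_data are illustrative names based on the example above):
lm_additive    <- lm(depression ~ stress + genotype, data = dep_data)   # effects of the two predictors simply add
lm_interaction <- lm(depression ~ stress * genotype, data = dep_data)   # allows the effect of genotype to depend on stress
anova(lm_additive, lm_interaction)   # tests whether the interaction term improves the fit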
assumption of independence
will be met if the value of one score is unaffected by values of other scores for the same variable
- every score for a variable is independent of every other score
- if score from one individual is affected by another individual, there’s a lack of independence
ex. light source on microscope is dying and fluorescence scores get progressively lower
avoiding this is largely achieved through careful experimental design
testing for independence
plotting residuals against the order in which the scores were collected
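ex. a minimal sketch (assumes the rows of the data are stored in the order they were collected and that lm_out exists):
plot(seq_along(lm_out$residuals), lm_out$residuals,
     xlab = "order of data collection", ylab = "residual")
abline(h = 0, lty = 2)   # a drift or trend across the x axis suggests a lack of independence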
data transformation
mathematical manipulation of a variable
deviations from normality, homogeneity of variance/homoscedasticity and linearity can be corrected
but fixing a deviation from one assumption can cause a deviation from another assumption
prioritizing assumptions
- linearity is most important
- if data deviate from linearity, we are fitting the wrong model and every part of the analysis will be incorrect
- normality is second
- if data deviate from normality, estimates of regression coefficients may be biased, and the distribution of test statistics may deviate from the theoretical sampling distribution
- deviations from the theoretical sampling distribution can be mitigated if the sample size is large or bootstrapping is used (which doesn't assume normality)
- heterogeneity of variance/heteroscedasticity is least important
- the distribution of test statistics may deviate from the theoretical sampling distribution
- this is less problematic if we have a large n or use bootstrapping
which data transformation is appropriate?
identifying the appropriate transformation can be difficult and may require trial and error (see the sketch after the lists below)
use distribution of residuals as a guide
distribution of x and y scores can also be useful in determining which variable may be causing the residuals to deviate from normality
transform to remove positive skew
- square root
- cube root (more extreme skew)
- log2(y)
- log10(y) [most extreme]
transform to remove negative skew
- square
- cube
- 2^y
- 10^y
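ex. a sketch of applying these transformations in R (data$y is an illustrative variable; the log transforms require y > 0):
y_sqrt  <- sqrt(data$y)    # positive skew, mild
y_cbrt  <- data$y^(1/3)    # positive skew, more extreme
y_log2  <- log2(data$y)    # positive skew, strong
y_log10 <- log10(data$y)   # positive skew, most extreme
y_sq    <- data$y^2        # negative skew, mild
y_cube  <- data$y^3        # negative skew, more extreme
y_exp2  <- 2^data$y        # negative skew, strong
y_exp10 <- 10^data$y       # negative skew, most extreme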
testing assumptions: summary code
data$y <- data$y^2
- squaring the y data
fit linear model: lm(y~x, data=data)
distribution of residuals: hist(lm_out$residuals)
test for normality: shapiro.test(lm_out$residuals)
distribution of x scores: hist(data$x)
distribution of y scores: hist(data$y)
Q-Q plots: plot(lm_out, 2)
pred-resid plots: plot(lm_out, 1)
test for homogeneity of variance (grouped data only): car::leveneTest(y~x, data=data)
transform x or y
if x is continuous and relationship between x and y is nonlinear, you can transform x or y
if the residuals deviate from normality and the scores for one of the variables deviate in a similar way, that variable is a candidate for transformation
heteroscedasticity: transformation of y is most effective
arguments for transformation
if data doesn’t fit assumptions of GLM, conclusions may not be valid
for some biological relationships, transformation makes sense
ex. concentration gradient from a source follows a cube root distribution
arguments against transformation
if transform a measured variable, meaning of that variable becomes less intuitive
data transformation may not be able to fix everything
- fixing one deviation may create a deviation from another assumption
data transformation can be used unscrupulously
- p-hacking (avoid by not performing hypothesis testing on the data until after determining the best way to transform the data)