assumptions of the GLM Flashcards

1
Q

CLT assumptions

A

when samples are large (n > 30), the distribution of sample means (DSM) will be approximately normal regardless of the distribution of scores in the underlying population

if n < 30 and the distribution of scores in the population does not fit the normal distribution, the DSM might not be normal

applying the CLT where n < 30 therefore requires the assumption of normality (a fundamental assumption of the GLM)
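a minimal R sketch of this idea (illustrative only; the skewed population here is an arbitrary exponential distribution, not from the card):
sample_means <- replicate(10000, mean(rexp(n = 30, rate = 1)))  # 10,000 samples of n = 30 from a non-normal population
hist(sample_means, breaks = 40)  # the distribution of sample means (DSM) is approximately normal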

2
Q

5 main assumptions of the GLM

A
  1. normality
  2. homogeneity of variance/ homoscedasticity
  3. linearity
  4. additivity
  5. independence
3
Q

violating assumption of the GLM (model)

A

linearity: fitting a linear model where the relationship is not actually linear

additivity: fitting an additive model where the relationship is not actually additive

normality: the mean is not an appropriate measure of central tendency (given the distribution of residuals in the model), leading to statistical bias
- statistical bias: where the sample statistic systematically over- or under-estimates the population parameter

4
Q

violating assumption of the GLM (error)

A
  1. if other assumptions are incorrect, the errors may not fit the assumed distribution of errors, making the method of significance testing inappropriate to the data and causing p values to be unreliable
    - deviation from normality, homogeneity of variance and homoscedasticity can all result in a mismatch between the actual sampling distribution (equivalent to the DSM) and the theoretical distribution

ex. the distribution of t statistics that would be calculated if H0 were true does not fit the theoretical t distribution
the conversion of the test statistic to a p value may therefore be incorrect

  2. the assumption of independence is important for a slightly different reason
    - violating the assumption of independence results in a sample that is not representative of the underlying population
    - leads to unreliable conclusions
  3. inferential statistics is formalized guesswork, where we use a defined framework to make inferences about the population from the sample
    - the defined framework contains defined assumptions that allow us to make these inferences
5
Q

how to test the assumptions of the GLM

A

lm() fits a GLM to the data (lm = linear model)
- useful in assessing the assumptions of the GLM
- first use lm() to fit a linear model to two different datasets

assessment of GLM assumptions generates results that we typically don’t show

6
Q

why are the tests not perfect?

A
  1. several rely on subjective judgements based on visualizations of the data
  2. several are underpowered (small samples, risking false negatives) or overpowered
  3. in large samples, small and unimportant deviations from an assumption may achieve statistical significance
    - true positives, but they don't represent problematic deviations
7
Q

as.factor()

A

used to convert a categorical variable to a factor (grouping variable), e.g. to overwrite the two_group$x variable

ex. as.factor(two_group$x)
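a minimal sketch, assuming a data frame called two_group with a grouping column x (as in the example above):
two_group$x <- as.factor(two_group$x)  # overwrite x so R treats it as a grouping factor
class(two_group$x)                     # now "factor"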

8
Q

plot()

A

quick visualization of data

9
Q

fitting linear model in R

A

using least squares estimation (LSE):
lm(outcome ~ predictor, data)

outputs:
regression coefficients (can build the GLM)

write results to an output object (sketch below):
- coefficients: regression coefficients (b0, b1)
- fitted.values: predicted values (y-hat)
- residuals: error values for each score (yi - y-hat)
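a minimal sketch, assuming a data frame two_group with an outcome y and a predictor x (placeholder names):
lm_two_group <- lm(y ~ x, data = two_group)  # fit the linear model by least squares
lm_two_group$coefficients   # regression coefficients (b0, b1)
lm_two_group$fitted.values  # predicted values (y-hat)
lm_two_group$residuals      # errors for each score (yi - y-hat)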

10
Q

inferential statistics

A

using the output from the lm() function, we can run summary() and anova() to obtain results

the results for a two-group analysis using anova() match those from a two-group (independent t) test

ANOVA: F is the test statistic; for the two-group case, F = t^2
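a minimal sketch, assuming the lm_two_group object fitted above:
summary(lm_two_group)  # regression coefficients, t statistics, p values
anova(lm_two_group)    # ANOVA table; for a two-group predictor, F = t^2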

11
Q

lm() outputs in assumptions of the GLM

A

tests:
normality
linearity
homogeneity of variance and homoscedasticity

12
Q

normality of residuals

A

GLM assumes residuals have normal distribution

the residual distribution is generally different from the distribution of scores
- can view the distribution of scores to understand why residuals deviate from normality
- ex. if the residuals have a positive skew and the y variable has a positive skew, it might be possible to correct the skew of the residuals by transforming the y variable

13
Q

why do we assume normality of residuals?

A

least squares error method (LSE) assumes normality
- if residuals are normal, the mean will be an appropriate model
- if residuals are skewed, the mean is not an appropriate measure of central tendency

sig testing assumes normality of residuals
- residuals are used to build the sampling distribution of test statistics, based on the assumption of normality

14
Q

visualizing residuals in R

A
  1. extract residuals from the lm output into a vector
    - ex. lm_two_group$residuals
  2. use hist() to visualize the distribution of residuals
    - ex. hist(lm_two_group$residuals, breaks=20)
  3. to better see how the residuals fit with normality, replot using ggplot and add the normal curve (sketch below)
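a possible ggplot sketch for step 3, assuming the lm_two_group object from earlier (bin count and styling are arbitrary):
library(ggplot2)
resid_df <- data.frame(resid = lm_two_group$residuals)
ggplot(resid_df, aes(x = resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +  # histogram on the density scale
  stat_function(fun = dnorm,                                 # overlay the normal curve
                args = list(mean = mean(resid_df$resid), sd = sd(resid_df$resid)))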
15
Q

quantile-quantile (QQ) plots

A

allow us to use a straight line to judge the fit between two distributions

theoretical normal distribution is divided into quantiles
- quantile: subset of defined size
- values at the boundaries of each quantile are then extracted

observed distribution of residuals is divided into quantiles
- values at boundaries of each quantile are extracted

the boundaries of the quantiles for the normal distribution (x) are plotted against the boundaries of the quantiles for the observed distribution of residuals (y), forming a quantile-quantile (Q-Q) plot
- if the two distributions are identical, all points will fall on the straight line where x = y
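a minimal base-R sketch of this idea, assuming the lm_two_group residuals from earlier:
qqnorm(lm_two_group$residuals)  # observed residual quantiles vs theoretical normal quantiles
qqline(lm_two_group$residuals)  # reference line; points close to the line indicate normality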

16
Q

qq plots from lm outputs

A

plot(lm_two_group, 2)
- the argument 2 selects the Q-Q plot

17
Q

significance testing for normality

A

H0: the distribution is normal (i.e. the observed distribution is the same as the normal distribution)
- a significant result means the observed distribution is significantly different from normality
- a non-significant result means the data are consistent with normality

do not rely on significance testing alone
- non-significant results may be false negatives
- significant results may not be important if the sample size is large

18
Q

shapiro-wilk test for normality

A

shapiro.test(vector_of_residuals)

ex. shapiro.test(lm_continuous$residuals)

output: get a p value

19
Q

homogeneity of variance

A

if x is categorical (forms groups), the assumption is called homogeneity of variance

violation of this assumption means there is heterogeneity of variance

20
Q

homoscedasticity

A

if x is continuous, the assumption is homoscedasticity

violation of this means there’s heteroscedasticity (cone shape of residuals, gets larger the higher the values)

21
Q

violating homogeneity of variance and homoscedasticity

A

heterogeneity of variance/heteroscedasticity is characterized by having larger residuals for larger (or smaller) values of y-hat
- residuals may still be symmetrical, in which case the regression coefficients will remain unbiased
- the model may still be valid

estimates of the population variance may be inaccurate, as estimates generated from sample data will vary depending on the value of y-hat
- if estimates of the population variance are inaccurate, sampling distributions may be inaccurate, creating error in the estimation of the p value

22
Q

assumption of homogeneity of variance and independent t test

A

if variance estimates are similar for the two groups, generate a single estimate of the population variance (pooled variance)

if variance estimates are different between the two groups, the pooled variance is a poor estimate of the variance in either population
- use Welch's t test, which doesn't assume homogeneity of variance

for two-sample independent-groups data, deviation from homogeneity of variance can be tolerated by simply selecting Welch's t (sketch below)
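a minimal sketch, assuming a data frame two_group with outcome y and grouping factor x; note that t.test() uses Welch's t by default:
t.test(y ~ x, data = two_group)                    # Welch's t (homogeneity of variance not assumed)
t.test(y ~ x, data = two_group, var.equal = TRUE)  # Student's t (pooled variance)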

23
Q

Levene’s test for homogeneity of variance

A

H0: there is homogeneity of variance between groups (no difference in the size of residuals between groups)

if significant: there is a difference in variance (size of residuals) between groups, i.e. heterogeneity of variance

  1. fit a linear model to calculate residuals
  2. convert the residuals to absolute residuals
  3. run an ANOVA on the absolute residuals (works for grouped data with >2 groups) (sketch below)

can be underpowered (small n) and overpowered (large n)
- important to consider the results from significance testing and to visualize the data when making a judgement
- can visualize heterogeneity of variance using pred-resid plots
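a minimal sketch of these three steps, assuming a data frame two_group with outcome y and grouping factor x; this reproduces a mean-centred Levene's test (in practice car::leveneTest() is used, as on the next card):
lm_fit  <- lm(y ~ x, data = two_group)  # 1. fit a linear model to calculate residuals
abs_res <- abs(lm_fit$residuals)        # 2. convert residuals to absolute residuals
anova(lm(abs_res ~ two_group$x))        # 3. ANOVA on the absolute residuals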

24
Q

levene’s test in R

A

leveneTest(y ~ x, two_group, center = "mean") (from the car package)
- the default center is the median
- output: a p value

25
Q

if we encounter significant heterogeneity of variance

A

if using two group independent data: use Welch’s t

may be able to correct heterogeneity using data transformation

unless heterogeneity is dramatic, we can often ignore it — ANOVA is reasonably tolerant to deviations from homogeneity of variance

26
Q

homoscedasticity

A

assumption that the size of the residuals does not change as a function of y-hat
- refers to datasets where x is continuous

to determine whether residuals vary as a function of y-hat, simply plot the predicted values of y (x axis) against the residuals (y axis)
- pred-resid plot (sketch below)

the vertical height from 0 shows the magnitude of each residual
- in homoscedastic data there is no clear change in the residuals as a function of x
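a minimal sketch of a pred-resid plot built by hand, assuming the lm_continuous object used elsewhere in these cards:
plot(lm_continuous$fitted.values, lm_continuous$residuals,
     xlab = "predicted (y-hat)", ylab = "residual")
abline(h = 0)  # residuals should scatter evenly around zero at all values of y-hat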

27
Q

heteroscedasticity

A

the spread of points around the line of best fit increases as y-hat increases

28
Q

grouped data

A

pred-resid plot

useful when there are multiple groups, as the plot orders groups by increasing group mean
- makes it easier to identify whether variance increases as a function of y-hat

29
Q

generating pred-resid plots in R

A

plot(lm_continuous, 1)
- the argument 1 selects the pred-resid plot (residuals vs fitted values)

30
Q

loess method

A

a line of best fit produced by taking local subsets of the data points and fitting a line through each subset
- often looks wavy

31
Q

zpred-zresid plots

A

a common variant of the pred-resid plot converts the predicted and residual values to z scores, standardizing the scales

z = 0 represents the mean
z = 1 represents a score one standard deviation above the mean

useful for identifying outliers
- an outlier here is any score with a residual more than 3 standard deviations from the mean
- on a zpred-zresid plot, these scores have residuals < -3 or > 3 (sketch below)
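a minimal sketch, assuming the lm_continuous object from earlier; scale() converts values to z scores:
zpred  <- scale(lm_continuous$fitted.values)  # standardized predicted values
zresid <- scale(lm_continuous$residuals)      # standardized residuals
plot(zpred, zresid, xlab = "zpred", ylab = "zresid")
abline(h = c(-3, 3), lty = 2)  # residuals beyond +/-3 flag potential outliers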

32
Q

assumption of linearity

A

assess linearity by visualizing the data

can add a line of best fit using the geom_smooth() function (sketch below)
- to specify a linear model, use the argument method = "lm"
- to fit a curve, use the argument method = "loess"
- geom_smooth() adds a 95% CI for the line of best fit by default; remove it with se = FALSE
- the CI level can be adjusted with level = 0.95
- the CI band can be colourized with fill
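a possible sketch, assuming a data frame called continuous with columns x and y (placeholder names):
library(ggplot2)
ggplot(continuous, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight line of best fit, no CI band
# or fit a local curve with a CI band and a coloured fill:
#   geom_smooth(method = "loess", level = 0.95, fill = "grey70")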

33
Q

linearity and pred-resid plots

A

deviation from linearity may be clear from a scatterplot, but nonlinear relationships are often more obvious on a pred-resid plot
- if the relationship were linear, the line of best fit on the pred-resid plot would approximate a horizontal line

34
Q

correcting deviations from linearity

A

possible to linearize the relationship between variables by data transformation

if relationship can’t be linearized, GLM can’t be applied

35
Q

assumption of additivity

A

only applies to models with multiple predictor variables

assumes that the effects caused by one predictor variable are simply added to the effects caused by a second predictor variable

ex. measuring depression in humans (outcome); some participants experienced early life stress and some didn't (predictor 1); some participants have a genotype that offers protection against stress, others have a susceptible genotype (predictor 2); examining the combined effects of stress and genotype on depression (sketch below)
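a minimal sketch of the two competing models for this example, with hypothetical variable and data frame names (dep_data, depression, stress, genotype):
lm(depression ~ stress + genotype, data = dep_data)  # additive model: the two effects simply add
lm(depression ~ stress * genotype, data = dep_data)  # interaction model: the effect of genotype can depend on stress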

36
Q

additivity spotting in graph

A

if the lines are parallel, the effect of genotype is the same at each level of stress and doesn't depend on stress (additive)

if the effects are not additive there is an interaction: the effect of genotype changes depending on stress

37
Q

assumption of independence

A

will be met if the value of one score is unaffected by values of other scores for the same variable
- every score for a variable is independent of every other score
- if score from one individual is affected by another individual, there’s a lack of independence

ex. light source on microscope is dying and fluorescence scores get progressively lower

preventing this is largely achieved through careful experimental design

38
Q

testing for independence

A

plotting residuals against the order in which the scores were collected
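a minimal sketch, assuming the residuals in lm_out are stored in the order the scores were collected (lm_out as in the summary-code card below):
plot(seq_along(lm_out$residuals), lm_out$residuals,
     xlab = "order of collection", ylab = "residual")
abline(h = 0)  # a drift or trend across the x axis suggests non-independence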

39
Q

data transformation

A

mathematical manipulation of a variable

deviations from normality, homogeneity of variance/homoscedasticity and linearity can be corrected

but fixing a deviation from one assumption can cause a deviation from another assumption

40
Q

prioritizing assumptions

A
  1. linearity is most important
    - if the data deviate from linearity, we are fitting the wrong model and every part of the analysis will be incorrect
  2. normality is second
    - if the data deviate from normality, estimates of the regression coefficients may be biased, and the distribution of test statistics may deviate from the theoretical sampling distribution
    - deviations from the theoretical sampling distribution can be mitigated if the sample size is large or bootstrapping is used (bootstrapping doesn't assume normality)
  3. heterogeneity of variance/heteroscedasticity is least important
    - the distribution of test statistics may deviate from the theoretical sampling distribution
    - less problematic if we have a large n or bootstrap
41
Q

which data transformation is appropriate?

A

identifying the appropriate transformation can be difficult; it is often trial and error

use distribution of residuals as a guide

distribution of x and y scores can also be useful in determining which variable may be causing the residuals to deviate from normality

42
Q

transform to remove positive skew

A
  1. square root
  2. cube root (more extreme skew)
  3. log2(y)
  4. log10(y) [most extreme]
43
Q

transform to remove negative skew

A
  1. square
  2. cube
  3. 2^y
  4. 10^y
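a minimal sketch of the transformations on this card and the previous one, applied to a hypothetical data$y column (log transforms require y > 0):
# positive skew (increasingly strong):
data$y_sqrt  <- sqrt(data$y)
data$y_cbrt  <- data$y^(1/3)
data$y_log2  <- log2(data$y)
data$y_log10 <- log10(data$y)
# negative skew (increasingly strong):
data$y_sq    <- data$y^2
data$y_cube  <- data$y^3
data$y_exp2  <- 2^data$y
data$y_exp10 <- 10^data$y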
44
Q

testing assumptions: summary code

A

data$y <- data$y^2
- squaring the y data

fit linear model: lm(y~x, data=data)
distribution of residuals: hist(lm_out$residuals)
test for normality: shapiro.test(lm_out$residuals)
distribution of x scores: hist(data$x)
distribution of y scores: hist(data$y)
Q-Q plots: plot(lm_out, 2)
pred-resid plots: plot(lm_out, 1)
test for homogeneity of variance (grouped data only): car::leveneTest(y~x, data=data)

45
Q

transform x or y

A

if x is continuous and relationship between x and y is nonlinear, you can transform x or y

if the residuals deviate from normality and the scores for one of the variables deviate in a similar way, that variable is a candidate for transformation

heteroscedasticity: transformation of y is most effective

46
Q

arguments for transformation

A

if data doesn’t fit assumptions of GLM, conclusions may not be valid

for some biological relationships, a transformation makes sense
ex. concentration gradient from a source follows a cube root distribution

47
Q

arguments against transformation

A

if we transform a measured variable, the meaning of that variable becomes less intuitive

data transformation may not be able to fix everything
- it may create a deviation from another assumption

data transformation can be used unscrupulously
- p-hacking (avoid by not performing hypothesis testing on the data until after determining the best way to transform the data)