General linear model Flashcards
General linear model
all about expressing the relationship between variables
For example…
What is the relation between a test score and the grouping variable
What is the relation between pre- and post-test measures
GLM Stat tests
T-tests
ANOVA, ANCOVA, MANOVA, MANCOVA
Correlations (pearson and spearman)
Linear regressions and multiple regressions
Goodness of fit test/chi squares
Machine learning and prediction models
GLM equation
Using this equation, we can predict the outcome variable Ŷi for participant i, as long as we have the X value for participant i
Yi = b0 + b1xi + ei, and the prediction itself is Ŷi = b0 + b1xi
Ŷ
Ŷ is the estimate of the observed outcome (Y), i.e. it represents the estimated DV
b0
b0 is the intercept of the regression line (where it crosses the y-axis)
b1
b1 is the slope of the regression line
xi
xi is the observation of the predictor (X), i.e. it represents the IV
ei
ei is the residual error term, which is the difference between the observed and predicted Y
i
i stands for the participant whose data is being used
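As a quick illustration with made-up numbers (not from the course material), a prediction for one participant in R:
b0 <- 2          # made-up intercept
b1 <- 0.5        # made-up slope
x_i <- 4         # made-up predictor value for participant i
y_hat_i <- b0 + b1 * x_i   # predicted score: 2 + 0.5 * 4 = 4
y_hat_i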
Correlation
a standardized measure of the linear relation between two variables
X and Y are interchangeable
r-value
The correlation is represented by an r-value that can take any value between -1.00 and +1.00
The magnitude of the value represents the strength of the correlation; the sign (positive or negative) represents the direction
An absolute value of 1 means the points fall exactly on a line, while a smaller value like 0.2 means the points are only loosely scattered around a line
Positive means that when the IV increases so does the DV (and the opposite for negative)
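A minimal R sketch with simulated data, illustrating how r reflects strength and direction:
set.seed(1)
x <- rnorm(100)
y_strong <- x + rnorm(100, sd = 0.3)   # points cluster tightly around a line
y_weak   <- x + rnorm(100, sd = 3)     # points scatter loosely
cor(x, y_strong)    # close to +1
cor(x, y_weak)      # small positive value
cor(x, -y_strong)   # close to -1: direction flips, strength stays the same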
Common correlation interpretations
For absolute correlation values (positive or negative), common interpretation:
= 0.00 no relation, entirely random
0.01 to 0.30 weak
0.30 to 0.50 moderate
0.50 to 0.99 strong
= 1.00 perfect, identical
But these rules of thumb are arbitrary, and interpretation should be based on the context of the study
Anscombe quartet
Idea that graphed data can look totally different but have the same summary statistics
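R ships with the anscombe dataset, so this is easy to see for yourself: all four x/y pairs have nearly identical correlations and regression lines but look completely different when plotted.
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
plot(anscombe$x1, anscombe$y1)   # compare with plot(anscombe$x4, anscombe$y4)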
Ordinary least squares regression
The general linear model tries to create or fit a line (line of best fit) through the datapoints that is as close as possible to every datapoint
This is done by minimizing the squared distance between the line and each point, which is why it is called “Ordinary least squares regression”
By default its estimates come out unstandardized, i.e. using the units of the original variables
Coefficients for ordinary least squares regression
For ordinary least squares regression: the estimated regression coefficients (b0 and b1) are those that minimise the sum of the squared residuals.
Take the distance between a datapoint and the fitted line
Square that distance
Repeat for all datapoints, and sum up all these squared distances
Find the line for which this combined sum is smallest.
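A minimal R sketch with simulated data, showing that the closed-form OLS estimates match what lm() returns:
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
b1 <- cov(x, y) / var(x)        # slope that minimises the squared residuals
b0 <- mean(y) - b1 * mean(x)    # intercept
sum((y - (b0 + b1 * x))^2)      # the minimised sum of squared residuals
coef(lm(y ~ x))                 # lm() gives the same b0 and b1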
Unstandardized
Unstandardized coefficients are based on raw data and one unit changes in the IV
Unstandardized estimates are more intuitive, but can’t easily be compared across different kinds of measurements
“For every 1 min difference in average exercise per day, there’s a 0.017 difference in BMI” (unstandardized)
Standardized
Standardized coefficients are based on data expressed in standard deviation units (z-scores)
Standardized estimates are less concrete, but can be compared across different measurements; can use the correlation “rules of thumb” we discussed above
“For every 1SD difference in average exercise per day, there’s a 0.176 SD difference in BMI” (standardized)
OR
“Average minutes of exercise per day explains 3% of the variance in BMI” (standardized R2)
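A minimal R sketch, assuming hypothetical columns bmi and exercise_min in dataset; scaling both variables gives the standardized coefficients:
unstd <- lm(bmi ~ exercise_min, data = dataset)
std   <- lm(scale(bmi) ~ scale(exercise_min), data = dataset)
coef(unstd)               # change in BMI per 1 extra minute of exercise
coef(std)                 # change in SDs of BMI per 1 SD of exercise
summary(unstd)$r.squared  # proportion of variance explained (R2)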
Multiple/linear regression model formula
When including several predictors: Need Multiple regression model
Yi = b0 + b1x1i + b2x2i + ei
Can go on adding as many predictors as makes sense
Yi = b0 + b1x1i + b2x2i + b3x3i + … + ei
But instead of a single line, with two predictors we are now fitting a plane in 3D space (a hyperplane with more predictors) that still minimizes the distance to the observed data points
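A minimal R sketch with simulated data and hypothetical names, fitting such a plane and predicting from it:
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)    # b0, b1, b2 together define the fitted plane
predict(fit, newdata = data.frame(x1 = 1, x2 = 2))   # a point on that plane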
Controlling/Adjusting/Partialing out in Linear Regressions
All refer to the same process of when having multiple variables in your regression
If you control outcome Y for predictor variable X, then check the association between variable Z and outcome Y, you’re asking: “what would be the Y~Z relation in a sample where everyone had the average level of X?”
This is not MAGIC
Predictions from regression models, even if “controlled” don’t suddenly make associations causal
All depends on where your data came from
If they’re from a randomised experiment, causal conclusions might be justified
If they’re from an observational study, probably not
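A minimal R sketch with simulated data, showing what adjusting for a variable does to a coefficient:
set.seed(1)
x <- rnorm(200)              # the variable being controlled for
z <- 0.6 * x + rnorm(200)    # predictor of interest, related to x
y <- 0.8 * x + rnorm(200)    # outcome driven by x, not directly by z
coef(lm(y ~ z))        # z appears related to y
coef(lm(y ~ z + x))    # adjusted for x, the y~z coefficient shrinks toward 0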
(Multiple) regression assumptions
Normality (of residuals) (-> if you were to plot the residuals you would see a normal distribution)
Linearity (-> associations between X and Y are linear, i.e. the slope is constant across the range of X)
Homogeneity of variance (of residuals)
Uncorrelated predictors (-> no collinearity)
Uncorrelated residuals (-> no effect of another unmeasured variable)
No highly-influential outliers
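A minimal R sketch: base R’s plot() on an lm object gives the standard diagnostic plots for checking several of these assumptions.
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 0.5 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))   # show all four diagnostic plots at once
plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, leverage/outliers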
T-tests and GLM
Subtype of GLM, equivalent of simple linear regression
Think of the intercept as the mean of group 1
And the slope as the distance from the intercept to the mean of group 2
Yi = b0 + b1xi + ei, where b0 = mean_group1 and b1 = mean_group2 − mean_group1 (with xi coded 0 for group 1 and 1 for group 2)
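A minimal R sketch with simulated two-group data, showing the t-test/regression equivalence:
set.seed(1)
group <- factor(rep(c("group1", "group2"), each = 50))
score <- 10 + 2 * (group == "group2") + rnorm(100)
t.test(score ~ group, var.equal = TRUE)   # difference between the two group means
summary(lm(score ~ group))   # slope = mean_group2 - mean_group1, same t and p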
ANOVA
Comparing more than 2 groups
ANOVA (Analysis Of VAriance) is a kind of general linear model that only has categorical predictors
Even though it’s called analysis of variance, it’s actually mainly interested in differences between means
A one-way ANOVA, comparing ≥ 3 means, is equivalent to a multiple regression model
The ANOVA’s test statistic is the F-ratio – the ratio of the variance between the groups to the variance within them
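A minimal R sketch with simulated three-group data, showing the F-ratio and the aov()/lm() equivalence:
set.seed(1)
group <- factor(rep(c("a", "b", "c"), each = 30))
score <- c(10, 11, 13)[as.numeric(group)] + rnorm(90)
summary(aov(score ~ group))   # F-ratio = between-group / within-group variance
anova(lm(score ~ group))      # the same F and p from an equivalent lm()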
R: Get r-value/cor coeff
cor.test(dataset$variable_1, dataset$variable_2)
R: Plot correlation
plot(dataset$variable_1, dataset$variable_2)
R: Plot with line of best fit (unstandardized)
library(ggplot2)
ggplot(dataset, aes(x = IV, y = DV)) + geom_point() + stat_smooth(method = lm)
R: linear model
new_name <- lm(DV ~ IV, data = dataset)
summary(new_name)
R: Find what type of data
class(dataset$variable)
R: Convert to factor
dataset$variable <- factor(dataset$variable)
R: Multiple regression
lm(variable_1 ~ variable_2 + predictor, data = dataset)
R: Remove outliers
dataset$variable[dataset$variable > upper_cutoff | dataset$variable < lower_cutoff] <- NA   # replace values beyond your chosen cutoffs with NA
R: Multiple regression with interaction
name <- lm(variable_1 ~ variable_2 + interactionv1:interactionv2, data = dataset)   # add further predictors if needed
summary(name)
R: Plot interaction
library(interactions)   # interact_plot() comes from the interactions package
interact_plot(name, pred = IV, modx = moderator)
R: ANOVA
name <- aov(DV ~ group, data = dataset)
summary(name)