Lecture 5: Linear Regression Flashcards
General Linear Model
Family of models used to analyse the relationship between one outcome and one or more predictors
Bivariate linear regression
Describes a linear relationship between a continuous outcome variable and a continuous predictor
Predictor
A variable that is used to predict another variable (the outcome)
Linear regression
Based on the concept of using information about other variables associated with the outcome to improve predictions. It begins with the understanding that the mean is the best predictor (expected value) when no further relevant information is available. However, if we have information about other variables (e.g., the number of hours studied being strongly associated with exam grades), we can use that information to enhance our predictions: this process is called regression.
In linear regression, we aim to find a line that best represents the best possible predictions. This line, called the regression line, goes through the middle of the cloud of data points.
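As a rough illustration of this idea, the Python sketch below (with made-up hours-studied and grade numbers) compares the total squared prediction error when everyone is predicted the mean grade versus when a regression line on hours studied is used; all names and data are purely hypothetical.

```python
import numpy as np

# Hypothetical data: hours studied and exam grades (made-up numbers)
hours = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
grade = np.array([5.1, 5.8, 6.4, 7.2, 8.1])

# Baseline prediction: the mean grade for everyone
sse_mean = np.sum((grade - grade.mean()) ** 2)

# Regression: use hours studied to improve the predictions
b, a = np.polyfit(hours, grade, deg=1)            # slope and intercept
sse_line = np.sum((grade - (a + b * hours)) ** 2)

# The regression line leaves less unexplained (squared) error than the mean
print(f"SSE around the mean: {sse_mean:.2f}, SSE around the line: {sse_line:.2f}")
```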
Formula of the regression line
Y = a + bX
a is the intercept: where the line crosses the Y-axis. This is the predicted value of Y when X equals 0.
b is the slope: how steeply the line increases or decreases.
Y is predicted to increase by b when X increases by 1
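For instance, with hypothetical values a = 1 and b = 0.5: a person with X = 6 gets a predicted value of Y = 1 + 0.5 × 6 = 4, and a person with X = 7 gets 4.5, i.e., 0.5 (= b) higher.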
Bivariate regression line
Yi = a + b * Xi + ei
Yi is the individual’s observed score on the dependent variable, a is the intercept, b is the slope, Xi is the individual’s score on the independent variable, and ei is the individual’s prediction error (residual)
Prediction error
The difference between an individual’s observed value (Yi) and the predicted value for that individual (Ŷi)
Can be calculated by subtracting the predicted value from the observed value: ei = Yi − Ŷi
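A numerical illustration (with made-up numbers): if an individual’s observed score is Yi = 7 and the regression line predicts Ŷi = 5.5 for that individual, the prediction error is ei = 7 − 5.5 = 1.5.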
Ordinary Least Squares method
Used to obtain the line that gives the best possible predictions across all participants: it chooses the intercept and slope that minimise the sum of squared prediction errors
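A minimal Python sketch of the OLS estimates, using the standard closed-form formulas on made-up data; the variable names and numbers are purely illustrative.

```python
import numpy as np

# Hypothetical data: hours of preparation (x) and exam grade (y)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([5.1, 5.8, 6.4, 7.2, 8.1])

# Closed-form OLS estimates: the slope is cov(x, y) / var(x),
# and the intercept puts the line through the point of means (x-bar, y-bar)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x        # predicted values
errors = y - y_hat       # prediction errors e_i

# OLS picks a and b so that the sum of squared errors is as small as possible
print(f"intercept a = {a:.3f}, slope b = {b:.3f}, SSE = {np.sum(errors**2):.3f}")
```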
Coefficients table
The constant equals the intercept; “Aantal uren voorbereiding” (number of hours of preparation) equals the slope. For the intercept, the T-statistic is 1.357 (obtained by dividing the intercept by its standard error). This is smaller than the critical value of 1.96, making the test non-significant.
For the slope, the T-statistic is 5.474, making it much larger than the critical value of 1.96. This makes our test significant, allowing us to reject the null hypothesis that the slope is equal to zero.
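The table’s exact coefficients and standard errors are not reproduced here, but the sketch below shows, on the same made-up data as above, how a slope’s standard error and t-statistic (coefficient divided by its standard error) can be computed; all numbers are hypothetical.

```python
import numpy as np

# Same hypothetical data as in the OLS sketch above
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([5.1, 5.8, 6.4, 7.2, 8.1])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
errors = y - (a + b * x)

# Residual variance and the standard error of the slope
s2 = np.sum(errors ** 2) / (n - 2)
se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))

# t-statistic: coefficient divided by its standard error,
# compared against the critical value (about 1.96 for large samples)
t_b = b / se_b
print(f"slope b = {b:.3f}, SE = {se_b:.3f}, t = {t_b:.2f}")
```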
Assumptions of linear regression
Assumptions are statements about the population. We can only check whether an assumption holds in the sample, but the sample may not be representative of the population.
1) Model is correctly specified, which includes:
- Linearity of relationship between X and Y
- Normality of residuals (prediction errors)
- Direction of causality (if you want to interpret your model causally)
2) Homoscedasticity (‘equal variance’)
- Residuals have equal variance (are equally spread) across all values of the predictors
- The points are homogeneously distributed around the zero line, in an evenly thick band (like a vegan sausage)
- The opposite is heteroscedasticity: the points are heterogeneously distributed around the zero line, in a funnel shape
3) Independence of observations
Checking the linearity assumption
- A scatterplot is a visual check: if the points roughly follow a straight line, the relationship is linear
- Another way is a residual plot: a random cloud around the zero line indicates that the relationship is linear; if you see a pattern, it is not (see the sketch after this list)
- Violations of linearity:
1) Outlier: when one point lies far away from all the other points
2) Curvilinear: when, rather than following a straight line, the points follow a curved pattern
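A minimal matplotlib sketch of such a residual plot, again with made-up data: a random cloud around the dashed zero line is consistent with linearity (and homoscedasticity), while a curve or funnel shape points to a violation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and fitted line (same approach as the OLS sketch)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([5.1, 5.8, 6.4, 7.2, 8.1])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

fitted = a + b * x
residuals = y - fitted

# Residual plot: residuals against fitted values, with a zero reference line
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```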
How to deal with assumption violations?
- Violation of linearity
  - Transform a variable (square, square root)
  - Include a quadratic term in your model
- Violation of normality of residuals
  - Increase the sample size
  - Use a different outcome distribution (e.g., binomial)
  - Remove outliers
- Violation of homoscedasticity (i.e., heteroscedasticity)
  - Account for the source of heteroscedasticity
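A short sketch of two of these fixes (a transformation and a quadratic term), using made-up curvilinear data; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical curvilinear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 5.2, 5.8, 6.1, 6.0])

# Option 1: transform a variable (here, a square-root transform of the outcome)
y_sqrt = np.sqrt(y)

# Option 2: include a quadratic term, i.e. fit y = a + b1*x + b2*x^2
# np.polyfit returns the coefficients from the highest power down
b2, b1, a = np.polyfit(x, y, deg=2)
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```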
R-square
Shows the proportion of variance in the dependent variable that can be explained by the independent variable(s).
Example: when looking at how years of education affect one’s income, an R-square of 0.179 means that 17.9% of the variability in income is explained by years of education. R-square is always a value between 0 and 1, and a higher R-square generally suggests a better-fitting model.
One limitation: as more predictors are added to the model, R-square will increase, even if the additional predictors do not contribute meaningfully to explaining the variance in the dependent variable.
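A small sketch of how R-square can be computed from observed values and model predictions (the numbers are invented): it is one minus the ratio of unexplained to total variation.

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([5.1, 5.8, 6.4, 7.2, 8.1])
y_hat = np.array([5.0, 6.0, 6.5, 7.3, 7.9])

ss_res = np.sum((y - y_hat) ** 2)      # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```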
Adjusted R-square
Useful for smaller samples; the larger the sample, the more similar R-square and adjusted R-square become.
Adjusted R-square takes the number of predictors in the model into account and penalises predictors that do not contribute to explaining the variance. When comparing models, a higher adjusted R-square is preferred; it indicates a better balance between model fit and the number of predictors.
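The usual adjustment formula is adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1), where n is the sample size and k the number of predictors. For example, taking the R-square of 0.179 from the income example with a hypothetical sample of n = 100 and k = 1 predictor gives 1 − 0.821 × 99/98 ≈ 0.171, slightly lower than the unadjusted value.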