Lecture 5: Linear Regression Flashcards

1
Q

General Linear Model

A

Family of models used to analyse the relationship between one outcome and one or more predictors

2
Q

Bivariate linear regression

A

Describes a linear relationship between a continuous outcome variable and a continuous predictor

3
Q

Predictor

A

Variables used to predict other variables or outcomes

4
Q

Linear regression

A

Based on the concept of using information about other variables associated with the outcome to improve predictions. It begins with the understanding that the mean is the best predictor (expected value) when no further relevant information is available. However, if we have information about other variables (e.g., the number of hours studied being strongly associated with exam grades), we can use that information to enhance our predictions: this process is called regression.

In linear regression, we aim to find the line that gives the best possible predictions. This line, called the regression line, goes through the middle of the cloud of data points.
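
A minimal sketch in Python, with invented toy data, of the idea on this card: predicting the mean for everyone versus using a regression line based on hours studied. All numbers are illustrative, not from the lecture.

    # Toy data: hours studied and exam grades (invented for illustration)
    hours = [2, 4, 6, 8, 10]
    grade = [5.0, 6.0, 6.5, 7.5, 9.0]

    mean_grade = sum(grade) / len(grade)  # best guess with no other information

    # Squared prediction errors when everyone gets the mean as prediction
    sse_mean = sum((y - mean_grade) ** 2 for y in grade)

    # Squared prediction errors for the regression line Y = 3.95 + 0.475X
    # (the OLS fit for this toy data; see the OLS card below)
    sse_line = sum((y - (3.95 + 0.475 * x)) ** 2 for x, y in zip(hours, grade))

    print(sse_mean, sse_line)  # 9.3 vs 0.275: the line predicts much better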

5
Q

Formula of the regression line

A

Y = a + bX

a is the intercept, where the line crosses the Y-axis. This is the predicted value of Y when X equals 0.

b is the slope: how steeply the line increases or decreases.

Y increases by b when X increases by 1
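
A minimal sketch, with made-up coefficients, of how the formula turns an X value into a prediction:

    # Hypothetical coefficients for Y = a + bX
    a = 3.95   # intercept: predicted Y when X = 0
    b = 0.475  # slope: Y increases by b for each one-unit increase in X

    def predict(x):
        return a + b * x

    print(predict(0))  # 3.95 -> the intercept itself
    print(predict(1))  # 4.425
    print(predict(2))  # 4.9 -> each extra unit of X adds b to the prediction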

6
Q

Bivariate regression line

A

Yi = a + b * Xi + ei

Yi is the individual’s score on the dependent variable, a is the intercept, b is the slope, Xi is the individual’s score on the independent variable, and ei is the individual’s prediction error

7
Q

Prediction error

A

The difference between an individual’s observed value Yi and the predicted value Ŷi that the regression line gives for that individual

Calculated by subtracting the predicted value from the observed value: ei = Yi − Ŷi
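
A minimal sketch of this calculation for one (invented) individual, using the toy line from the earlier cards:

    # Hypothetical line and one individual's scores
    a, b = 3.95, 0.475
    x_i = 8      # the individual's score on the predictor
    y_i = 7.5    # the individual's observed outcome

    y_hat_i = a + b * x_i   # predicted value: 7.75
    e_i = y_i - y_hat_i     # prediction error: -0.25 (the line over-predicts)
    print(y_hat_i, e_i)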

8
Q

Ordinary Least Squares method

A

Used to obtain the line that gives the best possible predictions across all participants: it finds the intercept and slope that minimise the sum of the squared prediction errors (hence “least squares”)
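
A minimal sketch of OLS for one predictor, on the toy data from the earlier cards. In the bivariate case the coefficients that minimise the sum of squared errors have a closed-form solution: b = Sxy / Sxx and a = mean(Y) − b · mean(X).

    x = [2, 4, 6, 8, 10]
    y = [5.0, 6.0, 6.5, 7.5, 9.0]

    mx = sum(x) / len(x)
    my = sum(y) / len(y)

    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # co-deviation of X and Y
    sxx = sum((xi - mx) ** 2 for xi in x)                     # squared deviation of X

    b = sxy / sxx    # slope: 0.475
    a = my - b * mx  # intercept: 3.95
    print(a, b)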

9
Q

Coefficients table

A

Constant equals the intercept, “Aantal uren voorbereiding” (number of hours of preparation) the slope. For the intercept, the t-statistic is 1.357 (obtained by dividing the intercept by its standard error). This is smaller than the critical value of 1.96, making the test non-significant.

For the slope, the t-statistic is 5.474, much larger than the critical value of 1.96. This makes the test significant, allowing us to reject the null hypothesis that the slope is equal to zero.
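
A minimal sketch of the test behind this table: each coefficient is divided by its standard error to give a t-statistic, which is compared with 1.96. The coefficient and standard-error values below are invented so that they roughly reproduce the t-statistics quoted on this card.

    def t_stat(coef, se):
        return coef / se

    critical = 1.96  # approximate two-sided 5% critical value for large samples

    t_intercept = t_stat(1.10, 0.81)    # ~1.36, hypothetical values
    t_slope = t_stat(0.475, 0.0868)     # ~5.47, hypothetical values

    for name, t in [("intercept", t_intercept), ("slope", t_slope)]:
        verdict = "significant" if abs(t) > critical else "not significant"
        print(name, round(t, 3), verdict)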

10
Q

Assumptions of linear regression

A

Assumptions are statements about the population. We can only check whether an assumption holds in the sample, and the sample may not be representative of the population.

1) Model is correctly specified, which includes:
- Linearity of the relationship between X and Y
- Normality of residuals (prediction errors)
- Direction of causality (if you want to interpret your model causally)

Checking linearity:
- A scatterplot is a visual check of whether the points follow a straight line => if so, the relationship is linear
- Another way is a residual plot: a random cloud around the zero line indicates that the relationship is linear; if you see a pattern, it is not (see the sketch below)
- Violations of linearity:
1) Outlier: one dot lies far away from all the other dots
2) Curvilinear: rather than following a straight line, the dots follow another, curvy line

2) Homoscedasticity (‘equal variance’)
- Residuals are equally spread out for all values of the predictors
- The points are homogeneously distributed around the zero line, like a vegan sausage
- The opposite is heteroscedasticity: the points are heterogeneously distributed around the zero line, in a funnel shape

3) Independence of observations
- Each observation must be independent of the others (e.g., repeated measurements of the same person should not be treated as separate, independent cases)
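
A minimal sketch of the residual-plot check described above, assuming matplotlib is available; the data and coefficients are the toy values from the earlier cards.

    import matplotlib.pyplot as plt

    x = [2, 4, 6, 8, 10]
    y = [5.0, 6.0, 6.5, 7.5, 9.0]
    a, b = 3.95, 0.475

    y_hat = [a + b * xi for xi in x]
    residuals = [yi - yh for yi, yh in zip(y, y_hat)]

    # Residuals versus predicted values: a random cloud around the zero line
    # suggests linearity and homoscedasticity; a funnel shape suggests
    # heteroscedasticity; a curve suggests a non-linear relationship.
    plt.scatter(y_hat, residuals)
    plt.axhline(0)
    plt.xlabel("Predicted value")
    plt.ylabel("Residual")
    plt.show()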

11
Q

How to deal with assumption violations?

A
  1. Violation of linearity
    - Transform a variable (square, square root)
    - Include a quadratic term in your model (see the sketch below)
  2. Violation of normality of residuals
    - Increase the sample size
    - Use a different outcome distribution (e.g., binomial)
    - Remove outliers
  3. Violation of homoscedasticity (i.e., heteroscedasticity)
    - Account for the source of the heteroscedasticity
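
A minimal sketch of the first two remedies for non-linearity, on invented data: transforming a predictor and adding a quadratic term. With the quadratic term the model becomes Y = a + b1·X + b2·X² + e, which can capture a curvilinear relationship while staying linear in its coefficients.

    import math

    x = [1, 2, 3, 4, 5]

    x_sqrt = [math.sqrt(xi) for xi in x]  # square-root transform of the predictor
    x_sq = [xi ** 2 for xi in x]          # quadratic term to add alongside X

    print(x_sqrt)
    print(x_sq)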
12
Q

R-square

A

Shows the proportion of variance in the dependent variable which can be explained by the independent variables.

Example: when looking at how years of education affect income, an R-square of 0.179 means that 17.9% of the variability in income can be explained by years of education. R-square always lies between 0 and 1, and a higher R-square generally suggests a better-fitting model.

One limitation: as more predictors are added to the model, R-square will increase, even if the additional predictors do not contribute meaningfully to the model explaining the variance in the dependent variable.
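
A minimal sketch of R-square as 1 − SS_residual / SS_total, on the toy data and fitted line from the OLS card (the 0.179 in the example above comes from the lecture, not from this data).

    x = [2, 4, 6, 8, 10]
    y = [5.0, 6.0, 6.5, 7.5, 9.0]
    a, b = 3.95, 0.475

    my = sum(y) / len(y)
    ss_total = sum((yi - my) ** 2 for yi in y)                        # all variance in Y
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained part

    r_square = 1 - ss_resid / ss_total
    print(round(r_square, 3))  # about 0.970: proportion of variance explained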

13
Q

Adjusted R-square

A

Useful for smaller samples. The larger the sample, the more similar R-square and adjusted R-square become.

Adjusted R-square takes the number of predictors in the model into account: it penalises the model for predictors that do not contribute to explaining the variance. When comparing models, a higher adjusted R-square is preferred; it indicates a better balance between model fit and the number of predictors.
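
A minimal sketch of the adjustment: R-square is penalised for the number of predictors k relative to the sample size n, using the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1).

    def adjusted_r_square(r_square, n, k):
        return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

    # Toy fit from the earlier cards: n = 5 observations, k = 1 predictor
    print(round(adjusted_r_square(0.970, 5, 1), 3))  # 0.96, slightly below R-square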
