Simple Regression & Multiple Regression (W9) ✅ Flashcards
What are the main differences between bivariate linear correlation and regression?
Similarity: both are used when the relationship between x and y can be described with a straight line
Differences:
1. Correlation ONLY determines the strength of relationship between x and y
- Regression:
- allows us to estimate how much y will change as a result of a given change in x
- regression also distinguishes between the variable being predicted & the variable used to predict (which is measured, NOT manipulated)
x = predictor/independent/ explanatory variable
y = outcome/dependent/ criterion variable
=> HOWEVER! regression still does not provide direct evidence of causality (it does NOT show that x causes y)
What are the 3 stages of regression?
- Analyse the relationship between variables (find strength and direction of the relationship)
- Propose a model to explain that relationship
-> regression line = line of best fit
- Evaluate the model: assess goodness of fit
-> is our regression model better at predicting y than the simplest model (which assumes no relationship between x & y and predicts only the mean of all y-values)?
What are the two properties of the regression line?
a = the intercept
-> value of y when x is 0 (starting point)
b = the slope
-> how much y changes when x increases by 1 unit
Formula to calculate y-value based on x-value:
y = bx + a
(when x and y are negatively correlated, b is negative)
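A minimal sketch in Python of fitting y = bx + a by least squares (made-up data; variable names are illustrative, not from the deck):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# b = covariance(x, y) / variance(x); a = mean(y) - b * mean(x)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
print(f"predicted y at x = 6: {b * 6 + a:.3f}")
```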
How to calculate goodness of fit and assess it?
Calculating goodness of fit:
- Calculate the total sum of squares (SST): the squared differences between the observed values of y and the mean of y (the simplest model, where b = 0)
-> variance not explained by the simplest model
- Calculate the residual sum of squares (SSR): the squared differences between the observed values of y and those predicted by the regression line
-> variance not explained by the regression model
- Calculate the model sum of squares (SSM): reflects the improvement in prediction using the regression model compared to the simplest model
-> SST - SSR = SSM
=> The larger the SSM, the bigger the improvement
Assessing the goodness of fit -> using F-test
-> takes the degrees of freedom (df) into account
-> rather than using Sum of Squares (SS) values, use Mean Squares (MS) values
F = MSM / MSR
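A minimal sketch of the SST / SSR / SSM decomposition and the F ratio on this card (illustrative data; the df formulas assume k = 1 predictor):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the regression line first (see the earlier sketch)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
y_hat = b * x + a

SST = np.sum((y - y.mean()) ** 2)  # total: error of the simplest (mean-only) model
SSR = np.sum((y - y_hat) ** 2)     # residual: error left over by the regression model
SSM = SST - SSR                    # model: improvement over the simplest model

k, n = 1, len(y)
MSM = SSM / k                      # df_model = k
MSR = SSR / (n - k - 1)            # df_residual = n - k - 1
F = MSM / MSR
print(f"SST={SST:.2f}  SSR={SSR:.2f}  SSM={SSM:.2f}  F={F:.1f}")
```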
How to interpret the goodness of fit?
- If the regression model is good at predicting y:
-> the improvement in prediction due to the model (MSM) will be large
-> level of inaccuracy of the model (MSR) will be small
=> F value further from 0
- Assess probability: assume the Null Hypothesis is true ("the regression model and the simplest model are equal in terms of predicting y")
-> MSM = 0
- A significant result (p < 0.05) suggests that the regression model provides a better fit for the data than the simplest model
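A minimal sketch of turning the F value into a p-value with SciPy's F distribution (the F, k and n values here are illustrative stand-ins, e.g. from the sketch above):

```python
from scipy import stats

F, k, n = 30.0, 1, 20                     # F value, no. of predictors, sample size (made up)
p = stats.f.sf(F, dfn=k, dfd=n - k - 1)   # survival function = P(F-dist >= observed F)
print(f"F({k}, {n - k - 1}) = {F:.2f}, p = {p:.4f}")
if p < 0.05:
    print("The regression model predicts y better than the mean-only model.")
```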
What are the assumptions of linear regression (as compared to linear correlation)?
- Linearity: x and y must be linearly related
- Absence of outliers
-> regression is extremely sensitive to outliers
-> it may be appropriate to remove them
- Normality, linearity and homoscedasticity, independence of residuals
- Normality: residuals are normally distributed around the predicted outcome
- Linearity: residuals have a straight-line relationship with the predicted outcome
- Homoscedasticity: the variance of the residuals about the predicted outcome should be the same for all predicted scores
=> No non-parametric equivalent
What should (1) the Normal P-P plot and (2) the Scatterplot of the Regression Standardized Residuals ideally look like?
- (1) Ideally, data points will lie in a reasonably straight diagonal line, from bottom left to top right
-> no major deviations from normality
- (2) Ideally, residuals will be roughly rectangularly distributed, with most scores concentrated in the centre (around 0)
-> don't want to see a systematic pattern in the residuals (curvilinear, or higher on one side)
-> Outliers: standardised residuals > 3.3 or < -3.3
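A minimal sketch of the outlier check on this card: standardise the residuals and flag any beyond ±3.3 (data are made up, and the simple "divide by the SD" standardisation here is an illustrative stand-in, not exactly SPSS's formula):

```python
import numpy as np

def standardised_residuals(y, y_hat):
    """Centre the residuals and divide by their sample standard deviation."""
    resid = y - y_hat
    return (resid - resid.mean()) / resid.std(ddof=1)

y     = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.1, 4.1, 6.0, 8.0, 9.9])   # predictions from some fitted model
z = standardised_residuals(y, y_hat)
print("standardised residuals:", np.round(z, 2))
print("outlier indices:", np.where(np.abs(z) > 3.3)[0])
```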
What are the relationships between R, R^2 & Adjusted R^2 (in Model Summary Table)? Are R^2 and r^2 the same?
R (√R^2): strength of relationship between x and y
-> sign is not given
R^2: proportion of variance in y explained by the model (SSM), relative to the total variance in y (SST)
-> R^2 = SSM/SST
Adjusted R^2: corrects R^2 for the number of predictors and the sample size, giving a less optimistic estimate of the population value
Are R^2 and r^2 the same?
-> If only one predictor then yes
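A quick check of the "one predictor" claim: computing R^2 as SSM/SST and comparing it with the squared Pearson r (illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the line, then compute R^2 = SSM/SST = 1 - SSR/SST
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
y_hat = b * x + a

R2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r  = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {R2:.4f}, r^2 = {r**2:.4f}")   # identical with a single predictor
```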
Explain multiple regression and its assumptions, as compared to simple linear regression.
- Multiple regression allows us to assess the influence of several predictor variables (e.g. x1, x2, x3, etc.) on the outcome variable (y), whether the predictor variables are:
-> combined
-> considered separately
-> y = b1x1 + b2x2 + … + a
----
Assumptions:
- Linearity
- Absence of outliers
- Multicollinearity: ideally, predictors should be correlated with the outcome variable (y), NOT with one another
-> predictors risk measuring the same thing if r = .9 or above between them
- Normality, linearity and homoscedasticity, independence of residuals
- Sufficient sample size
-> results might be over-optimistic (not generalisable) if there are too few participants (Ps)
=> the checks for the P-P plot and Scatterplot are the same as for simple regression
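A minimal sketch of a two-predictor regression fitted with NumPy least squares, plus the r = .9 intercorrelation check from the multicollinearity bullet (all data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y  = 2.0 * x1 - 1.5 * x2 + 3.0 + rng.normal(scale=0.5, size=50)  # made-up outcome

X = np.column_stack([x1, x2, np.ones_like(x1)])  # the ones column gives the intercept a
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # solves y = b1*x1 + b2*x2 + a
b1, b2, a = coef
print(f"b1 = {b1:.2f}, b2 = {b2:.2f}, a = {a:.2f}")

# Multicollinearity check: predictors should not be highly intercorrelated
r12 = np.corrcoef(x1, x2)[0, 1]
if abs(r12) >= 0.9:
    print("Warning: predictors may be measuring the same thing.")
```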