W1: Multiple Linear Regression Flashcards
In multiple linear regression, there are:
Multiple independent variables (X) and one dependent variable (Y)
The multiple R-squared value for a regression represents the proportion of the variation in the Y variable that can be explained by its regression on the X variables.
True or False?
True
The assumptions which we need to check when we perform a multiple linear regression are (3):
Normality of the errors
Common variance of the errors
Independence of the errors
For the Kolmogorov-Smirnov and Shapiro-Wilk tests of Normality, if p < 0.05 then we conclude that the Normality assumption has been satisfied.
True or False?
False
If the p-value for a correlation coefficient was p = 0.036 then the correlation would be significant at
5% level
We can use multiple linear regression to allow the use of several X-variables (predictors/IV) to predict the
response Y
What is the multiple linear regression model equation?
Y = a + (b1 * X1) + (b2 * X2) + … + e
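As a quick sketch, the model equation can be coded directly; the intercept, slopes, and predictor values below are made-up numbers for illustration only:

```python
# Y = a + b1*X1 + b2*X2 + ... (the error term e is what the model cannot predict)
def predict(a, slopes, xs):
    """Fitted value: constant a plus each slope b_i times its predictor X_i."""
    return a + sum(b * x for b, x in zip(slopes, xs))

# Illustrative values: a = 2.0, b1 = 0.5, b2 = -1.0, X1 = 4, X2 = 3
fitted = predict(2.0, [0.5, -1.0], [4, 3])
print(fitted)  # 2.0 + 0.5*4 - 1.0*3 = 1.0
```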
What is the multiple linear regression model equation? - Y
Y is the response (DV)
What is the multiple linear regression model equation? - X
X are the predictors (IVs)
What is the multiple linear regression model equation? - b1/b2
b1/b2 are the slopes/gradients
What is the multiple linear regression model equation? - a
a is the constant (intercept)
What is the multiple linear regression model equation? - e
e is the error term
In multiple linear regression, each predictor variable (X) has its own
coefficient (b1/b2)
Why is there an error term ( e ) in multiple linear regression?
Knowing the values of X1, X2, … does not allow us to predict the value of Y exactly
What is a residual?
Difference between the observed Y-value and its prediction (fitted value) based on corresponding X-values
How to calculate residual?
Residual = Observation - Fitted Value
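A minimal sketch of the residual calculation; the observed and fitted values here are example numbers only:

```python
# Residual = observation - fitted value
def residual(observation, fitted):
    return observation - fitted

# Example: observed Y = 78, fitted value = 87.06 -> negative residual,
# meaning the model over-predicted by about 9.06
print(residual(78.0, 87.06))
```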
If the scatterplot of residuals shows a funnel effect
The residuals do not show independence/common variance, so these assumptions are violated
Test significance of each predictor: test the null and alternative hypotheses that:
H0: b = 0 vs H1: b ≠ 0 (for each particular X variable)
Generally, an R-squared above 0.6 (2)
makes a model worth your attention
means that most of the variability in the Y variable can be explained by the X variables/the multiple linear regression model
Step 1 (In SPSS): Writing Regression Equation (2)
The regression equation is:
MRI Count = 237.598 + 55.236(Gender) + 1.280 (PIQ) + 6.515 (Height)
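To check how the equation reads, a prediction can be computed from it directly; the input values Gender = 1, PIQ = 100, Height = 68 are invented for illustration, not taken from the data:

```python
# Hypothetical inputs plugged into the SPSS regression equation above
gender, piq, height = 1, 100, 68
mri_count = 237.598 + 55.236 * gender + 1.280 * piq + 6.515 * height
print(round(mri_count, 3))
```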
Step 1 (In R): Writing Regression Equation (2)
The regression equation is:
Costs = -3085.657 - 86.774(Region) + 511.084(Sex) + 115.61(Age) - 2.62(Marital) + 51.16(Alcohol) + 138.00(Cigs) - 269.264(Exercise)
How can you tell the Y and X variables used in a multiple linear regression model in R? (4)
- Costs = Y
- X = Region, Sex, Age, Marit, Alco
- The data come from ex.data
- This is all stored in a variable called model
Step 2: Writing R^2 and Interpreting it (In R) - (2) where R^2 is less than 60% (11.3%)
R^2 = 0.113 and so 11.3% of the variation in the Y variable (name it) can be explained by our multiple linear regression model using the X variables (e.g., using the X2 and X4 variables)
Most of the variation remains unexplained
Step 2: Writing R^2 and Interpreting it (In SPSS)
We see R^2 is 0.618 and so 61.8% of the variability in MRI count is explained by our multiple linear regression model
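R^2 can also be computed by hand as 1 - SS_res/SS_tot; a sketch with made-up observed and fitted values:

```python
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
def r_squared(observed, fitted):
    mean_y = sum(observed) / len(observed)
    ss_tot = sum((y - mean_y) ** 2 for y in observed)             # total variability of Y
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained part
    return 1 - ss_res / ss_tot

# Illustrative data: the fitted values track the observations closely,
# so R^2 comes out near 1 (most variability explained)
obs = [3.0, 5.0, 7.0, 9.0]
fit = [2.8, 5.3, 6.9, 9.2]
print(r_squared(obs, fit))
```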
Step 3 Rule: What P value to include or not?
p </= 0.05 (p less than or equal to 0.05)
Step 3 Rule: How to interpret significance in R? (5)
- Anything with ' ' is significant only at the 100% level (non-significant for multiple linear regression)
- Anything with . is significant at 10% (non-significant for multiple linear regression)
- Anything with one * is significant at 5%
- Anything with two ** is significant at 1%
- Anything with three *** is significant at 0.1%
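The significance codes above (R's "Signif. codes" line in summary() output) can be sketched as a small lookup; the example p-values match the ones used later in these cards:

```python
# Map a p-value to R's significance code:
# '***' p<=0.001, '**' p<=0.01, '*' p<=0.05, '.' p<=0.1, ' ' otherwise
def signif_code(p):
    if p <= 0.001:
        return '***'
    if p <= 0.01:
        return '**'
    if p <= 0.05:
        return '*'
    if p <= 0.1:
        return '.'
    return ' '

print(signif_code(0.0397))  # '*'  -> significant at the 5% level
print(signif_code(0.123))   # ' '  -> not significant
```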
Step 3: Interpreting p-value of predictor and whether to include them (In R) - (3)
The coefficient for X2 is significant at 5% level ( p = 0.0397)
whereas the coefficient for X4 is not significant (p = 0.123)
Only X2 should be kept in model
Step 3: Interpreting p-value of predictor and whether to include them (In SPSS) - (3)
The coefficients for Gender, PIQ and Height are all significant at the 5% level or greater, and so all can be kept in the model
How to write B value?
Step 4: Interpreting assumptions - histogram is normally distributed
Step 4: Interpreting assumptions - histogram is not normally distributed
Step 4: Interpreting assumptions - scatterplot shows random scatter
Step 4: Interpreting assumptions - scatterplot does not show random scatter
Step 5: Making a prediction and finding the residual for the following squirrel
Time = 52.9382 + 21.6954(Mass) - 0.8899(Length) + 2.9466 + 0.5157(Distance) - (5)
Input the values into the equation
Time = 52.9382 + 21.6954(1.1) - 0.8899(17) + 2.9466(1.2) + 0.5157(42.37)
Time = 87.060969 (Fitted value)
Residual = Observation - Fitted Value
Residual = 78 (from table) - Fitted
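The arithmetic above can be checked directly; the coefficient 2.9466, whose variable name is missing in the card, is applied to the value 1.2 exactly as given:

```python
# Step 5 check: plug the squirrel's values into the regression equation
fitted = 52.9382 + 21.6954 * 1.1 - 0.8899 * 17 + 2.9466 * 1.2 + 0.5157 * 42.37
residual = 78 - fitted  # observation taken from the table
print(round(fitted, 6))    # 87.060969
print(round(residual, 6))  # -9.060969
```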
The Kolmogorov-Smirnov test p-value should be greater than 0.05 so that the
assumption of normality of the errors is satisfied
Written assumptions (2)
For a multiple linear regression whose assumptions are satisfied:
The histogram of residuals and the normality tests (p = 0.749 and p = 0.182) suggest that we have no evidence against the assumption of normal errors
The scatterplot of predicted values against residuals doesn't show any pattern, suggesting the independence and constant variance assumptions on the errors are reasonable.
What would your next steps in modelling confidence based on the multiple regression analysis be? (4) Grade and income covariates not significant
Try to remove covariates from the regression
In a backward elimination strategy we would remove the least significant covariate (income) and consider its effect on R^2 and the significance of the remaining covariates
Following that, we could remove grade to see its effect on the regression model
The best regression model is one with a high R^2 and the fewest covariates
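One step of the backward elimination described above can be sketched as follows; the covariate names and p-values are invented for illustration:

```python
# Backward elimination step: find the covariate whose p-value is largest,
# and drop it if it is above the 0.05 threshold; otherwise keep the model.
def next_to_remove(p_values, alpha=0.05):
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > alpha else None

p_values = {'grade': 0.21, 'income': 0.47, 'hours': 0.003}
print(next_to_remove(p_values))  # income goes first (least significant)
```

After refitting without income, the same check would be repeated on the remaining covariates.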