WEEK 4 Flashcards
Post week learning reflection
What is a response variable? Give an example
The outcome variable we are interested in measuring or predicting
E.g. in a study of plant growth and sun exposure, the response is plant growth
What is a predictor variable? Give an example
The factor(s) that may impact, directly or indirectly, our response variable
E.g. in a study of plant growth and sun exposure, the predictor is sun exposure
What is the slope?
The rate at which the response variable changes for a one-unit change in the predictor
Define residuals. How are they calculated?
They are the differences between the observed values of the response variable and the values predicted by the model
ei = yi − ŷi
where ei is the residual for the ith observation, yi is the observed response, and ŷi is the predicted response.
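A minimal sketch of this calculation in Python; the data values and fitted coefficients (b0, b1) are made up for illustration:

```python
# Residuals e_i = y_i - y_hat_i for a simple linear fit y_hat = b0 + b1*x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]    # observed responses

b0, b1 = 0.1, 1.96               # assumed fitted intercept and slope
y_hat = [b0 + b1 * xi for xi in x]                    # predicted responses
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]   # e_i = y_i - y_hat_i
print(residuals)
```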
What is the coefficient of determination, and what does it tell us?
The coefficient of determination, R^2, quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables
R^2 = 1 − SSres/SStot
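A quick sketch of computing R^2 directly from this definition; the observed and predicted values are invented for illustration:

```python
# R^2 = 1 - SSres/SStot, with illustrative numbers.
y = [2.0, 4.0, 6.0, 8.0]          # observed responses
y_hat = [2.2, 3.8, 6.1, 7.9]      # model predictions (assumed)

mean_y = sum(y) / len(y)
ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # residual SS
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                # total SS
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```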
How can we tell if the model is well-fitted?
The residuals should scatter randomly around zero, with no discernible pattern, when plotted against the predicted values or against any of the independent variables. This suggests that the model does not suffer from non-linearity, heteroscedasticity, or other issues that could affect the reliability of the predictions.
Define unexplained variation
It is the sum of squares of the residuals, known as the residual sum of squares SSres
Define the total sum of squares
SStot measures the total variance in the observed data. It is the sum of the squares of the differences between the observed values and the mean of the observed data
Define the explained variation
SSexp is the part of the total variation in the response variable that is explained by the regression model. It is the difference between the total sum of squares and the residual sum of squares:
SSexp = SStot - SSres
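The decomposition above can be sketched with the same kind of made-up observed/predicted values; note that SSexp/SStot equals R^2:

```python
# Variance decomposition SSexp = SStot - SSres (illustrative numbers).
y = [2.0, 4.0, 6.0, 8.0]          # observed responses
y_hat = [2.2, 3.8, 6.1, 7.9]      # model predictions (assumed)

mean_y = sum(y) / len(y)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)                # total variation
ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # unexplained
ss_exp = ss_tot - ss_res                                    # explained
print(ss_exp, ss_exp / ss_tot)    # the ratio equals R^2
```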
What values can R^2 take and what do they mean?
An R^2 of 0 indicates that the model does not explain any of the variability of the response data around its mean.
An R^2 of 1 indicates that the model explains all the variability of the response data around its mean.
In other words, the closer the R^2 value is to 1, the better the model fits the data
What are some limitations of using R^2 to judge how well the model fits?
It never decreases when more predictors are added, which can lead to overfitting if we are not careful
It does not indicate whether the model is adequate or whether every predictor in the model is significant.
R^2 does not provide information on the correctness of the model structure or about the quality of the predictions for new data points.
Explain ANOVA for regression tests
It assesses whether any of the predictors in a multiple regression model contribute to explaining the variability in the response variable. It compares a model with all predictors included against a reduced model with only the intercept (response mean).
ANOVA for regression uses the F-statistic, which is the ratio of the mean square regression (MSR) to the mean square error (MSE).
Explain t-test for regression coefficients
Each regression coefficient can be tested individually using a t-test to determine if it is significantly different from zero. The null and alternative hypotheses for each predictor are:
H0: βj=0 (The predictor xj has no effect on the response variable.)
HA: βj≠0 (The predictor xj does have an effect on the response variable.)
Calculated as:
t = (βj − 0) / SE(βj)
where βj is the estimated coefficient and SE(βj) is the standard error of the coefficient. This t-statistic follows a t-distribution with n−p−1 degrees of freedom, where n is the sample size and p is the number of predictors.
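A minimal sketch of the t-statistic for one coefficient; the coefficient estimate and its standard error are assumed values (in practice they come from the fitted model):

```python
# t-statistic for testing H0: beta_j = 0 (illustrative values).
beta_j = 1.96        # estimated coefficient (assumed)
se_beta_j = 0.25     # standard error of the coefficient (assumed)

t = (beta_j - 0) / se_beta_j
print(t)   # compare against a t-distribution with n - p - 1 df
```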
What are the equations for an F-statistic test
F = MSR/MSE
Where:
MSR = SSreg/DFreg
MSE = SSres/DFres
The F-statistic follows an F-distribution with DFreg and DFres degrees of freedom
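The F-statistic calculation can be sketched as follows; the sums of squares, sample size n, and number of predictors p are made-up values, with DFreg = p and DFres = n − p − 1:

```python
# F = MSR/MSE from sums of squares and degrees of freedom (illustrative).
ss_reg, ss_res = 120.0, 30.0      # regression and residual sums of squares
n, p = 25, 2                      # sample size and number of predictors
df_reg, df_res = p, n - p - 1     # degrees of freedom: 2 and 22

msr = ss_reg / df_reg             # mean square regression
mse = ss_res / df_res             # mean square error
f_stat = msr / mse
print(round(f_stat, 2))
```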
What are the 3 tests we must do for model validation?
Linearity: The relationship between predictors and the response should be linear.
Normality: The residuals should be normally distributed.
Homoscedasticity: The variance of residuals should be constant across predicted values.