WEEK 4 Flashcards

Post week learning reflection

1
Q

What is a response variable? Give an example

A

The outcome we are interested in explaining or predicting
E.g. in a study of plant growth and sun exposure, the response is plant growth

2
Q

What is a predictor variable? Give an example

A

The factor(s) that may affect, directly or indirectly, the response variable
E.g. in a study of plant growth and sun exposure, the predictor is sun exposure

3
Q

What is the slope?

A

The rate at which the response variable changes for a one-unit change in the predictor
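A minimal sketch of estimating and reading a slope in Python, assuming numpy is available; the sun-exposure and growth numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical data: hours of sun exposure (predictor) and plant growth in cm (response).
sun_hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
growth_cm = np.array([3.1, 5.0, 7.2, 8.9, 11.1])

# Degree-1 polyfit returns the least-squares [slope, intercept].
slope, intercept = np.polyfit(sun_hours, growth_cm, 1)

# The slope is the expected change in growth (cm) per additional hour of sun.
print(f"slope = {slope:.2f} cm/hour, intercept = {intercept:.2f} cm")
```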

4
Q

Define residuals. How are they calculated?

A

A residual is the difference between the observed value of the response variable and the value predicted by the model:
ei = yi − ŷi
where ei is the residual for the ith observation, yi is the observed response, and ŷi is the predicted response.
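A short Python sketch of the calculation, assuming numpy; the observed and predicted values are hypothetical.

```python
import numpy as np

# Hypothetical observed responses and model predictions for five observations.
y_obs = np.array([3.1, 5.0, 7.2, 8.9, 11.1])
y_hat = np.array([3.0, 5.1, 7.0, 9.1, 11.0])

# Residuals: ei = yi - ŷi for each observation.
residuals = y_obs - y_hat
print(residuals)  # approximately [ 0.1 -0.1  0.2 -0.2  0.1]
```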

5
Q

What is the coefficient of determination, and what does it measure?

A

The coefficient of determination, R^2, quantifies the proportion of the variance in the dependent variable that is explained by the independent variables.

R^2 = 1 − SSres/SStot
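A minimal numpy sketch of the formula, reusing the hypothetical observed and predicted values from the previous card.

```python
import numpy as np

y_obs = np.array([3.1, 5.0, 7.2, 8.9, 11.1])
y_hat = np.array([3.0, 5.1, 7.0, 9.1, 11.0])

ss_res = np.sum((y_obs - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```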

6
Q

How can we tell if the model is well-fitted?

A

The residuals should scatter randomly around zero, with no discernible pattern, when plotted against the predicted values or against any of the independent variables. Random scatter suggests that the model does not suffer from non-linearity, heteroscedasticity, or other issues that could affect the reliability of the predictions.
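A sketch of the usual visual check, assuming matplotlib; the fitted values and residuals below are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical fitted values and residuals from some regression model.
fitted = np.array([3.0, 5.1, 7.0, 9.1, 11.0])
residuals = np.array([0.1, -0.1, 0.2, -0.2, 0.1])

# A well-fitted model shows residuals scattered randomly around the zero line,
# with no funnel shape (heteroscedasticity) and no curvature (non-linearity).
plt.scatter(fitted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()
```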

7
Q

Define unexplained variation

A

It is the sum of the squared residuals, known as the residual sum of squares: SSres = Σ(yi − ŷi)^2

8
Q

Define the total sum of squares

A

SStot measures the total variation in the observed data: the sum of squared differences between the observed values and their mean, SStot = Σ(yi − ȳ)^2

9
Q

Define the explained variation

A

SSexp is the part of the total variation in the response variable that is explained by the regression model. It is the difference between the total sum of squares and the residual sum of squares:
SSexp = SStot − SSres
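A quick numerical check of this decomposition in Python (numpy assumed), reusing the hypothetical values from the earlier cards.

```python
import numpy as np

y_obs = np.array([3.1, 5.0, 7.2, 8.9, 11.1])
y_hat = np.array([3.0, 5.1, 7.0, 9.1, 11.0])

ss_res = np.sum((y_obs - y_hat) ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
ss_exp = ss_tot - ss_res  # explained variation

# R^2 can equivalently be read as the explained share of the total variation.
print(f"SSexp = {ss_exp:.3f}, SSexp/SStot = {ss_exp / ss_tot:.3f}")
```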

10
Q

What values can R^2 take and what do they mean?

A

For an OLS model with an intercept, R^2 ranges from 0 to 1.
An R^2 of 0 indicates that the model explains none of the variability of the response data around its mean.
An R^2 of 1 indicates that the model explains all of the variability of the response data around its mean.
In other words, the closer the R^2 value is to 1, the better the model fits the data.

11
Q

What are some limitations of using R^2 to judge how well the model fits?

A

R^2 never decreases when more predictors are added, which can encourage overfitting if we are not careful.
It does not indicate whether the model is adequate or whether every predictor in the model is significant.
R^2 provides no information about the correctness of the model structure or about the quality of predictions for new data points.
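A small illustration of the first limitation, assuming statsmodels and numpy; the simulated data and variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)   # response truly depends on x only
noise = rng.normal(size=n)       # an irrelevant extra predictor

m1 = sm.OLS(y, sm.add_constant(x)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, noise]))).fit()

# R^2 never decreases when a predictor is added, even a useless one;
# adjusted R^2 penalises the extra parameter and may drop instead.
print(m1.rsquared, m2.rsquared)          # m2.rsquared >= m1.rsquared
print(m1.rsquared_adj, m2.rsquared_adj)
```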

12
Q

Explain ANOVA for regression tests

A

It assesses whether any of the predictors in a multiple regression model contribute to explaining the variability in the response variable. It compares a model with all predictors included against a reduced model with only the intercept (response mean).
ANOVA for regression uses the F-statistic, which is the ratio of the mean square regression (MSR) to the mean square error (MSE).
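A minimal sketch of the overall F-test, assuming statsmodels; the two predictors and coefficients below are simulated for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 2))  # two hypothetical predictors
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()

# Overall F-test: H0 is that all slope coefficients are zero,
# i.e. the full model explains no more than the intercept-only model.
print(f"F = {model.fvalue:.2f}, p-value = {model.f_pvalue:.4g}")
```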

13
Q

Explain t-test for regression coefficients

A

Each regression coefficient can be tested individually using a t-test to determine if it is significantly different from zero. The null and alternative hypotheses for each predictor are:
H0: βj = 0 (the predictor xj has no effect on the response variable)
HA: βj ≠ 0 (the predictor xj has an effect on the response variable)
Calculated as:
t = (βj − 0) / SE(βj)
where βj is the estimated coefficient and SE(βj) is its standard error. This t-statistic follows a t-distribution with n − p − 1 degrees of freedom, where n is the sample size and p is the number of predictors.
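A sketch of the per-coefficient t-tests in statsmodels, with simulated data; the second predictor is given no true effect, so its t-statistic should be small.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 60
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=n)  # second predictor has no real effect

model = sm.OLS(y, sm.add_constant(X)).fit()

# t_j = estimated coefficient / its standard error; statsmodels also reports these.
manual_t = model.params / model.bse
print(manual_t)        # matches model.tvalues
print(model.pvalues)   # two-sided p-values against H0: beta_j = 0
```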

14
Q

What are the equations for the F-statistic test?

A

F = MSR/MSE
Where:
MSR = SSreg/DFreg
MSE = SSres/DFres
The F-statistic follows an F-distribution with DFreg and DFres degrees of freedom (DFreg = p, the number of predictors, and DFres = n − p − 1)
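A worked example of the arithmetic, assuming scipy for the F-distribution; the sums of squares and sample size are hypothetical.

```python
from scipy import stats

# Hypothetical model: n = 30 observations, p = 3 predictors.
n, p = 30, 3
ss_reg, ss_res = 120.0, 60.0

df_reg = p              # regression degrees of freedom
df_res = n - p - 1      # residual degrees of freedom

msr = ss_reg / df_reg   # mean square regression
mse = ss_res / df_res   # mean square error
f_stat = msr / mse

# p-value from the F-distribution with (df_reg, df_res) degrees of freedom.
p_value = stats.f.sf(f_stat, df_reg, df_res)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```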

15
Q

What are the 3 tests we must do for model validation?

A

Linearity: The relationship between predictors and the response should be linear.
Normality: The residuals should be normally distributed.
Homoscedasticity: The variance of residuals should be constant across predicted values.
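One way to check normality and homoscedasticity numerically (linearity is usually inspected visually, as in the next card), assuming statsmodels and scipy; the data are simulated.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
n = 80
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.7, -0.4]) + rng.normal(size=n)

model = sm.OLS(y, X).fit()

# Normality of residuals: Shapiro-Wilk (a large p-value gives no evidence against normality).
print(stats.shapiro(model.resid))

# Homoscedasticity: Breusch-Pagan test against the model's design matrix.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value = {lm_pvalue:.3f}")
```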

16
Q

What plots can we use to visually check the model?

A

Residuals vs Fitted: This plot checks the assumption of linearity and homoscedasticity. Ideally, the residuals should be randomly dispersed around the horizontal line (red line), indicating a linear relationship and constant variance (homoscedasticity).
Q-Q Plot (Quantile-Quantile): The Q-Q plot checks whether the residuals are normally distributed. Points following the dashed line indicate normality.
Scale-Location (or Spread-Location): This plot also checks homoscedasticity by showing if residuals are spread equally across all levels of fitted values. The red line should be flat and horizontal.
Residuals vs Leverage: This helps identify influential observations that have a disproportionate impact on the model. Points outside the dashed Cook's distance lines may be influential.
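A matplotlib/statsmodels sketch that reproduces rough versions of these four diagnostic plots; the fitted model and data are simulated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 80
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.7, -0.4]) + rng.normal(size=n)
model = sm.OLS(y, X).fit()

fitted = model.fittedvalues
resid = model.resid
influence = model.get_influence()
std_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Residuals vs Fitted: linearity and constant variance.
axes[0, 0].scatter(fitted, resid)
axes[0, 0].axhline(0, color="red", linestyle="--")
axes[0, 0].set(title="Residuals vs Fitted", xlabel="Fitted values", ylabel="Residuals")

# 2. Q-Q plot: normality of residuals.
sm.qqplot(resid, line="s", ax=axes[0, 1])
axes[0, 1].set(title="Normal Q-Q")

# 3. Scale-Location: spread of residuals across fitted values.
axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)))
axes[1, 0].set(title="Scale-Location", xlabel="Fitted values",
               ylabel="sqrt(|standardized residuals|)")

# 4. Residuals vs Leverage: influential observations.
axes[1, 1].scatter(leverage, std_resid)
axes[1, 1].set(title="Residuals vs Leverage", xlabel="Leverage",
               ylabel="Standardized residuals")

plt.tight_layout()
plt.show()
```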

17
Q

Summarize OLS Regression

A

It is a statistical technique used to estimate the relationships between a dependent variable and one or more independent variables. The OLS method minimizes the sum of the squared differences between the observed and predicted values.
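A minimal end-to-end OLS fit in Python, assuming statsmodels; the data are simulated and the variable names are placeholders.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one response and two predictors.
rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 2))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

# OLS picks the coefficients that minimise the sum of squared residuals.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # coefficients, R^2, F-test, t-tests in one table
```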

18
Q

Summarize Model interpretation

A

We examined the coefficients, which quantify the effect size of each predictor. A coefficient indicates the expected change in the response variable for a one-unit change in the predictor, assuming other variables are held constant.
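A short sketch of reading coefficients this way, assuming statsmodels; the plant-growth scenario and numbers are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: growth predicted from sun hours and watering (litres/week).
rng = np.random.default_rng(6)
n = 40
sun = rng.uniform(2, 10, n)
water = rng.uniform(1, 5, n)
growth = 1.0 + 1.2 * sun + 0.5 * water + rng.normal(scale=0.5, size=n)

fit = sm.OLS(growth, sm.add_constant(np.column_stack([sun, water]))).fit()
b0, b_sun, b_water = fit.params

# b_sun: expected change in growth for one extra hour of sun, holding watering constant.
print(f"intercept = {b0:.2f}, sun = {b_sun:.2f}, water = {b_water:.2f}")
```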

19
Q

Summarize the goodness of fit

A

The R^2 value was discussed as a measure of how well our model explains the variance in the response variable. A high R^2 value suggests that the model captures a large portion of that variance.