Module 10 Flashcards
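Association between 2 numerical variables, controlling for a categorical variable
Can use a scatterplot with the points color-coded by the level of the categorical variable
Association between 2 numerical variables, controlling for a numerical variable
Can use a scatterplot with the points shaded along a color gradient (or sized) by the value of the third numerical variable
Association between a numerical and a categorical variable, controlling for another categorical variable
Can plot side-by-side boxplots or violin plots, grouped and color-coded by the level of the second categorical variable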
Simple OLS regression model:
yhat (predicted response variable) = intercept + slope * explanatory variable
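A minimal sketch of fitting this line by ordinary least squares, using only the Python standard library. The data values and the names fit_simple_ols, x, and y are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The least-squares line passes through the point (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: explanatory variable x, response variable y
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]  # predicted responses
```

Each entry of y_hat is the intercept plus the slope times the matching x value, which is the formula on this card.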
Reference/baseline level
The level of a categorical explanatory variable that is not assigned an indicator variable (an indicator variable takes the value 0 or 1)
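A small sketch of how indicator variables and the reference level relate. The helper make_indicators and the levels "A", "B", "C" are hypothetical, chosen only for illustration.

```python
def make_indicators(values, reference):
    """Build one 0/1 indicator column for every level except the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical categorical explanatory variable with three levels
group = ["A", "B", "C", "A", "B"]
indicators = make_indicators(group, reference="A")
# Rows with level "A" are 0 in every indicator column,
# which is what identifies "A" as the reference level.
```

A variable with three levels gets two indicator variables here; the reference level is the one left without its own column.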
Interpreting a numerical variable slope in a multiple linear regression model
“All else held equal, if we increase the given explanatory variable by 1, we expect the predicted response variable to increase/decrease by NUMBER, on average.”
Formal definition of the indicator variable slope
“All else held equal, we expect the predicted response variable value that corresponds to the given indicator variable level to be NUMBER higher/lower than the reference level, on average”
Formal definition of the intercept
“For an observation in which all numerical explanatory variable and indicator variable values are 0, we expect the predicted response variable value to be our intercept value, on average.”
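The three interpretations above (numerical slope, indicator slope, intercept) can be checked against a toy model. This is only a sketch: the coefficients below are hypothetical and were not estimated from any data.

```python
# Hypothetical fitted coefficients (not estimated from real data)
INTERCEPT = 10.0
SLOPE_X = 2.5       # slope of the numerical explanatory variable x
SLOPE_IS_B = -4.0   # slope of the indicator for level "B" (reference level is "A")


def predict(x, is_b):
    """Predicted response for a numerical value x and a 0/1 indicator is_b."""
    return INTERCEPT + SLOPE_X * x + SLOPE_IS_B * is_b


# Numerical slope: raise x by 1 while holding the indicator fixed
change_per_unit_x = predict(7, 1) - predict(6, 1)

# Indicator slope: compare level "B" with the reference level at the same x
b_vs_reference = predict(6, 1) - predict(6, 0)

# Intercept: prediction when every explanatory and indicator variable is 0
at_all_zero = predict(0, 0)
```

Under these assumed coefficients, change_per_unit_x equals SLOPE_X, b_vs_reference equals SLOPE_IS_B, and at_all_zero equals INTERCEPT.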
When to use interaction terms
If you observe different slopes between a given numerical explanatory variable and the response variable for different levels of a categorical explanatory variable
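A sketch of what an interaction term does, assuming one numerical variable x, one indicator is_b, and their product as the interaction term. The coefficient values are hypothetical.

```python
# Hypothetical coefficients for:
#   y_hat = B0 + B1*x + B2*is_b + B3*(x * is_b)
B0, B1, B2, B3 = 5.0, 1.5, 2.0, -0.5


def predict_with_interaction(x, is_b):
    """Predicted response from a model that includes an x-by-indicator interaction."""
    return B0 + B1 * x + B2 * is_b + B3 * (x * is_b)


# Slope of x within the reference level (is_b = 0)
slope_reference = predict_with_interaction(1, 0) - predict_with_interaction(0, 0)

# Slope of x within level "B" (is_b = 1)
slope_level_b = predict_with_interaction(1, 1) - predict_with_interaction(0, 1)
```

The slope of x is B1 for the reference level and B1 + B3 for the other level, so the interaction term is what lets the two levels have different slopes.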
Residual
actual response value - predicted response value
Training dataset
Used to train (fit) the machine learning model; usually about 80% of the full dataset
Test dataset
Used to test the machine learning model that has been fit with the training dataset; we may calculate the RMSE of the test dataset
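A minimal sketch of an 80/20 train/test split using only the standard library. The function train_test_split here is a hypothetical helper written for this card (not the scikit-learn function of the same name), and the rows are made-up data.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of (explanatory, response) pairs
rows = [(i, 2 * i + 1) for i in range(10)]
train, test = train_test_split(rows)
```

The model would be fit on train only, and test would be held back to measure how well that fitted model predicts observations it has not seen.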
RMSE
Root mean square error (RMSE): roughly the average size of the model's prediction error (residual) across the observations in the dataset. If we had no model error for any of our observations, RMSE = 0; the closer the RMSE is to 0, the better
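A sketch of the RMSE calculation, assuming the version that divides the sum of squared residuals by the number of observations (some courses divide by a degrees-of-freedom term instead). The values are made up.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared residual (actual - predicted)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))


# Hypothetical actual and predicted response values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

model_rmse = rmse(actual, predicted)
perfect_rmse = rmse(actual, actual)  # every residual is 0, so RMSE is 0
```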
SSE
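Sum of squared errors (SSE): the sum of the squared residuals, which is the quantity the OLS model minimizes when choosing its intercept and slopes; the amount of response variable variability in the dataset that is not explained by the model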
SST
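Sum of squares total (SST): the total amount of response variable variability in the dataset, measured as the sum of the squared differences between each response value and the mean response value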
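A short sketch computing SSE and SST side by side. The actual and predicted values are made up and are not the output of a fitted model.

```python
def sse(actual, predicted):
    """Sum of squared residuals: variability not explained by the model."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def sst(actual):
    """Sum of squared deviations from the mean: total response variability."""
    mean = sum(actual) / len(actual)
    return sum((a - mean) ** 2 for a in actual)


# Hypothetical actual and predicted response values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

unexplained = sse(actual, predicted)
total = sst(actual)
```

SSE compares each actual value with the model's prediction, while SST compares each actual value with the mean response and does not depend on the model at all.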