Module 10 Flashcards

1
Q

Association between 2 numerical variables, controlling for a categorical variable

A

Can use a scatterplot (e.g., with the points colored by the levels of the categorical control variable)
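A minimal sketch of this kind of plot, assuming seaborn and pandas are available; the DataFrame and the column names x, y, and group are hypothetical:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical example data: two numerical variables and a categorical control variable.
    df = pd.DataFrame({
        "x": [1, 2, 3, 4, 5, 6],
        "y": [2.1, 2.9, 4.2, 3.8, 5.1, 6.0],
        "group": ["A", "A", "A", "B", "B", "B"],
    })

    # Color the points by the categorical control variable.
    sns.scatterplot(data=df, x="x", y="y", hue="group")
    plt.show()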

2
Q

Association between 2 numerical variables - controlling for a numerical variable

A

Can use a scatterplot (e.g., with the points colored along a gradient by the numerical control variable)
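A minimal sketch, again assuming seaborn and pandas; the columns x, y, and the numerical control column z are hypothetical (a numeric hue is drawn as a color gradient):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical example data: two numerical variables plus a numerical control variable.
    df = pd.DataFrame({
        "x": [1, 2, 3, 4, 5, 6],
        "y": [2.0, 3.1, 3.9, 5.2, 5.8, 7.1],
        "z": [10, 12, 15, 20, 24, 30],  # numerical control variable
    })

    # A numeric hue colors each point along a gradient for the control variable.
    sns.scatterplot(data=df, x="x", y="y", hue="z")
    plt.show()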

3
Q

Association between a numerical and categorical variable - controlling for another categorical variable

A

Can plot a side-by-side boxplot/violin plot visualization, grouped by the control variable
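A minimal sketch assuming seaborn and pandas; the columns score (numerical), treatment (categorical of interest), and site (categorical control) are hypothetical, and sns.violinplot could be swapped in for sns.boxplot:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical example data: a numerical response, the categorical variable of
    # interest, and a second categorical control variable.
    df = pd.DataFrame({
        "score":     [3.1, 4.2, 2.8, 5.0, 4.4, 3.9, 5.5, 4.8],
        "treatment": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "site":      ["X", "X", "Y", "Y", "X", "X", "Y", "Y"],
    })

    # Side-by-side boxplots of the numerical variable across treatment levels,
    # split by the categorical control variable.
    sns.boxplot(data=df, x="treatment", y="score", hue="site")
    plt.show()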

4
Q

Simple OLS regression model:

A

yhat (predicted response variable) = intercept + slope * explanatory variable
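A minimal sketch of fitting this model, assuming statsmodels and pandas are available; the data and column names x and y are hypothetical:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical example data with one numerical explanatory variable.
    df = pd.DataFrame({
        "x": [1, 2, 3, 4, 5, 6, 7, 8],
        "y": [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7],
    })

    # Fit yhat = intercept + slope * x by ordinary least squares.
    model = smf.ols("y ~ x", data=df).fit()
    print(model.params)  # intercept and slope estimates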

5
Q

Reference/baseline level

A

The level of a categorical explanatory variable that is not assigned its own indicator variable (an indicator variable takes the values 0/1); the other levels are compared against it
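A minimal sketch assuming statsmodels and pandas; the categorical column group and its levels are hypothetical. The formula interface builds 0/1 indicators for every level except the reference level, whose effect is absorbed into the intercept:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical example data with a categorical explanatory variable.
    df = pd.DataFrame({
        "y":     [2.0, 2.5, 3.1, 4.0, 4.6, 5.2],
        "group": ["A", "A", "B", "B", "C", "C"],
    })

    # Indicator columns are created for "B" and "C" only; "A" (the first level)
    # becomes the reference/baseline level.
    model = smf.ols("y ~ group", data=df).fit()
    print(model.params)  # Intercept, group[T.B], group[T.C]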

6
Q

interpreting a numerical variable slope in a multiple linear regression model

A

“All else held equal, by increasing the given explanatory variable by 1 unit, we expect the predicted response variable to increase/decrease by NUMBER, on average.”

7
Q

Formal definition of the indicator variable slope

A

“All else held equal, we expect the predicted response variable value that corresponds to the given indicator variable level to be NUMBER higher/lower than the reference level, on average”

8
Q

Formal def of the intercept

A

“We expect the predicted response variable value that corresponds to the observation in which all explanatory and indicator variable values are 0 to be our intercept value, on average.”

9
Q

When to use interaction terms

A

If you observe different slopes between a given numerical explanatory variable and the response variable across the different levels of a categorical variable
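A minimal sketch assuming statsmodels and pandas; the columns x, y, and group are hypothetical. In the formula, x * group expands to x + group + x:group, where x:group is the interaction term that lets each group have its own slope:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical example data where the x-y slope may differ by group.
    df = pd.DataFrame({
        "x":     [1, 2, 3, 4, 1, 2, 3, 4],
        "y":     [2.0, 3.1, 4.2, 5.0, 2.1, 1.8, 1.2, 0.9],
        "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    })

    # The interaction term allows a different slope for x in each group.
    model = smf.ols("y ~ x * group", data=df).fit()
    print(model.params)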

10
Q

residual

A

actual - predicted
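A minimal sketch assuming statsmodels and pandas, with hypothetical data, showing residuals computed as actual minus predicted:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical example data.
    df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1.9, 3.2, 3.8, 5.1, 5.9]})
    model = smf.ols("y ~ x", data=df).fit()

    # residual = actual - predicted, for every observation
    residuals = df["y"] - model.predict(df)
    print(residuals)  # same values as model.resid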

11
Q

training dataset

A

Used to train (fit) the machine learning model; usually about 80% of the data
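A minimal sketch of an 80/20 split, assuming scikit-learn, pandas, and statsmodels are available; the data are hypothetical and the model is fit on the training portion only:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    import statsmodels.formula.api as smf

    # Hypothetical example data.
    df = pd.DataFrame({
        "x": range(1, 21),
        "y": [1.8, 3.1, 4.2, 4.9, 6.1, 6.8, 8.2, 8.9, 10.1, 10.8,
              12.2, 12.9, 14.1, 14.8, 16.2, 16.9, 18.1, 18.8, 20.2, 20.9],
    })

    # Hold out 20% for testing; train on the remaining 80%.
    train, test = train_test_split(df, test_size=0.2, random_state=42)

    # Fit the model on the training dataset only.
    model = smf.ols("y ~ x", data=train).fit()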

12
Q

Test dataset

A

Used to evaluate the machine learning model that was fit on the training dataset; for example, by calculating the RMSE on the test dataset

13
Q

RMSE

A

Root mean square error: roughly the average prediction error across the observations in the dataset. If the model had no error for any observation, RMSE would equal 0; the closer the RMSE is to 0, the better
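A minimal sketch with hypothetical data, assuming numpy, pandas, statsmodels, and scikit-learn: fit on the training split, predict on the held-out test split, and compute the test RMSE:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from sklearn.model_selection import train_test_split

    # Hypothetical example data and an 80/20 train/test split.
    df = pd.DataFrame({
        "x": range(1, 21),
        "y": [2.2, 3.9, 6.3, 8.1, 9.8, 12.4, 13.9, 16.2, 17.8, 20.1,
              22.3, 23.8, 26.2, 28.1, 29.7, 32.3, 33.9, 36.1, 37.9, 40.2],
    })
    train, test = train_test_split(df, test_size=0.2, random_state=42)

    # Fit on the training data, then measure error on the held-out test data.
    model = smf.ols("y ~ x", data=train).fit()
    test_pred = model.predict(test)

    # RMSE: square root of the mean squared difference between actual and predicted.
    rmse = np.sqrt(np.mean((test["y"] - test_pred) ** 2))
    print(rmse)  # closer to 0 is better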

14
Q

SSE

A

Sum of squared errors (SSE): the quantity that OLS fitting minimizes; the amount of response variable variability in the dataset that is not explained by the model

15
Q

SST

A

Sum of squares total (SST): the total amount of response variable variability in the dataset

16
Q

SSR

A

Sum of squares regression (SSR) = SST - SSE; the amount of response variable variability that is explained by the model

17
Q

R^2

A

The proportion of response variable variability that is explained by the model; R^2 = SSR/SST; the closer to 100%, the better
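A minimal sketch with hypothetical data, assuming numpy, pandas, and statsmodels, computing SSE, SST, SSR, and R^2 from a fitted OLS model:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical example data.
    df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                       "y": [2.1, 2.8, 4.3, 4.9, 6.2, 6.8]})
    model = smf.ols("y ~ x", data=df).fit()

    residuals = df["y"] - model.predict(df)

    sse = np.sum(residuals ** 2)                   # variability NOT explained by the model
    sst = np.sum((df["y"] - df["y"].mean()) ** 2)  # total variability in the response
    ssr = sst - sse                                # variability explained by the model
    r_squared = ssr / sst                          # proportion of variability explained

    print(r_squared)  # matches model.rsquared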

18
Q

Linear regression assumptions

A

LINE + no multicollinearity:

Required:
Linearity: the relationship between the X variables and the Y variable should be linear in form
The response variable is quantitative

For interpretable results:
Multicollinearity: no strong multicollinearity between the X variables

For the best model:
Independence: the true errors are independent
Normality: the true errors are normally distributed
Equal variance: the variance of Y at each combination of X values is equal

A residual-diagnostics sketch for checking these assumptions follows below.
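A minimal diagnostics sketch with hypothetical data, assuming statsmodels, pandas, and matplotlib: a residuals-vs-fitted plot for linearity and equal variance, a QQ plot for normality, and variance inflation factors for multicollinearity:

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical example data with two numerical explanatory variables.
    df = pd.DataFrame({
        "x1": [1, 2, 3, 4, 5, 6, 7, 8],
        "x2": [5, 3, 8, 1, 9, 2, 7, 4],
        "y":  [3.4, 3.4, 5.9, 4.6, 7.6, 6.5, 8.8, 8.5],
    })
    model = smf.ols("y ~ x1 + x2", data=df).fit()

    # Linearity / equal variance: residuals vs fitted values should show no pattern
    # and roughly constant spread.
    plt.scatter(model.fittedvalues, model.resid)
    plt.axhline(0)
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()

    # Normality: the QQ plot of the residuals should be roughly a straight line.
    sm.qqplot(model.resid, line="s")
    plt.show()

    # Multicollinearity: variance inflation factors for the X variables
    # (values far above roughly 5-10 suggest strong multicollinearity;
    # the constant's VIF can be ignored).
    X = sm.add_constant(df[["x1", "x2"]])
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    print(dict(zip(X.columns, vifs)))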