Exam 2 Flashcards
Correlation (definition, symbol, range, AKA)
A standardized measure that indicates how strongly two variables are related to each other.
Represented by r
Ranges from -1 to 1
AKA: Correlation Coefficient
Relationship between correlation and variability in the data
Inverse. As variability (scatter around the trend) in the data increases, the magnitude of the correlation decreases.
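A minimal sketch of both cards above (numpy assumed; the simulated data and names are illustrative): r is computed with np.corrcoef, and adding more noise to the same underlying line shrinks it.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)

    # Same underlying line y = 2x, with low vs. high noise (variability)
    y_low_noise  = 2 * x + rng.normal(0, 0.5, size=200)
    y_high_noise = 2 * x + rng.normal(0, 5.0, size=200)

    r_low  = np.corrcoef(x, y_low_noise)[0, 1]   # close to 1
    r_high = np.corrcoef(x, y_high_noise)[0, 1]  # much closer to 0
    print(r_low, r_high)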
Simple Linear (OLS) Regression Formula
Y = β0 + β1(X) + e
Y: the dependent variable; our best guess of Y given X
β0: intercept
β1: slope / regression coefficient
X: input
e: error term
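A minimal sketch of fitting this model (statsmodels assumed; the simulated data and variable names are illustrative, with true β0 = 3 and β1 = 1.5).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 3 + 1.5 * x + rng.normal(0, 1, size=100)   # true β0 = 3, β1 = 1.5

    X = sm.add_constant(x)          # adds the intercept column for β0
    fit = sm.OLS(y, X).fit()
    print(fit.params)               # estimated β0 and β1
    print(fit.resid[:5])            # e: residuals (y - ŷ)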
e (names, definition, interpretation)
Error term or Residual
The difference between the actual observed value of the dependent variable and the value predicted by the regression model.
Represents dispersion/variability → Inverse relationship with correlation
β1 (names, interpretation)
Slope or regression coefficient
For a one-unit increase in X, Y changes by β1 units
Ordinary Least Squares Regression Method
Calculate the line that minimizes the sum of the squared residuals (SSR)
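A sketch of the closed-form solution for the one-predictor case, i.e. the intercept and slope that minimize the sum of squared residuals (numpy assumed; data and names illustrative).

    import numpy as np

    def ols_simple(x, y):
        # β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  β0 = ȳ - β1·x̄
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        b0 = y.mean() - b1 * x.mean()
        return b0, b1

    rng = np.random.default_rng(2)
    x = rng.normal(size=50)
    y = 1 + 2 * x + rng.normal(0, 1, size=50)
    print(ols_simple(x, y))   # ≈ (1, 2): the line minimizing SSR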
OLS Regression Assumptions (5)
- There is a linear relationship
- The observations are independent
- The errors (e) are normally distributed with mean 0
- The errors are homoscedastic (the variance of the errors doesn't change across values of X)
- The dependent variable is a continuous numeric value
SST (name, definition, formula)
Total Variability
The total squared deviation of the observed y values from their mean
Σ(y - ȳ)²
SSE (name, definition, formula)
Explained Variability
The amount of variability from the mean that is explained by the model
Σ(ŷ - ȳ)²
SSR (name, definition, formula)
Residual Variability
The amount of variability from the mean that cannot be explained by the model
Σ(y - ŷ)²
R Squared (Formula and definition)
SSE/SST
The proportion of variance in Y that can be explained by X
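A sketch verifying the decomposition and R² numerically, using the SSE = explained / SSR = residual naming from these cards (numpy and statsmodels assumed; data simulated).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    y = 2 + 0.8 * x + rng.normal(0, 1, size=100)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    y_hat = fit.fittedvalues

    sst = np.sum((y - y.mean()) ** 2)      # total variability
    sse = np.sum((y_hat - y.mean()) ** 2)  # explained variability
    ssr = np.sum((y - y_hat) ** 2)         # residual variability

    print(np.isclose(sst, sse + ssr))      # SST = SSE + SSR
    print(sse / sst, fit.rsquared)         # R² = SSE / SST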
Significance Testing the Regression Coefficient
Testing whether the coefficient is significantly different from 0.
Tells us whether the relationship between X and Y is significant.
A larger coefficient (in magnitude) and lower residual variability will decrease the p-value.
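A sketch of the t-test on β1 (statsmodels assumed; data simulated): t = β1 / SE(β1), and the p-value tests H0: β1 = 0.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.normal(size=100)
    y = 1 + 0.5 * x + rng.normal(0, 1, size=100)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    # coefficient, its standard error, t statistic, and p-value for H0: β1 = 0
    print(fit.params[1], fit.bse[1], fit.tvalues[1], fit.pvalues[1])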
Multiple Regression Formula
Y = β0 + β1(X1) + β2(X2) +…+ βn(Xn) + e
βn in Multiple Regression
The effect of Xn on Y, HOLDING ALL OTHER VARIABLES CONSTANT
R2 in Multiple Regression
The proportion of the variance in Y that is explained by all independent variables in the model
Partial R2
The proportion of the variance in Y that is explained by one independent variable, HOLDING ALL OTHER VARIABLES CONSTANT
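A sketch (pandas + statsmodels formula API assumed; data and names illustrative) fitting a two-predictor model and computing a partial R² for x2 by comparing the full model to one with x2 dropped. The "full vs. reduced" formula used here is a common definition and is assumed to match the course's.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 1 + 2 * df.x1 + 0.5 * df.x2 + rng.normal(0, 1, size=200)

    full    = smf.ols("y ~ x1 + x2", data=df).fit()
    reduced = smf.ols("y ~ x1", data=df).fit()          # drop x2

    print(full.rsquared)                                # R²: variance explained by all predictors
    # Partial R² for x2: share of the variance left unexplained by the reduced
    # model that x2 accounts for, i.e. "holding x1 constant"
    partial_r2_x2 = (full.rsquared - reduced.rsquared) / (1 - reduced.rsquared)
    print(partial_r2_x2)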
Cohen’s D in Multiple Regression
Effect Size
Used when units are different among X variables.
< 0.2 = ignored
< 0.5 = small
< 0.8 = medium
< 1.3 = large
1.3+ = very large
Adjusted R2
Tells us the predictive power of the model for data outside the sample.
Decreases when a predictor is added that does not improve the model.
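A sketch of the standard adjusted-R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1), checked against statsmodels (data and names illustrative).

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = 1 + X @ np.array([1.0, 0.5, 0.0]) + rng.normal(0, 1, size=n)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    adj = 1 - (1 - fit.rsquared) * (n - 1) / (n - p - 1)
    print(adj, fit.rsquared_adj)   # the two values should match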
Potential Issues with Multiple Regressions (2)
Multicollinearity
Overfitting
Multicollinearity
High correlation between two or more predictor variables creates redundancy
VIF = 1: no effect of multicollinearity
VIF > 1: Moderate effect of multicollinearity
VIF > 5: High effect of multicollinearity
VIF > 10: major effect of multicollinearity (X should be removed)
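A sketch of computing VIF for each predictor with statsmodels (the design matrix must include the constant; data and names illustrative).

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(7)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(0, 0.1, size=200)   # nearly redundant with x1
    x3 = rng.normal(size=200)

    X = sm.add_constant(np.column_stack([x1, x2, x3]))
    # Skip column 0 (the constant); x1 and x2 should show VIFs well above 10
    print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])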
Overfitting
When a model has so many X variables that it becomes overly complex, learning idiosyncratic patterns of a particular sample that may not generalize to the general population
Indicated by a high Cohen's f2 but a low change in adjusted R2
Cohen’s f2
How much R2 changes when the variable is added to the model
Adjusted R2 Δ
How much adjusted R2 changes when the variable is added to the model
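A sketch of one common way to compute Cohen's f² for an added variable, f² = (R²_full - R²_reduced) / (1 - R²_full), alongside the change in adjusted R² (this formula is an assumption about the course's exact definition; pandas + statsmodels formula API assumed).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 1 + 2 * df.x1 + 0.2 * df.x2 + rng.normal(0, 1, size=200)

    full    = smf.ols("y ~ x1 + x2", data=df).fit()
    reduced = smf.ols("y ~ x1", data=df).fit()

    f2 = (full.rsquared - reduced.rsquared) / (1 - full.rsquared)   # Cohen's f² for x2
    adj_r2_change = full.rsquared_adj - reduced.rsquared_adj        # adjusted R² Δ
    print(f2, adj_r2_change)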
Bias/Variance Trade Off
High-bias (Underfitting): A simple linear regression model trying to fit a complex nonlinear relationship may fail to capture the data’s structure, leading to errors.
High-variance (Overfitting): A high-degree polynomial regression may perfectly fit the training data but perform poorly on new data due to capturing noise.
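A sketch of the trade-off with numpy polynomial fits (data and names illustrative): a degree-1 line underfits a curved relationship, while a high-degree polynomial tends to chase noise in the training data and do worse on held-out data.

    import numpy as np

    rng = np.random.default_rng(9)
    x_train = np.sort(rng.uniform(-3, 3, 30))
    y_train = np.sin(x_train) + rng.normal(0, 0.3, 30)
    x_test  = np.sort(rng.uniform(-3, 3, 30))
    y_test  = np.sin(x_test) + rng.normal(0, 0.3, 30)

    for degree in (1, 3, 10):
        coefs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
        test_mse  = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        # high degrees usually lower train_mse but not test_mse
        print(degree, round(train_mse, 3), round(test_mse, 3))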
Categorical Significance (Reference)
Intercept = Reference group
Coefficients = Difference in group mean from reference group mean
Null Hypothesis = Difference is 0
Categorical Significance (GLHT Pairwise)
Coefficients = difference between each group
Null Hypothesis: difference between each group is 0
Categorical Significance (No Intercept)
Coefficients = group means
Null Hypothesis: each group mean is 0
Why do we use GLHT instead of running multiple tests?
When we run multiple hypothesis tests at a time, there is a higher chance of encountering a Type I error. GLHT automatically adjusts the p-values to account for this.
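The course's GLHT examples are presumably run in R (glht from multcomp). A rough Python analogue of the same ideas (a stand-in, not the same procedure): reference coding via a formula, and all pairwise group comparisons with a multiplicity-adjusted test (Tukey's HSD here). pandas and statsmodels assumed; data and names illustrative.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(10)
    df = pd.DataFrame({
        "group": np.repeat(["A", "B", "C"], 30),
        "y": np.concatenate([rng.normal(5, 1, 30),   # true group means 5, 6, 8
                             rng.normal(6, 1, 30),
                             rng.normal(8, 1, 30)]),
    })

    # Reference coding: intercept = group A mean, coefficients = differences from A
    print(smf.ols("y ~ C(group)", data=df).fit().params)

    # Pairwise differences between all groups, with adjusted p-values
    print(pairwise_tukeyhsd(df["y"], df["group"]).summary())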