Exam 2 Flashcards

1
Q

Correlation (definition, symbol, range, AKA)

A

A standardized measure that indicates how strongly two variables are related to each other.

Represented by r

Ranges from -1 to 1

AKA: Correlation Coefficient
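
To make r concrete, a minimal sketch computing it in Python with numpy (the simulated x and y here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.7 * x + rng.normal(size=100)  # two positively related variables

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient, always in [-1, 1]
print(r)
```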

2
Q

Relationship between correlation and variability in the data

A

Inverse. As variability (scatter around the trend) in the data increases, the strength of the correlation decreases.

3
Q

Simple Linear (OLS) Regression Formula

A

Y = β0 + β1(X) + e

Y: the outcome; our best guess of Y given X is ŷ = β0 + β1(X)
β0: intercept
β1: slope / regression coefficient
X: input (predictor)
e: error term
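
A minimal sketch of fitting this model with statsmodels (simulated data; the true values β0 = 2 and β1 = 0.5 are chosen for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)  # Y = β0 + β1(X) + e

X = sm.add_constant(x)    # add the intercept column so β0 is estimated
fit = sm.OLS(y, X).fit()  # ordinary least squares
print(fit.params)         # estimates of [β0, β1]
```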

4
Q

e (names, definition, interpretation)

A

Error term or Residual

The difference between the actual observed value of the dependent variable and the value predicted by the regression model.

Represents dispersion/variability → Inverse relationship with correlation

5
Q

β1 (names, interpretation)

A

Slope or regression coefficient

For a one-unit increase in X, Y changes by β1 units, on average.

6
Q

Ordinary Least Squares Regression Method

A

Calculate the line that minimizes the sum of the squared residuals (SSR).
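
As a sketch, the line minimizing the sum of squared residuals has a closed form (slope = cov(x, y) / var(x)); the data below is simulated:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1 + 3 * x + rng.normal(size=50)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope that minimizes SSR
b0 = y.mean() - b1 * x.mean()                        # intercept
print(b0, b1)
print(np.polyfit(x, y, 1))  # same line: returns [slope, intercept]
```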

7
Q

OLS Regression Assumptions (5)

A
  • There is a linear relationship
  • The observations are independent
  • The errors (e) are normally distributed with mean 0
  • The errors are homoscedastic (the variance of the errors doesn't change)
  • The dependent variable is a continuous numeric value
8
Q

SST (name, definition, formula)

A

Total Variability

The sum of the squared differences between each observed y value and the mean ȳ.

SST = Σ(y − ȳ)²

9
Q

SSE (name, definition, formula)

A

Explained Variability

The amount of variability from the mean that is explained by the model.

SSE = Σ(ŷ − ȳ)²

10
Q

SSR (name, definition, formula)

A

Residual Variability

The amount of variability from the mean that cannot be explained by the model.

SSR = Σ(y − ŷ)²

11
Q

R Squared (Formula and definition)

A

R2 = SSE/SST

The proportion of variance in Y that can be explained by X
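
A sketch verifying the decomposition numerically, using this deck's naming (SSE = explained, SSR = residual; some textbooks swap those two labels); the data is simulated:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 4 + 2 * x + rng.normal(size=200)

b1, b0 = np.polyfit(x, y, 1)           # fitted slope and intercept
y_hat = b0 + b1 * x                    # model predictions ŷ

sst = np.sum((y - y.mean()) ** 2)      # total variability
sse = np.sum((y_hat - y.mean()) ** 2)  # explained variability
ssr = np.sum((y - y_hat) ** 2)         # residual variability

print(np.isclose(sst, sse + ssr))      # True: SST = SSE + SSR
print(sse / sst)                       # R squared
```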

12
Q

Significance Testing the Regression Coefficient

A

Testing whether the coefficient is significantly different from 0.

Tells us whether the relationship between X and Y is significant.

A larger coefficient (relative to its standard error) and lower residual variability both decrease the p-value.
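
One quick way to run this test in Python is scipy's linregress, which reports a two-sided p-value for the null hypothesis β1 = 0 (data simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=80)
y = 0.4 * x + rng.normal(size=80)  # true slope is nonzero

res = stats.linregress(x, y)
print(res.slope, res.pvalue)  # small p-value: reject H0 that β1 = 0
```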

13
Q

Multiple Regression Formula

A

Y = β0 + β1(X1) + β2(X2) + … + βn(Xn) + e
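
A sketch with two predictors via statsmodels' formula interface (the column names x1 and x2 are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1 + 2 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)  # β0, β1, β2: each βk holds the other predictors constant
```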

14
Q

βn in Multiple Regression

A

The effect of Xn on Y, HOLDING ALL OTHER VARIABLES CONSTANT

15
Q

R2 in Multiple Regression

A

The proportion of the variance in Y that is explained by all independent variables in the model

16
Q

Partial R2

A

The proportion of the variance in Y that is explained by one independent variable, HOLDING ALL OTHER VARIABLES CONSTANT

17
Q

Cohen’s D in Multiple Regression

A

Effect Size

Used to compare effect sizes when the X variables are measured in different units.

d < 0.2 = negligible (ignore)
0.2–0.5 = small
0.5–0.8 = medium
0.8–1.3 = large
1.3+ = very large
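
A sketch of the standard pooled-SD version of Cohen's d for two groups (the group data is simulated):

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.normal(loc=0.0, size=40)
b = rng.normal(loc=0.8, size=40)

# Pooled standard deviation across the two groups
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                    / (len(a) + len(b) - 2))
d = (b.mean() - a.mean()) / pooled_sd
print(d)  # compare against the thresholds above
```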

18
Q

Adjusted R2

A

Tells us the predictive power of the model for data outside the sample.

Decreases when a predictor is added that does not improve the model.
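
A sketch of that penalty in action, assuming a made-up noise predictor that is unrelated to y:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"x": rng.normal(size=60)})
df["y"] = 2 * df["x"] + rng.normal(size=60)
df["noise"] = rng.normal(size=60)  # predictor with no real relationship to y

base = smf.ols("y ~ x", data=df).fit()
full = smf.ols("y ~ x + noise", data=df).fit()
print(base.rsquared, full.rsquared)          # plain R2 never decreases
print(base.rsquared_adj, full.rsquared_adj)  # adjusted R2 can decrease
```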

19
Q

Potential Issues with Multiple Regression (2)

A

Multicollinearity
Overfitting

20
Q

Multicollinearity

A

High correlation between two or more predictor variables creates redundancy

VIF = 1: no effect of multicollinearity
VIF > 1: moderate effect of multicollinearity
VIF > 5: high effect of multicollinearity
VIF > 10: major effect of multicollinearity (X should be removed)
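
A sketch of computing VIFs with statsmodels, assuming a simulated predictor x2 that is nearly a copy of x1:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1, so redundant

X = sm.add_constant(np.column_stack([x1, x2]))
for i in (1, 2):  # skip column 0, the constant
    print(variance_inflation_factor(X, i))  # both VIFs come out large here
```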

21
Q

Overfitting

A

When a model has so many X variables that it becomes overly complex, learning idiosyncratic patterns of a particular sample that may not generalize to the population.

Indicated by a high Cohen's f2 but a low change in adjusted R2.

22
Q

Cohen’s f2

A

How much R2 changes when a variable is added to the model.

f2 = (R2_full − R2_reduced) / (1 − R2_full)

23
Q

Adjusted R2 Δ

A

How much adjusted R2 changes when the variable is added to the model
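
A sketch computing both diagnostics for one added variable, using the usual f2 formula (change in R2 divided by 1 − R2 of the full model); the data is simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame({"x1": rng.normal(size=120), "x2": rng.normal(size=120)})
df["y"] = 1.5 * df["x1"] + 0.8 * df["x2"] + rng.normal(size=120)

reduced = smf.ols("y ~ x1", data=df).fit()
full = smf.ols("y ~ x1 + x2", data=df).fit()

f2 = (full.rsquared - reduced.rsquared) / (1 - full.rsquared)  # Cohen's f2
delta_adj = full.rsquared_adj - reduced.rsquared_adj           # adjusted R2 Δ
print(f2, delta_adj)
```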

24
Q

Bias/Variance Trade Off

A

High-bias (Underfitting): A simple linear regression model trying to fit a complex nonlinear relationship may fail to capture the data’s structure, leading to errors.

High-variance (Overfitting): A high-degree polynomial regression may perfectly fit the training data but perform poorly on new data due to capturing noise.
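
A sketch of both failure modes with polynomial fits on a simulated nonlinear relationship (degrees 1 and 15 are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
f = lambda x: np.sin(2 * np.pi * x)  # nonlinear ground truth
x_train = np.linspace(0, 1, 20)
y_train = f(x_train) + rng.normal(scale=0.3, size=20)
x_test = rng.uniform(0, 1, size=20)
y_test = f(x_test) + rng.normal(scale=0.3, size=20)

for degree in (1, 15):  # underfit (high bias) vs. overfit (high variance)
    coefs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, mse)  # the degree-15 fit tends to do worse out of sample
```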

25
Q

Categorical Significance (Reference)

A

Intercept = Reference group
Coefficients = Difference in group mean from reference group mean
Null Hypothesis = Difference is 0
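
A sketch of reference coding with statsmodels' formula API; with C(group), the first level alphabetically ("A" here) becomes the reference (the groups and means are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"group": np.repeat(["A", "B", "C"], 30)})
df["y"] = df["group"].map({"A": 1.0, "B": 2.0, "C": 3.5}) + rng.normal(size=90)

fit = smf.ols("y ~ C(group)", data=df).fit()
print(fit.params)   # intercept = mean of A; other terms = difference from A
print(fit.pvalues)  # tests H0: each difference from the reference is 0
```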

26
Q

Categorical Significance (GLHT Pairwise)

A

Coefficients = difference between each pair of groups
Null Hypothesis: the difference between each pair of groups is 0

27
Q

Categorical Significance (No Intercept)

A

Coefficients = group means
Null Hypothesis: each group mean is 0
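
A sketch of the no-intercept coding (the "- 1" in the formula drops the intercept), so each coefficient is a group mean (simulated groups):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
df = pd.DataFrame({"group": np.repeat(["A", "B"], 30)})
df["y"] = df["group"].map({"A": 1.0, "B": 2.0}) + rng.normal(size=60)

fit = smf.ols("y ~ C(group) - 1", data=df).fit()
print(fit.params)   # one coefficient per group = that group's mean
print(fit.pvalues)  # tests H0: each group mean is 0
```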

28
Q

Why do we use GLHT instead of running multiple tests?

A

When we run multiple hypothesis tests at once, there is a higher chance of a Type I error. GLHT adjusts the p-values to account for this.
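
GLHT here refers to general linear hypothesis testing (e.g. R's multcomp::glht). A Python analog for the pairwise case is Tukey's HSD in statsmodels, which likewise controls the family-wise error rate; a sketch on simulated groups:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(10)
df = pd.DataFrame({"group": np.repeat(["A", "B", "C"], 25)})
df["y"] = df["group"].map({"A": 0.0, "B": 0.5, "C": 1.0}) + rng.normal(size=75)

# All pairwise comparisons with adjusted p-values (family-wise error control)
print(pairwise_tukeyhsd(df["y"], df["group"], alpha=0.05))
```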