Extensions Of Multiple Regression Flashcards
What is a high-leverage point vs. an outlier?
- a high-leverage point is an extreme value of one of the independent variables
- an outlier is an extreme value of the dependent variable
What are the 2 methods for detecting influential data points, and which method is for high-leverage points vs. outliers? (LS)
- leverage measure (high-leverage points / independent variables)
- studentized residuals (outliers / dependent variable)
What’s the formula for leverage when trying to detect influential data points, and when is an observation considered potentially influential? What is the leverage measure for?
hii = leverage of observation i (in simple regression, hii = 1/n + (xi − x̄)² / Σ(xj − x̄)²)
if hii > 3*(k+1)/n an observation is considered potentially influential
k = number of independent variables
n = number of observations
- for identifying high-leverage points. It takes a value between 0 and 1 that quantifies the distance between the ith value of an independent variable and its mean value. A higher value indicates more influence.
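A minimal Python sketch of the leverage rule above, using made-up data with one independent variable (this uses the simple-regression form of the hat-matrix diagonal; all values here are hypothetical):

```python
# Made-up data: one independent variable (k = 1),
# where the last x value is far from the mean (a high-leverage point).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
n, k = len(x), 1
mean_x = sum(x) / n
ss_x = sum((xi - mean_x) ** 2 for xi in x)

# Simple-regression leverage: h_ii = 1/n + (x_i - mean)^2 / sum of squares,
# always between 0 and 1; higher = farther from the mean = more influence.
leverage = [1 / n + (xi - mean_x) ** 2 / ss_x for xi in x]

cutoff = 3 * (k + 1) / n          # rule-of-thumb threshold: 3(k+1)/n
flags = [h > cutoff for h in leverage]  # only the extreme x is flagged
```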
What are the steps of the studentized residuals method for detecting influential data points?
- run the regression on the full data set
- delete 1 observation and re-run the regression
- residual (ei) = observed value of the deleted observation − value predicted for it by the re-run regression
- repeat so that a residual is calculated for each observation in the data set
- calculate the standard deviation of the residuals
- studentized residual (ti) = ei / standard deviation of the residuals
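The steps above can be sketched in Python with made-up data; this is a simplified version that follows the flashcard's steps directly (the textbook formula additionally adjusts the denominator for the deleted point's leverage):

```python
def fit(xs, ys):
    """Ordinary least squares fit of y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

def studentized_residuals(xs, ys):
    ts = []
    for i in range(len(xs)):
        # delete observation i and re-run the regression
        x_d = xs[:i] + xs[i + 1:]
        y_d = ys[:i] + ys[i + 1:]
        b0, b1 = fit(x_d, y_d)
        # residuals of the remaining observations around the re-fit line
        resid = [y - (b0 + b1 * x) for x, y in zip(x_d, y_d)]
        s = (sum(e ** 2 for e in resid) / (len(resid) - 2)) ** 0.5
        # residual of the deleted point vs. the re-fit line, in std-dev units
        e_i = ys[i] - (b0 + b1 * xs[i])
        ts.append(e_i / s)
    return ts

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.0, 16.1, 17.9, 40.0]  # last y is an outlier
t = studentized_residuals(xs, ys)
flagged = [abs(ti) > 3 for ti in t]  # only the last observation is flagged
```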
What does the studentized residual measure?
ti (the studentized residual) measures how many standard deviations an observation lies from the regression line
When is the studentized residual considered an outlier?
if the absolute value of the studentized residual |ti| (if negative, turn it positive) is greater than 3, or greater than the critical value of the t-statistic
degrees of freedom for the t-statistic = n − k − 2
n = # of observations
k = # of independent variables
What are the 2 values that dummy variables take on?
1 if true
0 if false
What do you need to use when there are more than 2 categories, and why?
- use n − 1 dummy variables to avoid multicollinearity. With n − 1 dummy variables, one category (the base category) is represented implicitly by the absence of all the other dummy variables, which avoids the multicollinearity problem
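A small illustration with a hypothetical 3-category variable: only 3 − 1 = 2 dummies are created, and the base category is encoded by both dummies being 0 (category names here are made up):

```python
# Hypothetical 3-category variable ("small", "medium", "large") encoded
# with 3 - 1 = 2 dummy variables; "small" is the omitted base category.
def encode(size):
    return {"d_medium": int(size == "medium"),
            "d_large": int(size == "large")}

rows = [encode(s) for s in ["small", "medium", "large"]]
# "small" -> both dummies 0: represented by the absence of the others,
# so the three columns never sum to a constant (no perfect multicollinearity)
```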
What are the 3 types of dummy variables?
- intercept dummies
- slope dummies
- intercept and slope dummies
How does adding an intercept dummy change a single linear regression formula?
single linear regression
y = b0 + b1x
single linear regression with intercept dummy = 0
y = b0 + b1x
single linear regression with intercept dummy = 1
y = b0 + d0 + b1x (d0 = intercept-dummy coefficient)
How does adding an intercept dummy variable affect the regression on a graph?
if the intercept dummy variable is 1, the intercept shifts and the regression line moves up (or down, if the dummy coefficient is negative), but it stays parallel to the simple linear regression line
How does adding a slope dummy change a single linear regression formula?
single linear regression
y = b0 + b1x
single linear regression with slope dummy = 0
y = b0 + b1x
single linear regression with slope dummy = 1
y = b0 + (b1 + d1)x
d1 = slope-dummy coefficient
How does adding a slope dummy variable affect the regression on a graph?
the intercept stays the same, but with the slope dummy the regression line becomes steeper (or flatter, if the dummy coefficient is negative)
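Putting both dummy types together, a sketch with made-up (not estimated) coefficients showing how dummy = 0 vs. dummy = 1 changes the fitted line:

```python
# Hypothetical fitted coefficients (illustrative values, not estimated)
b0, b1 = 2.0, 0.5   # baseline intercept and slope
d0, d1 = 1.0, 0.3   # intercept-dummy and slope-dummy coefficients

def predict(x, dummy):
    """y = b0 + d0*dummy + (b1 + d1*dummy) * x"""
    return b0 + d0 * dummy + (b1 + d1 * dummy) * x

# dummy = 0: baseline line; dummy = 1: shifted intercept AND steeper slope
y_base  = predict(10, 0)   # 2.0 + 0.5 * 10
y_dummy = predict(10, 1)   # (2.0 + 1.0) + (0.5 + 0.3) * 10
```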
When the p-value is greater than or less than 5%, when can you reject the null hypothesis and when can you not?
p-value > 0.05: fail to reject the null hypothesis
p-value < 0.05: reject the null hypothesis (indicates a significant result)
What’s the formula for odds of an event occurring?
p / (1 − p)
p = probability
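A quick worked example of the odds formula:

```python
p = 0.75                # probability of the event
odds = p / (1 - p)      # 0.75 / 0.25 = 3.0, i.e. "3-to-1" odds
```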
What are log odds (logit) and why use them?
logistic regression (logit)
log odds = ln(p / (1 − p))
we use log odds because with regular odds (or a linear model fit directly to the probability) the predicted value can fall outside 0 and 1, whereas the log-odds transformation lets the regression line take on any real value while the implied probability p always stays between 0 and 1
What is the regression equation for logistic transformation or logit and how can it be reorganized to isolate the probability?
ln(p / (1 − p)) = b0 + b1X1 + b2X2 + b3X3 + ε
p = 1 / (1 + exp(−(b0 + b1X1 + b2X2 + b3X3)))
b = slope coefficients
X = values of the independent variables
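A sketch of the logistic transformation with made-up coefficients and observation values, checking that the two forms of the equation are consistent:

```python
import math

# Hypothetical coefficients and one observation's independent-variable values
b0, b1, b2, b3 = -1.0, 0.8, 0.5, -0.3
x1, x2, x3 = 1.2, 0.4, 2.0

# Log odds from the linear part of the model
log_odds = b0 + b1 * x1 + b2 * x2 + b3 * x3

# Isolating the probability: p = 1 / (1 + exp(-(linear part)))
p = 1 / (1 + math.exp(-log_odds))

# Round trip: ln(p / (1 - p)) recovers the log odds, and p stays in (0, 1)
```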
What is maximum likelihood estimation (MLE)?
- a statistical method for finding the parameters of a distribution that best fit a set of observed data; in essence, it chooses the parameter values (e.g. positioning the distribution over the data, usually near the mean) that make the observed data points as likely as possible
What is the likelihood ratio (LR) test, what is its formula, what does a high vs. low value mean, and what range can it take?
- the LR test assesses how well two statistical models fit a dataset by comparing their log-likelihoods (the probability of observing the data given each model). Essentially, it helps determine whether a more complex (unrestricted) model significantly improves the fit over a simpler (restricted) one
- LR = −2 × (log-likelihood of restricted model − log-likelihood of unrestricted model)
- log-likelihood values are always negative; a higher (less negative) log-likelihood = better fit
- the LR test statistic itself is always non-negative; a larger LR means the unrestricted model fits significantly better than the restricted one
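A worked example of the LR formula with hypothetical log-likelihood values:

```python
# Hypothetical log-likelihoods (always negative); the unrestricted model's
# log-likelihood is at least as high (less negative) as the restricted one's
ll_restricted = -120.5    # simpler model (fewer independent variables)
ll_unrestricted = -112.3  # fuller model

lr = -2 * (ll_restricted - ll_unrestricted)  # non-negative test statistic
# larger lr -> stronger evidence the extra variables improve the fit
```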