Logistic regression Flashcards
Normal regression equation
- Ŷ = bX + c
- This is the linear regression model equation; make sure you know it
- Ŷ is the predicted outcome variable; in logistic regression the analogous quantity is "the probability of having one outcome or another based on a nonlinear function of the best linear combination of predictors" (Tabachnick and Fidell)
- Y − Ŷ (observed minus predicted) gives the residuals
- where X is the predictor variable
- The slope of the line is b
- c is the intercept (the value of Ŷ when X = 0)
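A minimal Python sketch of the prediction and residual calculation above, with all numbers made up for illustration:

```python
import numpy as np

# Toy data (made-up values)
X = np.array([1.0, 2.0, 3.0, 4.0])   # predictor
Y = np.array([2.1, 3.9, 6.2, 7.8])   # observed outcome

b, c = 2.0, 0.1          # assumed slope and intercept
Y_hat = b * X + c        # predicted scores, on the same scale as Y
residuals = Y - Y_hat    # observed minus predicted
print(Y_hat, residuals)
```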
Types of research questions for logistic regression
• Can we predict the presence or absence of a disorder/disease?
• Can we predict an outcome using a set of predictors?
o How good is the model?
• Does an individual predictor increase or decrease the probability of an outcome?
o Related to the importance of the predictors
• Can be used for classification and prediction
• Simple categorical outcomes
o Can we predict the outcomes using categorical predictors?
How does logistic regression differ from ordinary least squares regression?
• OLS has 3 important characteristics:
o The model is linear
o Residuals are assumed to be normally and homogeneously distributed
o Predicted scores (Ŷ) are on the same scale as the data (Y)
• These characteristics don’t apply to logistic regression
o The model is not a linear prediction because the outcome is dichotomous; a 'logistic' function is better, as its sigmoidal shape fits the data better
o If OLS regression is used, the residuals will show non-normality and heteroscedasticity, violating important assumptions of that method
o The model predicts a probability value, which is on a different scale to the data (Y)
What is probability?
• Probability: the likelihood of an event occurring
o If p = .80, there is an 80% chance of that event occurring
What are predicted odds?
• Predicted odds: the probability of an event occurring divided by the probability of it not occurring
o Predicted odds = prob of event occurring / prob of event not occurring
o Following on from p = .80, the probability of it not occurring is .2 (i.e. 1 − the likelihood of it occurring)
o .8/.2 = 4
o This means the odds were 4:1 in favour of the event occurring
The logistic model gives pî, the estimated probability of an outcome occurring, so the predicted odds are pî/(1 − pî)
• Odds are asymmetric, so the observed odds ratio is not in the centre of its confidence interval; we can use the natural log of the odds instead
o Log of odds = Logit
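The p = .80 example from this card as a quick Python sketch:

```python
import math

p = 0.80                  # probability of the event occurring
odds = p / (1 - p)        # .8/.2 = 4, i.e. 4:1 in favour
logit = math.log(odds)    # natural log of the odds = logit
print(odds, logit)        # 4.0, ~1.386
```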
What is the odds ratio?
• The ratio of the odds of an event occurring across levels of another variable
o By how much do the odds of Y change as X increases by 1 unit?
o Essentially it is a ratio of ratios
o The odds ratio is the central measure of effect size here; a good way of measuring the strength of the relationship
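A quick sketch of an odds ratio computed from a made-up 2×2 table (e.g. disease status by exposure):

```python
# Made-up counts: rows = exposed/unexposed, columns = case/non-case
exposed_case, exposed_noncase = 40, 10
unexposed_case, unexposed_noncase = 20, 30

odds_exposed = exposed_case / exposed_noncase        # 4.0
odds_unexposed = unexposed_case / unexposed_noncase  # ~0.67
odds_ratio = odds_exposed / odds_unexposed           # a ratio of ratios: 6.0
print(odds_ratio)  # odds of being a case are 6x higher in the exposed group
```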
Logistic regression equation
pî = 1 / (1 + e^−(B₁X₁ + c))
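A direct translation of the equation into Python; the coefficient and intercept values here are assumed purely for illustration:

```python
import math

def predicted_probability(x, b1=0.7, c=-1.0):
    """Logistic function: maps the linear predictor onto a 0-1 probability."""
    return 1 / (1 + math.exp(-(b1 * x + c)))

print(predicted_probability(0))  # probability when X = 0 (~0.27 here)
print(predicted_probability(5))  # larger X pushes the probability toward 1
```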
What is pi?
• Our model is of pî rather than Ŷ
o pî is the estimated probability of the outcome occurring for case i (this is different from the predicted odds, which have their own equation)
Predicted odds vs logit
They are just transformations of each other
• Predicted odds: odds of being a case
o Odds = p/(1-p), which ranges from 0 to positive infinity
o When p is .50, the odds are 1 (even odds, 1:1)
.50/(1-.50) = .50/.50 = 1
o When p > .50, the odds >1
o Varies exponentially (not linearly) with the predictor(s): each unit increase in a predictor multiplies the odds rather than adding a constant amount
• Logit: natural logarithm of the odds
o Ranges from negative infinity to positive infinity
o Reflects odds of being a case but varies linearly with predictor(s)
o Not very interpretable
If p = .8, the odds = 4 but the logit = 1.386
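A sketch of the contrast: with assumed coefficient values, the logit increases in equal steps as X increases, while the odds are multiplied at each step:

```python
import numpy as np

B1, c = 0.7, -1.0        # assumed coefficient and intercept
X = np.arange(5)
logit = B1 * X + c       # linear in the predictor: equal steps of B1
odds = np.exp(logit)     # exponential in the predictor
p = odds / (1 + odds)    # back to probability, bounded between 0 and 1
print(logit)             # evenly spaced values
print(odds)              # each value is the previous one times e**B1
```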
2 kinds of regression coefficient in logistic regression
• Typical partial regression coefficients (B)
o Identical in function to OLS regression coefficients, but on the logit scale
o Indicates increment in the logit given unit increment in predictor
• Odds ratios (e^B)
o e^B indicates the amount by which the odds of being a case are multiplied given a unit increment in the predictor (or a change in level, if the predictor is categorical)
o If B = 0, e^B = 1, and the predictor has no relationship with the outcome
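A sketch of the relationship between B and the odds ratio (the value of B is assumed):

```python
import math

B = 0.7                    # partial regression coefficient (assumed value)
odds_ratio = math.exp(B)   # e**B: odds multiplied per unit increase in X (~2.01)
print(odds_ratio)

print(math.exp(0.0))       # B = 0 gives e**B = 1: no relationship
```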
Estimating parameters in logistic regression
• Logistic regression uses maximum likelihood estimation, which is an iterative solution
o Regression coefficients are estimated by trial-and-error and gradual adjustment
o Seeks to maximise the likelihood (L) of the observed values of Y, given the model and the observed values of the predictors
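A toy illustration of the iterative estimation on simulated data. This uses a simple gradient ascent on the log likelihood for clarity; real statistics packages typically use faster Newton-Raphson-type algorithms:

```python
import numpy as np

# Simulated data with assumed true values B1 = 0.7, c = -1.0
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < 1 / (1 + np.exp(-(0.7 * x - 1.0)))).astype(int)

# Start anywhere, then gradually adjust the coefficients uphill
# on the log likelihood until they stop improving
b1 = c = 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(b1 * x + c)))  # current predicted probabilities
    b1 += 0.1 * np.mean((y - p) * x)     # gradient of the log likelihood w.r.t. b1
    c += 0.1 * np.mean(y - p)            # gradient w.r.t. c
print(b1, c)                             # should land roughly near 0.7 and -1.0
```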
What is the log-likelihood?
• To evaluate the model, a log likelihood (LL) value can be calculated for each model we test
• The LL is a function of the probabilities of the observed and model-predicted outcomes for each case, summed over all cases
• We can directly compare the goodness-of-fit of different models using the log likelihoods
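A sketch of the LL calculation, spelling out the standard binary log-likelihood formula (which the notes don't write out); the outcomes and predicted probabilities are made up:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # observed outcomes
p = np.array([0.9, 0.2, 0.7, 0.6, 0.4])  # model-predicted probabilities

# For each case, the log of the probability the model assigned to the
# outcome that actually occurred, summed over all cases
LL = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(LL)  # closer to 0 = better fit; more negative = worse fit
```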
How is model fit tested in logistic regression?
- Log-likelihood ratio test: a test of model fit
- A significant likelihood ratio test tells us that the model is significantly worse with the corresponding predictor removed, so the predictor should be retained in the model. If non-significant, that predictor can probably be removed.
How does the log-likelihood ratio test work?
You won't be asked directly about this, but you need to know it for questions where you have to report results; it will help to be able to interpret model fit statistics
• In likelihood ratio test, we test the null deviance (including only the constant) against the model deviance (containing k predictors)
• As k increases, the difference between the null and model deviance will generally increase, reflecting improved model fit
• If there is no significant improvement in fit when we add the k predictors to the model, we need to question the inclusion of those predictors
• If there is no significant deterioration in fit when we remove k predictors from the model, then we need to question the inclusion of those predictors
o I.e. they are redundant in the context of this outcome variable
• Only accept more predictors if they significantly improve the fit of the model
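A sketch of the deviance comparison with made-up log likelihood values; scipy is one way to get the chi-square p value (the test has df = k):

```python
from scipy import stats

LL_null = -120.5   # log likelihood of the constant-only model (made-up)
LL_model = -101.2  # log likelihood of the model with k predictors (made-up)
k = 3              # number of predictors added

# Deviance = -2 * LL, so the likelihood ratio statistic is the
# difference between the null and model deviances
chi_sq = -2 * (LL_null - LL_model)     # 38.6 here
p_value = stats.chi2.sf(chi_sq, df=k)
print(chi_sq, p_value)                 # significant: retain the k predictors
```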