module 11 Flashcards
logistic regression model
Predicts a categorical response variable with 2 levels
simple logistic regression model equation
phat (probability) = 1/(1 + e^-(intercept + slope*explanatory variable))
odds = phat/(1 - phat) = e^(intercept + slope*explanatory variable)
log odds = log(phat/(1 - phat)) = intercept + slope*explanatory variable
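A minimal sketch of the three quantities on this card, to check that they agree with each other. The intercept, slope, and x value are made-up numbers for illustration, not from any fitted model.

```python
import math

def predicted_probability(intercept, slope, x):
    """phat = 1 / (1 + e^-(intercept + slope * x))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def odds(phat):
    """odds = phat / (1 - phat)."""
    return phat / (1.0 - phat)

def log_odds(phat):
    """log odds = log(phat / (1 - phat))."""
    return math.log(odds(phat))

# Hypothetical coefficients and explanatory value (assumptions for this sketch).
intercept, slope, x = -1.5, 0.8, 2.0

phat = predicted_probability(intercept, slope, x)
print("phat:", phat)
print("odds:", odds(phat))
print("log odds:", log_odds(phat))
```

Up to floating-point rounding, the log odds printed here should equal intercept + slope*x, which is the linear part of the model.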
intercept interpretation
“We predict that the odds of INCLUDE ALL CHARACTERISTICS FROM VARIABLE being SUCCESSFUL are NUMBER.”
Interpret e^intercept: the baseline odds
Numerical Explanatory Variable Slope interpretation
“All else held equal, if we were to increase the VARIABLE by 1, then we would expect the odds of SUCCESS to change by a multiplicative factor of NUMBER, on average.”
Interpret e^slope: odds multiplier of the explanatory variable
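A small sketch of what "baseline odds" and "odds multiplier" mean numerically. The coefficients are hypothetical values chosen only for illustration.

```python
import math

def odds_at(intercept, slope, x):
    """Predicted odds at a given x: e^(intercept + slope * x)."""
    return math.exp(intercept + slope * x)

# Hypothetical fitted coefficients (assumptions for this sketch).
intercept, slope = -1.5, 0.8

baseline_odds = math.exp(intercept)   # odds when the explanatory variable is 0
odds_multiplier = math.exp(slope)     # factor applied to the odds per +1 in x

# Raising x by 1 (here from 2 to 3) should multiply the odds by e^slope.
ratio = odds_at(intercept, slope, 3.0) / odds_at(intercept, slope, 2.0)

print("baseline odds:", baseline_odds)
print("odds multiplier:", odds_multiplier)
print("odds(x=3) / odds(x=2):", ratio)
```

The same ratio comes out for any starting x, which is why the slope is read as a multiplicative effect on the odds rather than an additive one.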
Indicator variable slope interpretation
“All else held equal, we expect the odds that INTERACTION TERM INCLUDED is SUCCESS to be ODDS RATIO times the odds that BASELINE LEVEL is SUCCESS, on average.”
Need to calculate the log of the ratio of these 2 odds
log(odds_yes/odds_no) = log(odds_yes) - log(odds_no)
Convert back by exponentiating to get the odds ratio: e^(log(odds_yes) - log(odds_no)) = odds_yes/odds_no
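A sketch of the log-then-convert-back steps for an indicator variable, assuming a model with one indicator (1 = yes, 0 = no, the baseline level). The coefficient values are hypothetical.

```python
import math

# Hypothetical coefficients (assumptions for this sketch).
intercept = -0.4        # log odds of success for the baseline level (indicator = 0)
indicator_slope = 0.9   # coefficient on the indicator variable

log_odds_no = intercept                     # indicator = 0
log_odds_yes = intercept + indicator_slope  # indicator = 1

# Log of the ratio of the two odds is the difference of the two log odds.
log_odds_ratio = log_odds_yes - log_odds_no

# Exponentiate to convert back to the odds ratio.
odds_ratio = math.exp(log_odds_ratio)

print("log odds ratio:", log_odds_ratio)
print("odds ratio:", odds_ratio)
```

The intercept cancels in the subtraction, so the odds ratio is e^slope for the indicator.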
Pseudo R^2
R^2 = 1 - LLF_full/LLF_null
LLF_full
The highest possible log likelihood function value that we could achieve with intercept and slope
Ideally LF_full = 1, so LLF_full = ln(LF_full) = ln(1) = 0
The closer to 0, the better the fit for the training set
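A rough sketch of the pseudo R^2 arithmetic. The outcomes and the "full model" probabilities below are invented numbers, not the output of a real fit. The null model is taken to be intercept-only, so it predicts the overall proportion of successes for every observation.

```python
import math

def log_likelihood(y, p):
    """Sum of log(p) for successes and log(1 - p) for failures."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def pseudo_r_squared(y, p_full):
    """Pseudo R^2 = 1 - LLF_full / LLF_null."""
    p_null = sum(y) / len(y)  # intercept-only model: same probability for everyone
    llf_full = log_likelihood(y, p_full)
    llf_null = log_likelihood(y, [p_null] * len(y))
    return 1.0 - llf_full / llf_null

# Hypothetical outcomes and predicted probabilities (assumptions for this sketch).
y = [1, 0, 1, 1, 0, 0]
p_full = [0.9, 0.2, 0.7, 0.8, 0.3, 0.1]

print("LLF_full:", log_likelihood(y, p_full))
print("pseudo R^2:", pseudo_r_squared(y, p_full))
```

Both log likelihoods are negative, so an LLF_full closer to 0 than LLF_null makes the ratio smaller and the pseudo R^2 larger.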
Classifier model
A set of rules that decides which of a set of categories an observation belongs to, on the basis of a training set of data containing observations whose category membership is known
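One possible rule, sketched here as an illustration: turn a predicted probability into a category by comparing it with a cutoff. The 0.5 threshold and the probabilities are assumptions for this sketch; other cutoffs are equally valid rules.

```python
def classify(phat, threshold=0.5):
    """Predict 1 (positive) if phat is at least the threshold, else 0 (negative)."""
    return 1 if phat >= threshold else 0

# Hypothetical predicted probabilities from a fitted model.
predicted_probabilities = [0.91, 0.35, 0.50, 0.08]

predicted_labels = [classify(p) for p in predicted_probabilities]
print(predicted_labels)
```

Lowering the threshold labels more observations as positive, and raising it labels fewer.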
Predicted positive
True positive: observation predicted to be positive is actually positive
False positive: observation predicted to be positive is actually negative
Predicted negative
True negative: observation predicted to be negative is actually negative
False negative: observation predicted to be negative is actually positive
Confusion matrix
Predicted on top and actual on side
Sensitivity rate
(true positive rate): TP/(TP + FN)
Specificity rate
(true negative rate): TN/(TN + FP)
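A short sketch that counts the four confusion matrix cells and computes both rates. The actual and predicted labels are made-up data (1 = positive, 0 = negative).

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical labels (assumptions for this sketch).
actual =    [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
```

Sensitivity only uses the actual positives (TP and FN), and specificity only uses the actual negatives (TN and FP).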
Assumptions for logistic regression
Response variable needs to be a categorical variable with 2 possible outcomes
The relationship between the log odds of success and the combination of Xs should be linear
The observations need to be independent (inference)
No multicollinearity between the x variables (slope interpretability)
The sample size is large enough to support the normal approximation
No strong outliers or influential points (inference)