logistic regression Flashcards
disadvantage of linear model?
predicted probabilities may be below 0 or above 1
what does logit(p) equal?
ln(p/(1-p)) = β0 + β1*x (β1 is the expected increase in log-odds when X increases by one unit)
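A minimal R sketch of the logit and its inverse (the value of p is hypothetical, not from the deck):
p = 0.2                        # hypothetical probability
log_odds = log(p / (1 - p))    # logit(p)
plogis(log_odds)               # inverse logit (logistic function) recovers p = 0.2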
intercept in odds?
e^β0
slope in odds?
e^β1
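A sketch, assuming a fitted logistic model named glm_fit (hypothetical name): exponentiating the coefficients moves them from the log-odds scale to the odds scale.
# glm_fit is a hypothetical glm(..., family = binomial) fit
exp(coef(glm_fit))   # intercept -> baseline odds; slope -> multiplicative change in odds per unit of X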
can the estimated β1 be interpreted as the change in the probability that Y=1 associated with a unit change in X?
No. The model is linear in the log-odds, not in the probability, so the change in probability depends on the current value of X
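A sketch with hypothetical coefficients showing that the same one-unit change in X shifts the probability by different amounts depending on where X starts:
p_at = function(x) plogis(-10 + 0.005 * x)   # hypothetical beta0 = -10, beta1 = 0.005
p_at(1001) - p_at(1000)   # tiny change in probability where the curve is flat
p_at(2001) - p_at(2000)   # larger change near the steep middle of the curve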
sensitivity?
TP/P (used when false negatives are more costly than false positives); raise sensitivity by classifying more observations as ‘Yes’ (fewer FN but more FP, so specificity falls)
true positive rate?
TP/P (sensitivity = 1 – Type II error)
false positive rate?
FP/N (1 – specificity = Type I error)
positive prediction rate?
TP/P̂ (precision), where P̂ is the number of observations predicted positive
negative prediction rate?
TN/N̂, where N̂ is the number of observations predicted negative
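A sketch computing these rates from a 2x2 confusion matrix cm laid out like the table() calls further down (predicted class in rows, truth in columns, ‘No’ before ‘Yes’):
# cm = table(glm_prob > 0.5, Default$default)   # hypothetical matrix for this sketch
TP = cm[2, 2]; FP = cm[2, 1]; TN = cm[1, 1]; FN = cm[1, 2]
c(sensitivity = TP / (TP + FN),   # TP/P, true positive rate
  specificity = TN / (TN + FP),   # TN/N
  fpr = FP / (FP + TN),           # FP/N, 1 - specificity
  ppv = TP / (TP + FP),           # precision
  npv = TN / (TN + FN))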
what does the ROC (Receiver Operating Characteristic) curve trace out?
the true positive rate against the false positive rate as the probability threshold varies from 0 to 1
AUC is the area under the ROC curve. what does it measure?
it measures the overall performance of the classifier (maximum AUC = 1); the larger the AUC, the better the classifier
what is the chance line?
a random-guess classifier traces out the 45-degree diagonal (AUC = 0.5); no classifier should perform worse than this line
for cross-validation, what is used instead of the MSE?
number of misclassified observations
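A sketch of 10-fold CV with a misclassification cost, assuming the Default data from the ISLR package and boot::cv.glm (the cost function follows the example in ?cv.glm):
library(ISLR)
library(boot)
glm_fit = glm(default ~ balance, data = Default, family = binomial)
cost = function(r, pi) mean(abs(r - pi) > 0.5)           # fraction misclassified at a 0.5 threshold
cv.glm(Default, glm_fit, cost = cost, K = 10)$delta[1]   # CV misclassification rate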
converting a factor variable to 0/1 so a numeric linear regression can be fit? (some fitted values are negative, so this approach should not be used)
Default$default_yes = ifelse(Default$default == "Yes", 1, 0)
lm_fit = lm(default_yes ~ balance, data = Default)
summary(lm_fit)
to tell R to use logistic regression,
use family=binomial
e.g. glm_fit1 = glm(default ~ student,
data = Default, family = binomial)
summary(glm_fit1)
to make predictions?
predict(glm_fit1, newdata = data.frame(variable = c(option1, option2, ...)), type = "response")
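For example, with glm_fit1 above (in the Default data, student is a factor with levels "No" and "Yes"):
predict(glm_fit1, newdata = data.frame(student = c("No", "Yes")), type = "response")
# predicted probability of default for a non-student and a student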
find the predicted probabilities for the first 10 observations?
glm_prob = predict(glm_fit2, type = "response")
glm_prob[1:10]
confusion matrix at a probability threshold of 0.5?
confusion_matrix=table(glm_prob>0.5,Default$default)
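Overall accuracy can be read straight off this matrix (a quick sketch; accuracy across thresholds is computed with ROCR further down):
sum(diag(confusion_matrix)) / sum(confusion_matrix)       # accuracy at the 0.5 threshold
1 - sum(diag(confusion_matrix)) / sum(confusion_matrix)   # misclassification rate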
to find the AUC (area under the ROC curve)? (prediction() and performance() are from the ROCR package)
library(ROCR)
pred = prediction(glm_prob, Default$default)
perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc_perf = performance(pred, measure = "auc")
round(auc_perf@y.values[[1]], 2)
plot ROC curve with chance line?
plot(perf)
abline(0,1,lwd=1,lty=2)
# Add text to the ROC plot
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))
find accuracy?
accuracy_perf = performance(pred, measure = "acc")
plot(accuracy_perf, col = "deeppink3", lwd = 2)
ind = which.max(slot(accuracy_perf, "y.values")[[1]])   # index of the highest accuracy
acc = slot(accuracy_perf, "y.values")[[1]][ind]         # best accuracy
cutoff = slot(accuracy_perf, "x.values")[[1]][ind]      # threshold that achieves it
print(c(accuracy = acc, cutoff = cutoff))
add accuracy point on curve?
points(cutoff, acc, type = "p")
text(0.6, 0.86, "(0.4299, 0.9740)", cex = 0.85)
confusion matrix at the most accurate cutoff?
confusion_matrix=table(glm_prob>cutoff, Default$default)
confusion_matrix