logistic regression Flashcards
disadvantage of linear model?
predicted probabilities may be below 0 or above 1
what does logit(p) equal?
ln(p/(1-p)) = β0+β1*x (β1 is the expected increase in log-odds when X increases by one unit)
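to go from log-odds back to a probability, invert the logit: p = e^(β0+β1*x)/(1+e^(β0+β1*x)). A minimal R sketch with made-up coefficients b0 and b1 (hypothetical values, not estimates from any card):
b0 = -10; b1 = 0.005          # hypothetical coefficients
x = 1500
log_odds = b0 + b1 * x        # logit(p) = ln(p/(1-p))
p = plogis(log_odds)          # inverse logit: exp(log_odds)/(1 + exp(log_odds))
p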
intercept in odds?
e^β0
slope in odds?
e^β1
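to see why e^β1 is the odds multiplier, compare the odds at x and x+1; a sketch reusing the hypothetical b0 and b1 above:
odds_x  = exp(b0 + b1 * 1000)     # odds when x = 1000
odds_x1 = exp(b0 + b1 * 1001)     # odds when x = 1001
odds_x1 / odds_x                  # equals exp(b1): odds multiply by e^β1 per unit of X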
can the estimated β1 be interpreted as the change in the probability that Y = 1 associated with a unit change in X?
No. The model is linear in the log-odds, not in the probability, so the change in probability depends on the current value of X.
sensitivity?
TP/P (used if FN is more costly than FP). RAISE SENSITIVITY BY CLASSIFYING MORE AS ‘YES’ (fewer FN but more FP, so specificity is reduced)
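a toy illustration of the threshold trade-off (the probabilities and labels below are invented):
prob  = c(0.1, 0.4, 0.45, 0.6, 0.8)           # hypothetical predicted probabilities
truth = c("No", "No", "Yes", "No", "Yes")     # hypothetical true labels
for (t in c(0.5, 0.3)) {
  pred = ifelse(prob > t, "Yes", "No")
  sens = sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
  spec = sum(pred == "No" & truth == "No") / sum(truth == "No")
  print(c(threshold = t, sensitivity = sens, specificity = spec))
}
# lowering the threshold raises sensitivity (1/2 -> 1) but lowers specificity (2/3 -> 1/3)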
true positive rate?
TP/P (sensitivity; equals 1 – Type II error rate)
false positive rate?
FP/N (equals 1 – specificity; the Type I error rate)
positive prediction rate?
TP/P̂ (precision), where P̂ is the total number predicted positive
negative prediction rate?
TN/N̂, where N̂ is the total number predicted negative
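a sketch computing all four rates from one hypothetical confusion matrix (counts invented for illustration):
TP = 110; FN = 220; FP = 70; TN = 9600   # hypothetical counts
TPR = TP / (TP + FN)                # sensitivity = TP/P
FPR = FP / (FP + TN)                # 1 - specificity = FP/N
precision = TP / (TP + FP)          # positive prediction rate = TP/P̂
npv = TN / (TN + FN)                # negative prediction rate = TN/N̂
c(TPR = TPR, FPR = FPR, precision = precision, NPV = npv)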
what does the ROC (Receiver Operating Characteristic) curve trace out?
the true positive rate against the false positive rate as the probability threshold varies from 0 to 1
AUC is the area under the ROC curve. what does it measure?
it measures the overall performance of a classifier (max AUC = 1); the larger the AUC, the better the classifier
what is the chance line?
random guessing produces classifiers along the 45-degree line; no classifier should perform worse than this line. AUC = 0.5
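a quick check that random scores land near the chance line, using the ROCR package (labels and scores simulated for illustration):
library(ROCR)
set.seed(1)
labels = sample(c(0, 1), 1000, replace = TRUE)   # simulated true classes
scores = runif(1000)                             # random-guess "probabilities"
rand_pred = prediction(scores, labels)
performance(rand_pred, measure = "auc")@y.values[[1]]   # close to 0.5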
for cross-validation, what is used instead of MSE?
number of misclassified observations
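a sketch of a misclassification cost function in the form cv.glm expects (this mirrors the example in the boot package documentation; r is the observed 0/1 response, pi the predicted probability):
cost = function(r, pi) mean(abs(r - pi) > 0.5)   # misclassification rate at threshold 0.5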
converting a factor variable to 0/1 for linear regression (the fitted values can fall below 0, so this approach should be avoided)
library(ISLR)  # provides the Default data set
Default$default_yes = ifelse(Default$default == "Yes", 1, 0)
lm_fit = lm(default_yes ~ balance, data = Default)
summary(lm_fit)
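to confirm the problem, inspect the range of the fitted values (a one-line sketch using lm_fit from above):
range(fitted(lm_fit))   # any value below 0 is not a valid probability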
to tell R to use logistic regression,
use family=binomial
e.g. glm_fit1 = glm(default ~ student,
data = Default, family = binomial)
summary(glm_fit1)
to make predictions?
predict(glm_fit1, newdata = data.frame(student = c("Yes", "No")), type = "response")
find probability of first 10 predictions?
glm_prob = predict(glm_fit2, type = "response")   # glm_fit2: a logistic model fitted to the full Default data
glm_prob[1:10]
confusion matrix of probability threshold of 0.5?
confusion_matrix=table(glm_prob>0.5,Default$default)
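overall accuracy can be read off the same matrix; a one-line sketch:
sum(diag(confusion_matrix)) / sum(confusion_matrix)   # proportion correctly classified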
to find the AUC (area under the ROC curve)?
library(ROCR)  # for prediction() and performance()
pred = prediction(glm_prob, Default$default)
perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc_perf = performance(pred, measure = "auc")  # note: pass the prediction object, not perf
round(auc_perf@y.values[[1]],2)
plot ROC curve with chance line?
plot(perf)
abline(0,1,lwd=1,lty=2)
# Add text to the ROC plot
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))
find accuracy?
accuracy_perf = performance(pred, measure = "acc")
plot(accuracy_perf, col = "deeppink3", lwd = 2)
ind = which.max(slot(accuracy_perf, "y.values")[[1]])
acc = slot(accuracy_perf, "y.values")[[1]][ind]
cutoff = slot(accuracy_perf, "x.values")[[1]][ind]
print(c(accuracy = acc, cutoff = cutoff))
add accuracy point on curve?
points(cutoff, acc, type = "p")
text(0.6, 0.86, "(0.4299, 0.9740)", cex = 0.85)
most accurate confusion matrix?
confusion_matrix=table(glm_prob>cutoff, Default$default)
confusion_matrix
find point on ROC?
# sensitivity and specificity computed from the confusion matrix counts
sensitivity = 124 / (124 + 209)
specificity = 9615 / (9615 + 52)
TPR = sensitivity
FPR = 1 - specificity
plot(perf)
abline(0, 1, lwd = 1, lty = 2)
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))
points(FPR, TPR, type = "p", pch = 16)
model validation: validation set approach
set.seed(100)
# generate 5000 random numbers from 1-10000
ind = sample(10000, 5000)
training = Default[ind, ]
testing = Default[-ind, ]
glm_train = glm(default ~ balance + student, data = training, family = "binomial")
summary(glm_train)
glm_prob = predict(glm_train, newdata = testing, type = "response")
# classify as "Yes" if the posterior probability is greater than 0.5, otherwise "No"
glm_pred = ifelse(glm_prob > 0.5, "Yes", "No")
# confusion matrix
table(glm_pred, testing$default)
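the test-set accuracy and error rate follow directly from the predictions above (a short sketch):
mean(glm_pred == testing$default)   # validation-set accuracy
mean(glm_pred != testing$default)   # validation-set misclassification error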
another method to find accuracy?
accuracy = function(response, predict) {
  mean((predict <= 0.5) & response == 0 | (predict > 0.5) & response == 1)
}
# (predict <= 0.5) & response == 0 are the true negatives;
# (predict > 0.5) & response == 1 are the true positives
# verify the accuracy function is written correctly
response = ifelse(testing$default == "Yes", 1, 0)
predict = glm_prob
accuracy(response, predict)
k-fold CV
set.seed(100)
library(boot)  # for cv.glm()
glm_fit2 = glm(default ~ balance + income + student, data = Default, family = binomial)
cv_error = rep(0, 2)
# store the K-fold CV estimate for K = 5 and K = 10
# (with the accuracy cost function above, delta[1] is the CV accuracy, not the error rate)
cv_error[1] = cv.glm(Default, glm_fit2, accuracy, K = 5)$delta[1]
cv_error[2] = cv.glm(Default, glm_fit2, accuracy, K = 10)$delta[1]
cv_error