logistic regression Flashcards
disadvantage of linear model?
predicted probabilities may be below 0 or above 1
what does logit(p) equal?
ln(p/(1-p)) = β0+β1*x (β1 is the expected increase in log-odds when X increases by one unit)
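to go from log-odds back to a probability, invert the logit: p = e^(β0+β1*x)/(1+e^(β0+β1*x)). A minimal R sketch with made-up coefficients b0 and b1 (hypothetical values, not estimates from any card):
b0 = -10; b1 = 0.005          # hypothetical coefficients
x = 1500
log_odds = b0 + b1 * x        # logit(p) = ln(p/(1-p))
p = plogis(log_odds)          # inverse logit: exp(log_odds)/(1 + exp(log_odds))
p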
intercept in odds?
e^β0
slope in odds?
e^β1
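to see why e^β1 is the odds multiplier, compare the odds at x and x+1; a sketch reusing the hypothetical b0 and b1 above:
odds_x  = exp(b0 + b1 * 1000)     # odds when x = 1000
odds_x1 = exp(b0 + b1 * 1001)     # odds when x = 1001
odds_x1 / odds_x                  # equals exp(b1): odds multiply by e^β1 per unit of X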
can the estimated β1 be interpreted as the change in the probability that Y = 1 associated with a unit change in X?
No. The model is linear in the log-odds, not in the probability, so the change in probability depends on the current value of X.
sensitivity?
TP/P (used if FN is more costly than FP). RAISE SENSITIVITY BY CLASSIFYING MORE AS ‘YES’ (fewer FN but more FP, so specificity is reduced)
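a toy illustration of the threshold trade-off (the probabilities and labels below are invented):
prob  = c(0.1, 0.4, 0.45, 0.6, 0.8)           # hypothetical predicted probabilities
truth = c("No", "No", "Yes", "No", "Yes")     # hypothetical true labels
for (t in c(0.5, 0.3)) {
  pred = ifelse(prob > t, "Yes", "No")
  sens = sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
  spec = sum(pred == "No" & truth == "No") / sum(truth == "No")
  print(c(threshold = t, sensitivity = sens, specificity = spec))
}
# lowering the threshold raises sensitivity (1/2 -> 1) but lowers specificity (2/3 -> 1/3)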
true positive rate?
TP/P (sensitivity; equals 1 – Type II error rate)
false positive rate?
FP/N (equals 1 – specificity; the Type I error rate)
positive prediction rate?
TP/P̂ (precision), where P̂ is the total number predicted positive
negative prediction rate?
TN/N̂, where N̂ is the total number predicted negative
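a sketch computing all four rates from one hypothetical confusion matrix (counts invented for illustration):
TP = 110; FN = 220; FP = 70; TN = 9600   # hypothetical counts
TPR = TP / (TP + FN)                # sensitivity = TP/P
FPR = FP / (FP + TN)                # 1 - specificity = FP/N
precision = TP / (TP + FP)          # positive prediction rate = TP/P̂
npv = TN / (TN + FN)                # negative prediction rate = TN/N̂
c(TPR = TPR, FPR = FPR, precision = precision, NPV = npv)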
what does the ROC (Receiver Operating Characteristic) curve trace out?
the true positive rate against the false positive rate as the probability threshold varies from 0 to 1
AUC is the area under the ROC curve. what does it measure?
it measures the overall performance of a classifier (max AUC = 1); the larger the AUC, the better the classifier
what is the chance line?
random guessing produces classifiers along the 45-degree line; no classifier should perform worse than this line. AUC = 0.5
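a quick check that random scores land near the chance line, using the ROCR package (labels and scores simulated for illustration):
library(ROCR)
set.seed(1)
labels = sample(c(0, 1), 1000, replace = TRUE)   # simulated true classes
scores = runif(1000)                             # random-guess "probabilities"
rand_pred = prediction(scores, labels)
performance(rand_pred, measure = "auc")@y.values[[1]]   # close to 0.5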
for cross-validation, what is used instead of MSE?
number of misclassified observations
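a sketch of a misclassification cost function in the form cv.glm expects (this mirrors the example in the boot package documentation; r is the observed 0/1 response, pi the predicted probability):
cost = function(r, pi) mean(abs(r - pi) > 0.5)   # misclassification rate at threshold 0.5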
converting a factor variable to 0/1 for linear regression (the fitted values can fall below 0, so this approach should be avoided)
library(ISLR)  # provides the Default data set
Default$default_yes = ifelse(Default$default == "Yes", 1, 0)
lm_fit = lm(default_yes ~ balance, data = Default)
summary(lm_fit)
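to confirm the problem, inspect the range of the fitted values (a one-line sketch using lm_fit from above):
range(fitted(lm_fit))   # any value below 0 is not a valid probability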
to tell R to use logistic regression,
use family=binomial
e.g. glm_fit1 = glm(default ~ student,
data = Default, family = binomial)
summary(glm_fit1)
to make predictions?
predict(glm_fit1, newdata = data.frame(student = c("Yes", "No")), type = "response")
find probability of first 10 predictions?
glm_prob = predict(glm_fit2, type = "response")   # glm_fit2: a logistic model fitted to the full Default data
glm_prob[1:10]
confusion matrix of probability threshold of 0.5?
confusion_matrix=table(glm_prob>0.5,Default$default)
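overall accuracy can be read off the same matrix; a one-line sketch:
sum(diag(confusion_matrix)) / sum(confusion_matrix)   # proportion correctly classified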
to find the AUC (area under the ROC curve)?
library(ROCR)  # for prediction() and performance()
pred = prediction(glm_prob, Default$default)
perf = performance(pred, measure = "tpr", x.measure = "fpr")
auc_perf = performance(pred, measure = "auc")  # note: pass the prediction object, not perf
round(auc_perf@y.values[[1]],2)
plot ROC curve with chance line?
plot(perf)
abline(0,1,lwd=1,lty=2)
# Add text to the ROC plot
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))
find accuracy?
accuracy_perf = performance(pred, measure = "acc")
plot(accuracy_perf, col = "deeppink3", lwd = 2)
ind = which.max(slot(accuracy_perf, "y.values")[[1]])
acc = slot(accuracy_perf, "y.values")[[1]][ind]
cutoff = slot(accuracy_perf, "x.values")[[1]][ind]
print(c(accuracy = acc, cutoff = cutoff))
add accuracy point on curve?
points(cutoff, acc, type = "p")
text(0.6, 0.86, "(0.4299, 0.9740)", cex = 0.85)
most accurate confusion matrix?
confusion_matrix=table(glm_prob>cutoff, Default$default)
confusion_matrix
find point on ROC?
# sensitivity and specificity computed from the confusion matrix counts
sensitivity = 124 / (124 + 209)
specificity = 9615 / (9615 + 52)
TPR = sensitivity
FPR = 1 - specificity
plot(perf)
abline(0, 1, lwd = 1, lty = 2)
text(0.4, 0.8, paste("AUC =", round(auc_perf@y.values[[1]], 2)))
points(FPR, TPR, type = "p", pch = 16)
model validation: validation set approach
set.seed(100)
# generate 5000 random numbers from 1-10000
ind = sample(10000, 5000)
training = Default[ind, ]
testing = Default[-ind, ]
glm_train = glm(default ~ balance + student, data = training, family = "binomial")
summary(glm_train)
glm_prob = predict(glm_train, newdata = testing, type = "response")
# classify as "Yes" if the posterior probability is greater than 0.5, otherwise "No"
glm_pred = ifelse(glm_prob > 0.5, "Yes", "No")
# confusion matrix
table(glm_pred, testing$default)
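the test-set accuracy and error rate follow directly from the predictions above (a short sketch):
mean(glm_pred == testing$default)   # validation-set accuracy
mean(glm_pred != testing$default)   # validation-set misclassification error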
another method to find accuracy?
accuracy = function(response, predict) {
  mean((predict <= 0.5) & response == 0 | (predict > 0.5) & response == 1)
}
# (predict <= 0.5) & response == 0 are the true negatives;
# (predict > 0.5) & response == 1 are the true positives
# verify the accuracy function is written correctly
response = ifelse(testing$default == "Yes", 1, 0)
predict = glm_prob
accuracy(response, predict)
k-fold CV
set.seed(100)
library(boot)  # for cv.glm()
glm_fit2 = glm(default ~ balance + income + student, data = Default, family = binomial)
cv_error = rep(0, 2)
# store the K-fold CV estimate for K = 5 and K = 10
# (with the accuracy cost function above, delta[1] is the CV accuracy, not the error rate)
cv_error[1] = cv.glm(Default, glm_fit2, accuracy, K = 5)$delta[1]
cv_error[2] = cv.glm(Default, glm_fit2, accuracy, K = 10)$delta[1]
cv_error