06.b Logistic Regression Flashcards
What is Logistic Regression
Logistic regression is a supervised predictive technique. It is used to describe data and to explain the relationship between one categorical dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables by estimating probabilities with a logistic function.
What type of output variable comes from Logistic Regression
A categorical outcome variable. Logistic regression predicts the probability of each outcome category from the input variables.
Name four use cases for Logistic Regression
Medical
Finance
Marketing
Engineering
What shape is the common Logistic Curve
An S-shaped curve. The bottom left approaches zero and the top right approaches one, with an S shape joining the two corners
What is the Logistic Function (equation)
f(y) = e^y / (1+e^y) for -infinity < y < +infinity
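The logistic function on this card can be tried out directly. The sketch below is illustrative only: the helper name `logistic` and the sample values of `y` are assumptions, and `plogis()` is base R's built-in version of the same function.

```r
# Logistic function as written on the card: f(y) = e^y / (1 + e^y)
# (the name `logistic` is a hypothetical helper, chosen for illustration)
logistic <- function(y) exp(y) / (1 + exp(y))

# Sample inputs, chosen arbitrarily
y <- c(-5, -1, 0, 1, 5)

# Values rise from near 0 to near 1 and pass through 0.5 at y = 0
logistic(y)

# plogis() from the stats package computes the same curve
all.equal(logistic(y), plogis(y))
```

Plotting `logistic` over a wider range, for example with `curve(logistic, -6, 6)`, shows the S shape described on the previous card.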
What is MLE in terms of Logistic Regression
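MLE stands for Maximum Likelihood Estimation. It estimates the model coefficients by choosing the values that maximise the likelihood of the observed data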
What does churn mean
Churn refers to a customer switching to another company. A churn model estimates the likelihood that a customer will do so
Which function should you use for Logistic Regression in R
The Generalised Linear Model function glm(), with a binomial family and a logit link:
OutputDF = glm(Churned ~ Age + Married + Cust_Years + Churned_Contacts, data = churn_input, family = binomial(link = "logit"))
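A runnable version of the glm() call might look like the sketch below. The column names come from the card, but the data is simulated purely for illustration: the sample size, the distributions and the coefficients used to generate `Churned` are all assumptions, not real churn data.

```r
set.seed(1)
n <- 500

# Simulated customers (assumed distributions, for illustration only)
churn_input <- data.frame(
  Age              = round(runif(n, 18, 80)),
  Married          = rbinom(n, 1, 0.5),
  Cust_Years       = round(runif(n, 0, 10)),
  Churned_Contacts = rpois(n, 1)
)

# Simulated outcome: hypothetical coefficients, not estimates from real data
lin <- 3 - 0.08 * churn_input$Age + 0.5 * churn_input$Churned_Contacts
churn_input$Churned <- rbinom(n, 1, plogis(lin))

# Logistic regression: binomial family with a logit link
OutputDF <- glm(Churned ~ Age + Married + Cust_Years + Churned_Contacts,
                data = churn_input, family = binomial(link = "logit"))
summary(OutputDF)

# Linear predictor (log-odds) and predicted probability for each customer
y_hat <- predict(OutputDF, type = "link")
p_hat <- predict(OutputDF, type = "response")
```

`summary()` reports the coefficients together with the null deviance, residual deviance and AIC covered on later cards.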
Describe Odds
The Odds of something happening are the chances of it happening divided by the chances of it not happening: odds = p / (1 - p).
Describe Probability
The Probability of something happening is the chances of it happening divided by the chances of all possible results.
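The difference between odds and probability is easiest to see with numbers. A small worked example, assuming an event that occurs in 1 of 5 equally likely outcomes (the figures are made up for illustration):

```r
# Probability: chances of the event over all possible results
p <- 1 / 5

# Odds: chances of the event over the chances of it not happening
odds <- p / (1 - p)        # 1 to 4

# Converting odds back to a probability
p_back <- odds / (1 + odds)

c(probability = p, odds = odds, recovered_probability = p_back)
```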
Once you have calculated the Generalised Linear Model for y which equation should you use to calculate the probability
p = e^y / (1 + e^y)
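As a worked example, the linear predictor y can be converted to a probability with the logistic function f(y) = e^y / (1 + e^y). The coefficients and the customer age below are hypothetical values chosen for illustration, not output from a fitted model.

```r
# Hypothetical coefficients (assumed for illustration only)
b0    <- 3.0
b_age <- -0.08

# One hypothetical customer
age <- 40

# Linear predictor from the model: this is the log-odds
y <- b0 + b_age * age

# Probability via the logistic function
p <- exp(y) / (1 + exp(y))

# The odds implied by p are simply e^y
odds <- p / (1 - p)
all.equal(odds, exp(y))
```

With a fitted glm object, `predict(model, type = "response")` performs this conversion for every row.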
What is the Akaike Information Criterion (AIC)
AIC can be viewed as the counterpart of adjusted R-squared in multiple linear regression. It is an important indicator of model fit when comparing models fitted to the same data, and the rule is: the smaller the better. AIC penalises an increasing number of coefficients in the model, which helps to avoid over-fitting.
In Logistic Regression what is the Null Deviance
The Null Deviance is the deviance of a model whose likelihood function is based only on the intercept term
In Logistic Regression what is the Residual Deviance
The Residual Deviance is the deviance of a model whose likelihood function is based on all the parameters in the specified logistic model
In Logistic Regression how do you calculate a Pseudo - R squared
Pseudo R Squared = 1 - (residual dev. / null dev.)
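The null deviance, residual deviance and pseudo R squared can all be read off a fitted glm object. The sketch below uses R's built-in `mtcars` data set with `am` (transmission type, 0/1) as the outcome and `wt` as the only predictor. That model is an arbitrary choice made for illustration.

```r
# Example model on built-in data (model choice is an assumption)
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

null_dev  <- fit$null.deviance   # intercept-only model
resid_dev <- fit$deviance        # specified model

# Pseudo R squared as defined on the card
pseudo_r2 <- 1 - resid_dev / null_dev
pseudo_r2

# Fitting the intercept-only model explicitly reproduces the null deviance
fit0 <- glm(am ~ 1, data = mtcars, family = binomial(link = "logit"))
all.equal(deviance(fit0), null_dev)

# AIC of the specified model: smaller is better when comparing models
AIC(fit)
```

For a 0/1 outcome like this one, the AIC equals the residual deviance plus twice the number of coefficients, which is where the penalty for extra coefficients comes from.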
The Deviance of an observation is calculated how
-2 * log (likelihood of that observation)
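The per-observation deviance can be checked against the model's residual deviance. This is a sketch on R's built-in `mtcars` data. The model `am ~ wt` is an arbitrary example, and the likelihood of one observation is taken to be the fitted probability of the class that was actually observed.

```r
# Example model on built-in data (model choice is an assumption)
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

p <- fitted(fit)    # predicted probability that am == 1
y <- mtcars$am      # observed class (0 or 1)

# Likelihood of each observation: probability assigned to the observed class
lik_i <- ifelse(y == 1, p, 1 - p)

# Deviance of each observation, as on the card
dev_i <- -2 * log(lik_i)

# Summing over all observations gives the residual deviance
sum(dev_i)
deviance(fit)
```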
What is a confusion matrix
A table of Actual Class (AC) against Predicted Class (PC) showing the counts of true and false positives and negatives:
                      PC Positives (1)    PC Negatives (0)
AC Positives (1)      True Pos            False Neg
AC Negatives (0)      False Pos           True Neg
A good classifier should have high True (Pos & Neg) counts and low False (Pos & Neg) counts
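A confusion matrix like the one above can be built in base R with `table()`. The two label vectors below are made-up values for illustration. Ordering the factor levels as 1 then 0 is a choice made here so that the layout matches the card.

```r
# Hypothetical actual and predicted classes for ten observations
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Rows are the Actual Class, columns are the Predicted Class
cm <- table(Actual    = factor(actual,    levels = c(1, 0)),
            Predicted = factor(predicted, levels = c(1, 0)))
cm

# Pull out the four cells by name
TP <- cm["1", "1"]
FN <- cm["1", "0"]
FP <- cm["0", "1"]
TN <- cm["0", "0"]
```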
What is the true positive rate (TPR)
TPR = TP / (TP + FN)
All the TP divided by all the actual Positives
What is the false positive rate (FPR)
FPR = FP / (FP + TN)
All the FP divided by all the actual Negatives
What is the true negative rate (TNR)
TNR = TN / (FP + TN)
All the TN divided by all the actual Negatives
What is the false negative rate (FNR)
FNR = FN / (TP + FN)
All of the FN divided by all the actual Positives
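The four rates can be computed side by side from the cells of a confusion matrix. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical confusion matrix counts
TP <- 40; FN <- 10   # 50 actual positives
FP <- 20; TN <- 30   # 50 actual negatives

TPR <- TP / (TP + FN)   # true positive rate
FNR <- FN / (TP + FN)   # false negative rate
FPR <- FP / (FP + TN)   # false positive rate
TNR <- TN / (FP + TN)   # true negative rate

c(TPR = TPR, FNR = FNR, FPR = FPR, TNR = TNR)

# Rates that share a denominator sum to one
TPR + FNR
FPR + TNR
```

Because each pair shares a denominator, knowing TPR and FPR is enough to recover the other two.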
What is Accuracy of a Confusion Matrix
Accuracy = (TP + TN) / (TP + TN + FP + FN)
So the correct ones / everything
What is Precision of a Confusion Matrix
Precision = TP / (TP + FP)
P for Precision: all of the P's! TP divided by all of the predicted positives
What is Recall of a Confusion Matrix
Recall = TP / (TP + FN)
Which is the same as the TPR
How do you calculate the F Score of a Confusion Matrix
FScore = 2 x (Precision x Recall) / (Precision + Recall)
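Accuracy, precision, recall and the F score can be worked through together from one set of counts. The counts below are hypothetical, chosen so that precision and recall differ.

```r
# Hypothetical confusion matrix counts
TP <- 40; FN <- 10
FP <- 20; TN <- 30

# Correct predictions over everything
accuracy <- (TP + TN) / (TP + TN + FP + FN)

# TP over all predicted positives
precision <- TP / (TP + FP)

# TP over all actual positives (same as TPR)
recall <- TP / (TP + FN)

# F score: harmonic mean of precision and recall
f_score <- 2 * (precision * recall) / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f_score = f_score), 3)
```

The F score sits between precision and recall and is pulled toward the lower of the two.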
When would you consider using Ridge Regression or Lasso Regression
In the case of multicollinearity you could consider using Ridge or Lasso regression because they apply penalties based on the size of the coefficients in an effort to reduce the impact of the multicollinearity
What is a ROC curve
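ROC stands for Receiver Operating Characteristic
It is a plot of the TPR (y axis) against the FPR (x axis) as the classification threshold varies
A 45 degree diagonal line represents a classifier that does no better than random guessing
A straight line up the y axis and a flat line across the top represents a perfect classifier (100% accuracy)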
ROC Curve is looking at the Positives!
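The points on a ROC curve can be computed by hand in base R by sweeping the threshold and recording the TPR and FPR at each step. The labels and scores below are made up for illustration. Dedicated packages exist for this, but none is assumed here.

```r
# Hypothetical actual classes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
score  <- c(0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10)

# Start above every score (nothing predicted positive), then step down
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))

# One (FPR, TPR) point per threshold
roc <- as.data.frame(t(sapply(thresholds, function(th) {
  pred <- as.integer(score >= th)
  c(threshold = th,
    FPR = sum(pred == 1 & actual == 0) / sum(actual == 0),
    TPR = sum(pred == 1 & actual == 1) / sum(actual == 1))
})))
roc

# Draw the curve with the 45 degree reference line (interactive sessions only)
if (interactive()) {
  plot(roc$FPR, roc$TPR, type = "s",
       xlab = "False positive rate", ylab = "True positive rate")
  abline(0, 1, lty = 2)
}
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.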
In logistic regression what is the default threshold
50%, i.e. a predicted probability of 0.5 is the conventional cut-off between the two classes
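Applying the threshold is a one-line step once the predicted probabilities are available. The probabilities below are hypothetical. Treating a value of exactly 0.5 as positive is an assumption made here, since conventions differ on that boundary case.

```r
# Hypothetical predicted probabilities
p_hat <- c(0.91, 0.50, 0.49, 0.12)

# Default threshold of 50%
threshold  <- 0.5
pred_class <- ifelse(p_hat >= threshold, 1, 0)
pred_class

# A lower threshold labels more observations as positive
pred_class_low <- ifelse(p_hat >= 0.3, 1, 0)
pred_class_low
```

Moving the threshold trades false positives against false negatives, which is what the ROC curve traces out.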
What is another name for a leaf node
A class label