Class 9 - Predictive Analytics Flashcards
How is classification defined?
Classification is a predictive analytics technique used to separate or classify a sample (or population) into two or more groups or classes. We are predicting a probabilistic outcome, or what might happen based on our forecasts.
What are categorical dependent variables and the estimation methods?
Binary (categorical) dependent variables: 1=Yes, 0=No
Estimation methods: Linear regression (Linear probability model), Logistic regression
What is LPM and its pros and cons?
Use linear regression to estimate models with a binary (categorical) dependent variable (1=Yes, 0=No)
Pros:
Easy (you can run it using Excel's built-in Analysis ToolPak)
Can handle models with many explanatory variables and large data sets.
Cons:
The error term will follow a binomial, instead of normal, distribution.
The variance of the error term will vary with the explanatory variables.
The predicted probabilities will not be bounded by 0 and 1.
What is the Logit Model and its pros and cons?
Use logistic regression to estimate models with a binary (categorical) dependent variable (1=Yes, 0=No).
Pros:
Solve all the problems with LPM.
Cons:
Difficult to estimate if the model is complicated and the data set is large.
Need to use the RegressIt Logistic Excel add-in
Why is logistic regression better for binary outcomes than linear regression?
Logistic regression ensures predictions are bounded between 0 and 1, while linear regression does not.
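A minimal Python sketch (hypothetical coefficients, for illustration only) contrasting the two models: the LPM prediction is a linear function and can fall outside [0, 1], while the logistic transformation always returns a value in (0, 1).

```python
import math

def linear_prob(b0, b1, x):
    # Linear probability model: prediction is b0 + b1*x, NOT bounded to [0, 1]
    return b0 + b1 * x

def logit_prob(b0, b1, x):
    # Logistic model: pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), always in (0, 1)
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical coefficients for illustration
b0, b1 = -1.0, 0.8
for x in (-5, 0, 5):
    print(f"x={x:>2}: LPM={linear_prob(b0, b1, x):>5.2f}  logit={logit_prob(b0, b1, x):.3f}")
```

At x = 5 the LPM "probability" is 3.00, which is impossible; the logistic prediction stays inside (0, 1) for any x.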
Explain logistic regression, including the steps to solve it.
The Log-Odds Equation is:
log(π / (1 − π)) = β0 + β1x1 + ⋯ + βp−1xp−1
Where:
π = Probability of event Y=1
β0,β1, etc. = Regression coefficients
x1, x2, etc. = Independent variables
Steps to Solve Regression
Start with the log-odds equation.
Exponentiate both sides to calculate the odds
π / (1 − π) = e^(β0 + β1x1 + ⋯ + βp−1xp−1)
Solve for the probability π:
π = e^(β0 + β1x1 + ⋯ + βp−1xp−1) / (1 + e^(β0 + β1x1 + ⋯ + βp−1xp−1))
Key Concepts:
Log-Odds: A linear function of parameters (βi).
Probability: Non-linear function of parameters.
Coefficient Interpretation (βk):
Each βk represents the increase in log-odds of Y=1 for a one-unit increase in Xk, holding other variables constant.
Classification Rule:
Convert probabilities into binary predictions using a cutoff (e.g., 0.5):
If predicted probability ≥ cutoff, classify Y=1
If predicted probability < cutoff, classify Y=0
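The steps above can be sketched in Python; the coefficients and inputs here are hypothetical, for illustration only.

```python
import math

def probability_from_log_odds(coefs, xs):
    # Step 1: the log-odds is a linear function of the coefficients
    log_odds = coefs[0] + sum(b * x for b, x in zip(coefs[1:], xs))
    # Step 2: exponentiate both sides to get the odds pi/(1 - pi)
    odds = math.exp(log_odds)
    # Step 3: solve for the probability pi = odds / (1 + odds)
    return odds / (1 + odds)

def classify(pi, cutoff=0.5):
    # Step 4: classify Y=1 when the predicted probability >= cutoff, else Y=0
    return 1 if pi >= cutoff else 0

# Hypothetical fitted coefficients: beta0 = -2.0, beta1 = 0.5, beta2 = 1.2
pi = probability_from_log_odds([-2.0, 0.5, 1.2], [3.0, 1.0])
print(round(pi, 3), classify(pi))  # log-odds = 0.7, so pi is about 0.668
```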
What is a classification table? Draw it Out and Explain it.
A classification table (confusion matrix) evaluates how well a logistic regression model performs in predicting binary outcomes (e.g., 0/1) based on a cutoff level (like 0.5)
Table Breakdown:
The rows show the actual outcomes (0 or 1).
The columns show the predicted outcomes (0 or 1).
Actual \ Predicted | 0 (Predicted)              | 1 (Predicted)
0 (Actual)         | True Negative (Correct)    | False Positive (Incorrect)
1 (Actual)         | False Negative (Incorrect) | True Positive (Correct)
Definitions and Metrics
Percent Correct = (True Positive + True Negative) / Total Observations
Measures the overall accuracy of the model.
True Positive Rate (Sensitivity) = True Positive / (True Positive + False Negative)
Measures how well the model correctly identifies positive cases (Y=1).
High sensitivity = Few false negatives.
True Negative Rate (Specificity) = True Negative / (True Negative + False Positive)
Measures how well the model correctly identifies negative cases (Y=0).
High specificity = Few false positives.
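The three metrics can be computed directly from the four confusion-matrix cells; a short Python sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    # Metrics defined directly from the confusion-matrix cells
    total = tp + tn + fp + fn
    return {
        "percent_correct": (tp + tn) / total,  # overall accuracy
        "sensitivity": tp / (tp + fn),         # true positive rate
        "specificity": tn / (tn + fp),         # true negative rate
    }

# Hypothetical counts for illustration
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(m)  # percent_correct=0.85, sensitivity=0.8, specificity=0.9
```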
Explain the cutoff value and how it is selected. When do you adjust cutoffs?
The cutoff value in logistic regression determines the threshold probability at which we classify a prediction as 1 (yes) or 0 (no).
Default cutoff = 0.5: If the predicted probability ≥ 0.5, classify as 1; otherwise, classify as 0.
When to Adjust the Cutoff?
If False Positives (FP) are more costly than False Negatives (FN):
Increase the cutoff value above 0.5 to reduce false positives.
This makes the model more conservative and less likely to predict 1 (positive).
If False Negatives (FN) are more costly than False Positives (FP):
Lower the cutoff value below 0.5 to reduce false negatives.
This makes the model more likely to predict 1 (positive).
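A quick Python illustration of how moving the cutoff changes the number of positive predictions (the probabilities are hypothetical):

```python
def classify_all(probs, cutoff):
    # Classify each predicted probability as 1 (>= cutoff) or 0 (< cutoff)
    return [1 if p >= cutoff else 0 for p in probs]

# Hypothetical predicted probabilities for ten observations
probs = [0.15, 0.30, 0.45, 0.55, 0.60, 0.70, 0.80, 0.35, 0.52, 0.90]

# Raising the cutoff makes the model more conservative (fewer 1s predicted);
# lowering it makes the model more likely to predict 1
for cutoff in (0.3, 0.5, 0.7):
    print(f"cutoff={cutoff}: positives predicted = {sum(classify_all(probs, cutoff))}")
```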
What is a ROC Curve and Chart in the Logistic Regression?
It shows the classification accuracy of the model at different cutoff levels.
The perfect model has a true positive rate of 100% and a false positive rate of 0%, i.e.,
it sits at the upper left-hand corner of the ROC chart, and
the area under the ROC curve equals 1.0 (compared with 0.5 for a naive model that always makes the same classification: always 1 or always 0).
How do you choose the best cutoff level using the ROC curve?
Choosing the Best Cutoff: You want a cutoff that:
Maximizes the weighted average of:
Sensitivity (True Positive Rate): Proportion of actual positives correctly predicted.
Specificity (True Negative Rate): Proportion of actual negatives correctly predicted.
OR
Minimizes the weighted average of:
% False Positive Rate: Predicted Y=1 when it’s actually Y=0.
% False Negative Rate: Predicted Y=0 when it’s actually Y=1.
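A small Python sketch that traces out ROC points (one FPR/TPR pair per cutoff) from hypothetical predicted probabilities and actual outcomes:

```python
def roc_points(probs, actuals, cutoffs):
    # For each cutoff, compute (false positive rate, true positive rate)
    points = []
    for c in cutoffs:
        preds = [1 if p >= c else 0 for p in probs]
        tp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 1)
        fp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 0)
        fn = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 1)
        tn = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 0)
        tpr = tp / (tp + fn)  # sensitivity
        fpr = fp / (fp + tn)  # 1 - specificity
        points.append((c, fpr, tpr))
    return points

# Hypothetical probabilities and actual outcomes
probs   = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [1,   1,   0,   1,   0,   1,   0,   0]

for c, fpr, tpr in roc_points(probs, actuals, [0.25, 0.5, 0.75]):
    print(f"cutoff={c}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Plotting TPR against FPR across many cutoffs gives the ROC curve; the cutoff nearest the upper left-hand corner balances sensitivity and specificity.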
Bankruptcy Classification: Altman’s Z Score: What are the five factors Altman found that help predict bankruptcy?
X1: Working capital / Total assets
Measures liquidity, or the level of liquid, cash-like assets relative to the size of the company
X2: Retained earnings / Total assets
Measures long-term profitability over the life of the company
X3: Earnings before interest and taxes / Total assets
Measures recent, or short-term profitability of the company
X4: Market value of stockholders’ equity / Book value of total debt owed
Measures long-term solvency of the company, or whether the company will have sufficient funds to pay its debt as it comes due
X5: Sales / Total assets
Measures asset efficiency, or how well assets are utilized.
Bankruptcy Classification: Decision Rules: What is the original Altman Z-score formula? What decision rules apply to the Z-score outcome?
Original Altman’s Z-score formula:
Z = 1.2 X1 + 1.4 X2 + 3.3 X3 + 0.6 X4 + 1.0 X5
Decision Rules
If Z < 1.80
Classify as significant risk of bankruptcy, or in the “distress zone”
If Z >= 1.80 and Z < 3.00
Classify as at risk of bankruptcy, or in the “gray zone”
If Z >= 3.00
Classify as not currently at risk of bankruptcy, or in the “safe zone”
Explain the trade-off between false positives and false negatives in the context of bankruptcy prediction using Altman’s model.
If the cost of false positives (Type I errors) is higher (e.g., rejecting healthy companies), you might set a higher cutoff level.
If the cost of false negatives (Type II errors) is higher (e.g., missing bankruptcies), you set a lower cutoff to be stricter in predicting bankruptcy.
Outcome   | H₀ is False (Bankruptcy occurs)   | H₀ is True (No bankruptcy)
Reject H₀ | Correct (True Positive: 1 − β)    | Type I Error (False Positive: α)
Accept H₀ | Type II Error (False Negative: β) | Correct (True Negative: 1 − α)
What are Beneish’s eight factors that predict financial statement fraud?
Increase in receivables as compared to last period.
Decline in gross margin as a percent of sales.
Decline in asset quality index (sum of current and noncurrent physical assets as compared to total assets).
Increase in sales growth in current period.
Decrease in depreciation expense.
Decrease in selling, general, and administrative costs.
Increase in debt (leverage).
Higher total accruals to total assets.
What does Beneish’s M-Score tell us about fraud, and what are false positives/negatives?
Rule: M-Score < -1.78 = Safe; M-Score ≥ -1.78 = Possible Fraud.
False Positive: Company is flagged for fraud but innocent (M-Score ≥ -1.78 incorrectly).
False Negative: Company is NOT flagged but committed fraud (M-Score < -1.78 incorrectly).
Accuracy: 76% fraud detection, 17.5% false positives.
DSRI: Unusual increase in receivables → inflated sales?
GMI: Decline in gross margin → hiding poor performance?
AQI: More questionable assets → aggressive accounting?
SGI: Rapid sales growth → pressure to manipulate?
DEPI: Reduced depreciation → delayed expenses?
SGAI: Changes in SG&A → expense manipulation?
TATA: High accruals → non-cash earnings manipulation?
LVGI: Increased debt → pressure to meet obligations?
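The M-Score decision rule itself is a one-line threshold check. Only the −1.78 cutoff is given on this card; the full M-Score formula with its eight weighted ratios is not reproduced here.

```python
def m_score_flag(m_score, cutoff=-1.78):
    # Beneish decision rule: M-Score >= -1.78 flags possible fraud, below is safe
    return "possible fraud" if m_score >= cutoff else "safe"

print(m_score_flag(-2.5))  # well below the cutoff: safe
print(m_score_flag(-1.0))  # above the cutoff: possible fraud
```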