Data Science using Python and R - 7 Flashcards
What is the primary focus of model evaluation in data science?
To evaluate the usefulness of models in making predictions.
What is the difference between evaluation and validation?
Validation ensures consistency between training and test data sets, while evaluation measures accuracy and error rates.
What type of models are discussed in the context of evaluation measures in this chapter?
Classification models, specifically decision trees.
What is a contingency table?
A table that summarizes the performance of a classification model by comparing predicted and actual outcomes.
Define the terms TN, FP, FN, and TP in the context of a contingency table.
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- TP: True Positives
What does accuracy measure in model evaluation?
The proportion of correct classifications made by the model.
What is the formula for calculating the error rate?
Error Rate = 1 - Accuracy.
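As a minimal sketch of how accuracy and error rate follow from the four contingency-table cells, the snippet below uses hypothetical counts (they are not taken from the clothing data) and assumed variable names.

```python
# Hypothetical contingency-table counts, chosen only for illustration.
tn, fp, fn, tp = 850, 50, 70, 30

grand_total = tn + fp + fn + tp        # all records in the test set
accuracy = (tn + tp) / grand_total     # proportion classified correctly
error_rate = 1 - accuracy              # same as (fp + fn) / grand_total

print(f"accuracy={accuracy:.4f}, error_rate={error_rate:.4f}")
```

The two measures always sum to 1, so reporting either one determines the other.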
What do sensitivity and specificity measure in classification models?
- Sensitivity: Ability to classify positive records correctly
- Specificity: Ability to classify negative records correctly
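The two definitions above can be sketched as small functions. The function names and the counts are assumptions made for illustration, not values from the chapter.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of actual positives that the model labels positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of actual negatives that the model labels negative."""
    return tn / (tn + fp)


# Hypothetical counts, chosen only for illustration.
tn, fp, fn, tp = 850, 50, 70, 30
print(f"sensitivity={sensitivity(tp, fn):.4f}")
print(f"specificity={specificity(tn, fp):.4f}")
```

A model can score well on one measure and poorly on the other, which is why both are usually reported together.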
How is precision defined?
Precision = TP / TPP, where TPP is the total predicted positive (TP + FP).
What does recall measure?
Recall is another term for sensitivity, measuring the proportion of actual positives captured by the model.
What are Fβ scores used for?
To combine precision and recall into a single measure.
What does F1 score represent?
The harmonic mean of precision and recall, with equal weighting.
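The cards do not give the Fβ formula itself. A common form is Fβ = (1 + β²) · precision · recall / (β² · precision + recall), and the sketch below implements that form with hypothetical precision and recall values. The function name `f_beta` is an assumption.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily, and beta == 1 gives the F1 score.
    """
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # both inputs are zero, so the score is defined as 0
    return (1 + b2) * precision * recall / denominator


# Hypothetical values with recall lower than precision.
p, r = 0.5, 0.25
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2)
f_half = f_beta(p, r, beta=0.5)
print(f"F1={f1:.4f}, F2={f2:.4f}, F0.5={f_half:.4f}")
```

Because recall is the weaker of the two inputs here, the recall-weighted F2 comes out lowest and the precision-weighted F0.5 highest.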
What is the method for model evaluation?
- Develop the model using the training data set
- Evaluate the model using the test data set
What is the target variable in the clothing data example?
Response, coded as 1 for positive and 0 for negative.
What are the three continuous predictors in the clothing data example?
- Days since Purchase
- # of Purchase Visits
- Sales per Visit
What does the accuracy of Model 1 indicate?
Model 1 has an accuracy of 0.8410 or 84.10%.
What is the baseline performance for the All Negative Model?
Accuracy = TAN / GT, where TAN is the total actually negative (TN + FP) and GT is the grand total of records.
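To see why the All Negative baseline reduces to the share of actual negatives, the sketch below scores an always-negative prediction against a short, hypothetical list of labels. The variable names are assumptions.

```python
# Hypothetical actual labels: 1 = positive response, 0 = negative.
actual = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

predicted = [0] * len(actual)      # the All Negative model predicts 0 everywhere
tan = actual.count(0)              # total actually negative
gt = len(actual)                   # grand total of records
baseline_accuracy = tan / gt

# The same figure obtained by comparing predictions with actual labels.
direct_accuracy = sum(a == p for a, p in zip(actual, predicted)) / gt

print(baseline_accuracy, direct_accuracy)
```

Any candidate model should at least beat this figure before its accuracy is considered meaningful.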
What does a specificity of 0.9541 indicate about Model 1?
The model correctly classified 95.41% of actual negative records.
What is indicated by Model 1’s sensitivity of 0.2804?
Only 28.04% of actual positive records were classified as positive.
How is precision calculated for Model 1?
Precision = TP / TPP, the true positives divided by the total predicted positive.
What does the F1 score of 0.372 signify?
It reflects the balance between precision and recall for the model.
What are the steps to perform model evaluation using R?
- Develop Model 1 using training data
- Run test data through Model 1
What command is used to create a contingency table in R?
table() command.
How do you add row names to a table in R?
Use row.names() function
What does the addmargins() command do in R?
Adds a Total row and column
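For readers working in Python rather than R, the sketch below builds a comparable table with totals using only the standard library. It is an illustrative stand-in for `table()` followed by `addmargins()`, not the chapter's own code, and the labels are hypothetical. In practice `pandas.crosstab` with `margins=True` is a common alternative.

```python
from collections import Counter

# Hypothetical actual and predicted labels for eight test records.
actual = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 1, 1, 0, 0, 1, 0, 0]

# Count each (actual, predicted) pair.
counts = Counter(zip(actual, predicted))

labels = [0, 1]
table = [[counts[(a, p)] for p in labels] for a in labels]

# Add margins: a Total column on every row, then a Total row.
with_margins = [row + [sum(row)] for row in table]
with_margins.append([sum(col) for col in zip(*with_margins)])

header = ["", "Predicted: 0", "Predicted: 1", "Total"]
row_names = ["Actual: 0", "Actual: 1", "Total"]
print("\t".join(header))
for name, row in zip(row_names, with_margins):
    print("\t".join([name] + [str(value) for value in row]))
```

Rows hold the actual categories and columns the predicted categories, so the top-left cell is TN and the bottom-right cell is TP.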
What are the costs associated with True Negative (TN)?
$0
What is the cost of a False Positive (FP) in the context of a clothing retailer?
$10
What is the cost associated with a False Negative (FN)?
$0
What does True Positive (TP) represent in terms of cost?
−$40
What is the adjusted cost matrix for the lender as shown in Table 7.6?
CostTN = 0, CostFP = 1, CostFN = 4, CostTP = 0
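The cards do not spell out how an adjusted matrix is obtained. One standard approach, assumed here, is to shift each actual-class row so that the correct decision costs 0 and then rescale so the smaller error cost is 1. Applied to the clothing retailer's direct costs listed in the preceding cards, that procedure yields the same 0, 1, 4, 0 pattern.

```python
# Retailer's direct costs from the preceding cards (negative cost = profit).
cost = {"TN": 0, "FP": 10, "FN": 0, "TP": -40}

# Assumed adjustment: subtract the cost of the correct decision within
# each actual class, so TN and TP both become 0.
adjusted = {
    "TN": cost["TN"] - cost["TN"],
    "FP": cost["FP"] - cost["TN"],
    "FN": cost["FN"] - cost["TP"],
    "TP": cost["TP"] - cost["TP"],
}

# Rescale so that the false positive cost equals 1.
scale = adjusted["FP"]
adjusted = {cell: value / scale for cell, value in adjusted.items()}

print(adjusted)
```

Shifting and rescaling in this way changes the units of cost but not which model or decision is cheapest.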
What is the importance of accounting for unequal error costs in modeling?
To select the most profitable model, since the most accurate model is not necessarily the most profitable
What is the formula for Overall Model Cost when error costs are unequal?
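Overall Model Cost = (FP × CostFP) + (TP × CostTP) for the retailer, because CostTN and CostFN are both $0. In general, it is the sum of count × cost over all four cells.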
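A minimal sketch of the overall cost calculation, summing count times cost over all four cells. The costs are the retailer's values from the earlier cards. The counts and the function name are hypothetical.

```python
CELLS = ("TN", "FP", "FN", "TP")


def overall_model_cost(counts: dict, costs: dict) -> float:
    """Sum of (number of records in a cell) x (cost of that cell)."""
    return sum(counts[cell] * costs[cell] for cell in CELLS)


# Retailer's cost matrix from the cards; a negative cost is a profit.
costs = {"TN": 0, "FP": 10, "FN": 0, "TP": -40}

# Hypothetical contingency-table counts.
counts = {"TN": 850, "FP": 50, "FN": 70, "TP": 30}

total = overall_model_cost(counts, costs)
print(total)
```

With TN and FN costing $0, only the FP and TP cells contribute, so a negative total means the model earns more from true positives than it loses on false positives.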
When comparing models, what is the most important evaluation measure when error costs are unequal?
Model cost per record
What does Sensitivity measure in model evaluation?
Proportion of positive responders captured
What does Specificity measure in model evaluation?
Proportion of nonresponders correctly identified
What is the relationship between Accuracy and Error Rate?
Error Rate = 1 − Accuracy
How is the profit per customer calculated for a model?
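Profit per Customer = −(Overall Model Cost) / GT, where GT is the grand total of records. The sign flips because the cost matrix records profit as a negative cost.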
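Because the cost matrix records profit as negative cost, a profitable model has a negative overall cost. The sketch below therefore flips the sign when converting to profit per customer. The input figures and the function name are hypothetical.

```python
def profit_per_customer(overall_cost: float, grand_total: int) -> float:
    """Average profit per record; a negative cost becomes a positive profit."""
    return -overall_cost / grand_total


# Hypothetical figures: a model with an overall cost of -$700 on 1,000 records.
profit = profit_per_customer(overall_cost=-700, grand_total=1000)
print(f"${profit:.2f} per customer")
```

Dividing by the grand total puts models evaluated on different numbers of records on the same scale.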
What does the term ‘data-driven error costs’ refer to?
Costs derived from analyzing existing data rather than assumed values
What was the mean Sales per Visit used to update the CostTP?
$113.58
What does a contingency table consist of?
Actual categories cross-tabulated against predicted categories
What happens to model accuracy when unequal error costs are considered?
Accuracy may not be the best metric for comparison
What is the effect of having higher sensitivity in a model with unequal error costs?
It tends to improve profitability when false negatives are the costlier error, as in the clothing retailer example
What does the F2 score emphasize in model evaluation?
Favors recall (sensitivity) over precision
What is the cost associated with the All Negative model?
$0, and therefore $0 profit: it predicts no positives, and both TN and FN cost $0 for the retailer
What is the profit per customer for Model 1?
$4.97
What is the profit per customer for Model 2?
$10.87
What is the profit per customer for Model 3?
$12.44
Fill in the blank: The retailer’s false negative cost is ______ times higher than the false positive cost.
four
True or False: Model 1 had the highest accuracy but was the least profitable.
True
What should data scientists do to evaluate their models effectively?
Use metrics that consider unequal error costs
What is the formula for Error Rate?
Error Rate = 1 − Accuracy
This formula provides a way to measure the rate of incorrect predictions in a model.
What do sensitivity and specificity measure?
Sensitivity measures the true positive rate, while specificity measures the true negative rate
These metrics are crucial for evaluating the performance of classification models.
What is the Method for Model Evaluation?
The method involves comparing predicted values against actual values to assess model performance
This typically includes metrics such as accuracy, precision, recall, and F1 score.
Why was the All Negative model chosen as a baseline for Model 1?
Because it always predicts the negative class, its accuracy equals the proportion of actual negatives, which gives a simple benchmark that Model 1 should beat
This helps in understanding the performance of more complex models.
When error costs are unequal, what is the most important evaluation measure for model selection?
Model cost per record, or equivalently profit per customer
This accounts for the different costs associated with false positives and false negatives.
How might a naïve analyst erroneously prefer Model 1 to Model 2?
By only looking at overall accuracy without considering error costs or class distribution
This can lead to misinterpretation of model effectiveness.
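The trap described above can be illustrated with two hypothetical models scored under the retailer's cost matrix from the earlier cards. The counts are invented for illustration and are not the chapter's Model 1 and Model 2 results.

```python
# Retailer's cost matrix from the cards; a negative cost is a profit.
COSTS = {"TN": 0, "FP": 10, "FN": 0, "TP": -40}

# Hypothetical contingency-table counts for two models on 1,000 records.
models = {
    "A": {"TN": 850, "FP": 50, "FN": 70, "TP": 30},
    "B": {"TN": 700, "FP": 200, "FN": 20, "TP": 80},
}


def accuracy(counts: dict) -> float:
    return (counts["TN"] + counts["TP"]) / sum(counts.values())


def cost_per_record(counts: dict) -> float:
    total = sum(n * COSTS[cell] for cell, n in counts.items())
    return total / sum(counts.values())


best_by_accuracy = max(models, key=lambda name: accuracy(models[name]))
best_by_cost = min(models, key=lambda name: cost_per_record(models[name]))

for name, counts in models.items():
    print(name, f"accuracy={accuracy(counts):.2f}",
          f"cost per record={cost_per_record(counts):.2f}")
print("best by accuracy:", best_by_accuracy)
print("best by cost:", best_by_cost)
```

In this made-up case the more accurate model captures fewer positives, so it is also the less profitable one.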
What metrics need to be calculated for the All Positive and All Negative models?
Evaluation metrics including accuracy, precision, recall, and F1 score
These metrics provide a comprehensive view of model performance.
What variables are used to create a C5.0 model (Model 1) for predicting customer Response?
Days since Purchase, # of Purchase Visits, and Sales per Visit
These features are critical for understanding customer behavior.
What is a contingency table used for in model evaluation?
To compare actual and predicted values of a response variable
This helps visualize the performance of the model.
What is the purpose of creating a cost matrix in model evaluation?
To specify the costs associated with different types of errors (false positives and false negatives)
This allows for more informed decision-making based on financial implications.
What is the 4x cost matrix?
A cost matrix where a false negative is considered four times as bad as a false positive
This reflects the differing impacts of errors in certain business contexts.
What should be included in the Model Evaluation Table for Model 2?
Overall Model Cost and Profit per Customer, along with other evaluation metrics
This table helps in comparing multiple models effectively.
What is the purpose of the simplified data-driven cost matrix?
To reflect more accurate costs based on actual data rather than arbitrary values
This enhances the decision-making process in model evaluation.
What is the significance of comparing evaluation measures from different models?
It highlights the strengths and weaknesses of each model and informs better model selection
This comparison is essential for improving predictive performance.
What features are used in the C5.0 model for predicting a loan applicant’s Approval?
Debt‐to‐Income Ratio, FICO Score, and Request Amount
These features are key indicators of a loan applicant’s creditworthiness.
What does the contingency table for Model 1 show?
It compares the actual and predicted values of loan Approval
This aids in assessing the accuracy of the predictions made by the model.
What is the purpose of calculating the mean for the Interest per loan applicant?
To use the negative of that mean as the cost of a true positive in the cost matrix
This establishes a data-driven approach to evaluating the model’s performance.
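A minimal sketch of that step, assuming a short, hypothetical list of interest amounts in place of the real loan data. Only the true positive cell is filled in, because the cards do not state the other data-driven costs.

```python
from statistics import mean

# Hypothetical interest earned per loan applicant, in dollars.
interest = [1200.0, 800.0, 1500.0, 500.0]

mean_interest = mean(interest)

# A correctly approved loan earns the mean interest, so its cost is negative.
cost_tp = -mean_interest

print(f"mean interest = {mean_interest:.2f}, CostTP = {cost_tp:.2f}")
```

Deriving the cost from the data in this way ties the cost matrix to observed revenue rather than to an assumed figure.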
What should be done after building Model 2 for loan approval?
Populate the Model Evaluation Table with evaluation measures using the data-driven cost matrix
This allows for a structured comparison with Model 1.
How can one quantify the financial impact of using data-driven error costs?
By calculating the total profit made by the bank based on the evaluation of the models
This provides insight into the economic benefits of accurate model predictions.