Data Science using Python and R - 7 Flashcards

1
Q

What is the primary focus of model evaluation in data science?

A

To evaluate the usefulness of models in making predictions.

2
Q

What is the difference between evaluation and validation?

A

Validation ensures consistency between training and test data sets, while evaluation measures accuracy and error rates.

3
Q

What type of models are discussed in the context of evaluation measures in this chapter?

A

Classification models, specifically decision trees.

4
Q

What is a contingency table?

A

A table that summarizes the performance of a classification model by comparing predicted and actual outcomes.

5
Q

Define the terms TN, FP, FN, and TP in the context of a contingency table.

A
  • TN: True Negatives
  • FP: False Positives
  • FN: False Negatives
  • TP: True Positives
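A minimal R sketch of how these four cells can be read off a contingency table built with the table() command; the actual and predicted vectors below are hypothetical 0/1 stand-ins for the test-set response and the model's predictions.

  # Hypothetical 0/1 vectors: actual test-set response and model predictions
  actual    <- c(0, 0, 1, 1, 0, 1, 0, 0)
  predicted <- c(0, 1, 1, 0, 0, 1, 0, 0)

  # Contingency table: rows = actual values, columns = predicted values
  tab <- table(actual, predicted)

  TN <- tab["0", "0"]   # actual negative, predicted negative
  FP <- tab["0", "1"]   # actual negative, predicted positive
  FN <- tab["1", "0"]   # actual positive, predicted negative
  TP <- tab["1", "1"]   # actual positive, predicted positive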
6
Q

What does accuracy measure in model evaluation?

A

The proportion of correct classifications made by the model.

7
Q

What is the formula for calculating the error rate?

A

Error Rate = 1 - Accuracy.

8
Q

What do sensitivity and specificity measure in classification models?

A
  • Sensitivity: Ability to classify positive records correctly
  • Specificity: Ability to classify negative records correctly
9
Q

How is precision defined?

A

Precision = TP / TPP, where TPP is the total number of records predicted positive (TP + FP); it is the proportion of positive predictions that are actually positive.

10
Q

What does recall measure?

A

Recall is another term for sensitivity, measuring the proportion of actual positives captured by the model.

11
Q

What are Fβ scores used for?

A

To combine precision and recall into a single measure.

12
Q

What does F1 score represent?

A

The harmonic mean of precision and recall, with equal weighting.
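All of the measures above can be computed directly from the four cell counts of the contingency table. A brief R sketch follows; the counts are illustrative, and TAP (total actually positive) is an abbreviation introduced here by analogy with TAN and TPP.

  # Illustrative cell counts from a contingency table
  TN <- 4; FP <- 1; FN <- 1; TP <- 2

  GT  <- TN + FP + FN + TP   # grand total of records
  TAN <- TN + FP             # total actually negative
  TAP <- FN + TP             # total actually positive (assumed abbreviation)
  TPP <- FP + TP             # total predicted positive

  accuracy    <- (TN + TP) / GT
  error_rate  <- 1 - accuracy
  sensitivity <- TP / TAP    # recall: proportion of actual positives captured
  specificity <- TN / TAN    # proportion of actual negatives captured
  precision   <- TP / TPP    # proportion of positive predictions that are correct

  # F-beta combines precision and recall; beta = 1 gives the F1 score,
  # while beta = 2 (the F2 score) weights recall more heavily than precision
  f_beta <- function(prec, rec, beta = 1) {
    (1 + beta^2) * prec * rec / (beta^2 * prec + rec)
  }
  f1 <- f_beta(precision, sensitivity, beta = 1)
  f2 <- f_beta(precision, sensitivity, beta = 2)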

13
Q

What is the method for model evaluation?

A
  • Develop the model using the training data set
  • Evaluate the model using the test data set
14
Q

What is the target variable in the clothing data example?

A

Response, coded as 1 for positive and 0 for negative.

15
Q

What are the three continuous predictors in the clothing data example?

A
  • Days since Purchase
  • # of Purchase Visits
  • Sales per Visit
16
Q

What does the accuracy of Model 1 indicate?

A

Model 1 has an accuracy of 0.8410, meaning it classified 84.10% of the test records correctly.

17
Q

What is the baseline performance for the All Negative Model?

A

Accuracy = TAN / GT, where TAN is the total number of actual negative records and GT is the grand total of records. Because the All Negative Model classifies every record as negative, its accuracy is simply the proportion of negatives in the data.
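A one-line R illustration of this baseline, assuming actual is a hypothetical 0/1 vector holding the test-set response:

  # The All Negative Model predicts 0 for every record, so its accuracy is
  # simply the proportion of actual negatives in the test set (TAN / GT)
  actual <- c(0, 0, 1, 1, 0, 1, 0, 0)
  baseline_accuracy <- mean(actual == 0)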

18
Q

What does a specificity of 0.9541 indicate about Model 1?

A

The model correctly classified 95.41% of actual negative records.

19
Q

What is indicated by Model 1’s sensitivity of 0.2804?

A

Only 28.04% of actual positive records were classified as positive.

20
Q

How is precision calculated for Model 1?

A

Precision = TP / TPP, the number of true positives divided by the total number of records Model 1 predicted as positive (TP + FP).

21
Q

What does the F1 score of 0.372 signify?

A

It reflects the balance between precision and recall for the model.

22
Q

What are the steps to perform model evaluation using R?

A
  • Develop Model 1 using training data
  • Run test data through Model 1
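A sketch of these two steps in R, using the C5.0() function from the C50 package (the decision-tree family used for the models in this chapter). The data frame names clothing_train and clothing_test and the sanitized column names are illustrative assumptions; the book's variable names contain spaces and '#'.

  library(C50)   # provides C5.0()

  # Step 1: develop Model 1 using the training data set
  # (the target Response must be a factor for C5.0)
  model1 <- C5.0(Response ~ days_since_purchase + purchase_visits + sales_per_visit,
                 data = clothing_train)

  # Step 2: run the test data set through Model 1
  pred <- predict(model1, newdata = clothing_test)

  # Contingency table of actual versus predicted Response
  table(clothing_test$Response, pred)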
23
Q

What command is used to create a contingency table in R?

A

table() command.

25
Q

How do you add row names to a table in R?

A

Use the row.names() function

26
Q

What does the addmargins() command do in R?

A

Adds a Total row and column
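Put together, these commands produce a labeled contingency table with marginal totals. A short sketch, again using hypothetical actual and predicted vectors; rownames() and colnames() are used here to set the dimension labels.

  actual    <- c(0, 0, 1, 1, 0, 1, 0, 0)      # hypothetical test-set response
  predicted <- c(0, 1, 1, 0, 0, 1, 0, 0)      # hypothetical model predictions

  t1 <- table(actual, predicted)              # contingency table

  # Label the rows and columns for readability
  rownames(t1) <- c("Actual: 0", "Actual: 1")
  colnames(t1) <- c("Predicted: 0", "Predicted: 1")

  # Add marginal totals (a row and a column of sums)
  addmargins(t1)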

27
Q

What are the costs associated with True Negative (TN)?

A

$0

28
Q

What is the cost of a False Positive (FP) in the context of a clothing retailer?

A

$10

29
Q

What is the cost associated with a False Negative (FN)?

30
Q

What does True Positive (TP) represent in terms of cost?

31
Q

What is the adjusted cost matrix for the lender as shown in Table 7.6?

A

CostTN = 0, CostFP = 1; CostFN = 4, CostTP = 0

32
Q

What is the importance of accounting for unequal error costs in modeling?

A

To improve profitability and model accuracy

33
Q

What is the formula for Overall Model Cost when error costs are unequal?

A

Overall Model Cost = Total FP Cost + Total TP Cost, where each total is the number of that outcome in the contingency table multiplied by its cost from the cost matrix (the TN and FN cells carry zero cost under this matrix)
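A sketch of this calculation in R. Only CostTN = $0 and CostFP = $10 come from the cards above; the cell counts and the CostFN and CostTP values are illustrative assumptions (CostTP is taken as negative to reflect that a contacted responder produces revenue rather than a cost).

  # Illustrative cell counts from a model's contingency table
  TN <- 4; FP <- 1; FN <- 1; TP <- 2
  GT <- TN + FP + FN + TP

  cost_tn <- 0     # from the cards above
  cost_fp <- 10    # from the cards above
  cost_fn <- 0     # assumption: no direct cost when no contact is made
  cost_tp <- -18   # assumption: contact cost minus revenue from the responder

  # Overall model cost: each outcome count weighted by its per-record cost.
  # With cost_tn = cost_fn = 0 this reduces to total FP cost + total TP cost.
  overall_cost    <- TN * cost_tn + FP * cost_fp + FN * cost_fn + TP * cost_tp
  cost_per_record <- overall_cost / GT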

34
Q

When comparing models, what is the most important evaluation measure when error costs are unequal?

A

Model cost per record

35
Q

What does Sensitivity measure in model evaluation?

A

Proportion of positive responders captured

36
Q

What does Specificity measure in model evaluation?

A

Proportion of nonresponders correctly identified

37
Q

What is the relationship between Accuracy and Error Rate?

A

Error Rate = 1 − Accuracy

38
Q

How is the profit per customer calculated for a model?

A

Profit per Customer = Overall Model Cost / GT, where GT is the grand total of records

39
Q

What does the term ‘data-driven error costs’ refer to?

A

Costs derived from analyzing existing data rather than assumed values

40
Q

What was the mean Sales per Visit used to update the CostTP?

41
Q

What does the term ‘contingency table’ consist of?

A

Actual categories vs. predicted categories

42
Q

What happens to model accuracy when unequal error costs are considered?

A

Accuracy may not be the best metric for comparison

43
Q

What is the effect of having higher sensitivity in a model with unequal error costs?

A

Capturing more of the actual positives avoids costly false negatives, which leads to better profitability

44
Q

What does the F2 score emphasize in model evaluation?

A

Favors recall (sensitivity) over precision

45
Q

What is the cost associated with the All Negative model?

46
Q

What is the profit per customer for Model 1?

47
Q

What is the profit per customer for Model 2?

48
Q

What is the profit per customer for Model 3?

49
Q

Fill in the blank: The retailer’s false negative cost is ______ times higher than the false positive cost.

50
Q

True or False: Model 1 had the highest accuracy but was the least profitable.

51
Q

What should data scientists do to evaluate their models effectively?

A

Use metrics that consider unequal error costs

52
Q

What is the formula for Error Rate?

A

Error Rate = 1 − Accuracy

This formula provides a way to measure the rate of incorrect predictions in a model.

53
Q

What do sensitivity and specificity measure?

A

Sensitivity measures the true positive rate, while specificity measures the true negative rate

These metrics are crucial for evaluating the performance of classification models.

54
Q

What is the Method for Model Evaluation?

A

The method involves comparing predicted values against actual values to assess model performance

This typically includes metrics such as accuracy, precision, recall, and F1 score.

55
Q

Why was the All Negative model chosen as a baseline for Model 1?

A

It provides a conservative baseline for accuracy calibration, as it always predicts the negative class

This helps in understanding the performance of more complex models.

56
Q

When error costs are unequal, what is the most important evaluation measure for model selection?

A

The model cost per record (the expected cost of applying the model), rather than raw accuracy

This accounts for the different costs associated with false positives and false negatives.

57
Q

How might a naïve analyst erroneously prefer Model 1 to Model 2?

A

By only looking at overall accuracy without considering error costs or class distribution

This can lead to misinterpretation of model effectiveness.

58
Q

What metrics need to be calculated for the All Positive and All Negative models?

A

Evaluation metrics including accuracy, precision, recall, and F1 score

These metrics provide a comprehensive view of model performance.

59
Q

What variables are used to create a C5.0 model (Model 1) for predicting customer Response?

A

Days since Purchase, # of Purchase Visits, and Sales per Visit

These features are critical for understanding customer behavior.

60
Q

What is a contingency table used for in model evaluation?

A

To compare actual and predicted values of a response variable

This helps visualize the performance of the model.

61
Q

What is the purpose of creating a cost matrix in model evaluation?

A

To specify the costs associated with different types of errors (false positives and false negatives)

This allows for more informed decision-making based on financial implications.

62
Q

What is the 4x cost matrix?

A

A cost matrix in which a false negative is considered four times as costly as a false positive (CostFN = 4, CostFP = 1)

This reflects the differing impacts of errors in certain business contexts.

63
Q

What should be included in the Model Evaluation Table for Model 2?

A

Overall Model Cost and Profit per Customer, along with other evaluation metrics

This table helps in comparing multiple models effectively.

64
Q

What is the purpose of the simplified data-driven cost matrix?

A

To reflect more accurate costs based on actual data rather than arbitrary values

This enhances the decision-making process in model evaluation.

65
Q

What is the significance of comparing evaluation measures from different models?

A

It highlights the strengths and weaknesses of each model and informs better model selection

This comparison is essential for improving predictive performance.

66
Q

What features are used in the C5.0 model for predicting a loan applicant’s Approval?

A

Debt‐to‐Income Ratio, FICO Score, and Request Amount

These features are key indicators of a loan applicant’s creditworthiness.

67
Q

What does a contingency table for Model 1 predict?

A

It compares the actual and predicted values of loan Approval

This aids in assessing the accuracy of the predictions made by the model.

68
Q

What is the purpose of calculating the mean for the Interest per loan applicant?

A

To use its negative as the cost of a true positive (CostTP) in the cost matrix, since interest earned on a correctly approved loan is a gain rather than a cost

This establishes a data-driven approach to evaluating the model’s performance.
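A one-step sketch of that calculation; the data frame loans and its Interest column are hypothetical names.

  # Mean interest per loan applicant; its negative becomes CostTP,
  # because interest earned on a correctly approved loan is a gain, not a cost
  cost_tp <- -mean(loans$Interest)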

69
Q

What should be done after building Model 2 for loan approval?

A

Populate the Model Evaluation Table with evaluation measures using the data-driven cost matrix

This allows for a structured comparison with Model 1.

70
Q

How can one quantify the financial impact of using data-driven error costs?

A

By calculating the total profit made by the bank based on the evaluation of the models

This provides insight into the economic benefits of accurate model predictions.