Chapter 5 Flashcards

1
Q

Prediction

A
Average Error, MAPE (Mean Absolute Percentage Error), RMSE (Root Mean Squared Error), assessed on the validation data
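A minimal sketch (not part of the original deck) of how these prediction-error measures can be computed, using hypothetical validation-set arrays y_valid and y_pred:

```python
import numpy as np

# Hypothetical actual and predicted values on the validation data
y_valid = np.array([120.0, 95.0, 180.0, 60.0, 150.0])
y_pred  = np.array([110.0, 100.0, 170.0, 75.0, 140.0])

errors = y_valid - y_pred

average_error = errors.mean()                    # signed average error (bias)
mape = np.mean(np.abs(errors / y_valid)) * 100   # Mean Absolute Percentage Error
rmse = np.sqrt(np.mean(errors ** 2))             # Root Mean Squared Error

print(f"Average Error: {average_error:.2f}")
print(f"MAPE: {mape:.1f}%")
print(f"RMSE: {rmse:.2f}")
```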
2
Q

Classification

A

Classification matrix, specificity, sensitivity

3
Q

ROC (Receiver Operating Characteristic)

A

Used to assess classifier performance at different cutoff values: the curve plots sensitivity against 1 − specificity as the cutoff varies.

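A small sketch of how the points of an ROC curve can be traced by sweeping the cutoff, assuming hypothetical propensities p and actual labels y (a library routine such as scikit-learn's roc_curve does the same job):

```python
import numpy as np

# Hypothetical propensities (estimated P(C1)) and actual classes (1 = important class)
p = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])
y = np.array([1,   1,   0,   1,    0,   0,   1,   0])

for cutoff in np.arange(0.1, 1.0, 0.2):
    pred = (p >= cutoff).astype(int)      # classify using this cutoff
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    tpr = tp / np.sum(y == 1)             # sensitivity
    fpr = fp / np.sum(y == 0)             # 1 - specificity
    print(f"cutoff={cutoff:.1f}  sensitivity={tpr:.2f}  1-specificity={fpr:.2f}")
```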
4
Q

Detect overfitting

A

Compare performance on the validation data to performance on the training data:

some difference is expected, but an extreme gap between the two may indicate overfitting

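A hedged sketch of that comparison, assuming a simple scikit-learn regression model on hypothetical data; the point is only that the error is computed separately on the training and validation partitions and then compared:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data: 200 records, 5 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

model = LinearRegression().fit(X_train, y_train)

rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_valid = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))

# Some gap is expected; a validation RMSE far above training RMSE suggests overfitting
print(f"Training RMSE:   {rmse_train:.2f}")
print(f"Validation RMSE: {rmse_valid:.2f}")
```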
5
Q

Naïve rule

A
Classify all records as belonging to the most prevalent class.
Serves as a benchmark: we hope to do better than that.
Using external predictor information should outperform the naïve rule.
6
Q

Exception to Naïve rule

A

When the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule on overall error (see “lift”, later)

7
Q

There are various performance measures comparing to the naïve rule

A

For example, multiple R-squared measures the model's fit relative to the naïve benchmark.

In prediction, the equivalent of using the naïve rule for classification is to predict ȳ (the sample mean) for every record.

8
Q

Lift Chart for Predictive Error

A

Y axis is the cumulative value of the numeric target variable (e.g., revenue), instead of the cumulative count of “responses”

X axis is the cumulative number of cases, sorted left to right in descending order of predicted value

Benchmark is the average numeric value per record, i.e., not using the model (the naïve rule)

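A minimal sketch of the underlying computation, assuming hypothetical arrays of predicted and actual numeric values (e.g., revenue): the cumulative actual value of records sorted by predicted value is compared with the no-model benchmark of the average value per record.

```python
import numpy as np

# Hypothetical predicted and actual revenue per record (validation data)
predicted = np.array([210.0, 40.0, 180.0, 95.0, 300.0, 20.0, 150.0, 60.0])
actual    = np.array([200.0, 55.0, 160.0, 90.0, 320.0, 10.0, 140.0, 80.0])

# Sort records from highest to lowest predicted value (left to right on the x axis)
order = np.argsort(-predicted)
cumulative_model = np.cumsum(actual[order])          # y axis of the lift chart

# Benchmark: average value per record, i.e. not using the model
cumulative_benchmark = actual.mean() * np.arange(1, len(actual) + 1)

for i, (m, b) in enumerate(zip(cumulative_model, cumulative_benchmark), start=1):
    print(f"top {i} records: model={m:.0f}  benchmark={b:.0f}")
```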
9
Q

Misclassification error

A

Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = percent of misclassified records out of the total number of records in the validation data

10
Q

“High separation of records”

A

means that using predictor variables attains low error

11
Q

“Low separation of records”

A

means that using predictor variables does not improve much on naïve rule

12
Q

Confusion Matrix

A
A table that cross-tabulates actual class against predicted class, showing the counts of correctly and incorrectly classified records for each class.
13
Q

Accuracy

A

1 – err (one minus the overall error rate)

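A short sketch tying the misclassification error, confusion matrix, and accuracy cards together, using hypothetical actual and predicted class vectors:

```python
import numpy as np

# Hypothetical actual and predicted classes (1 = important class C1, 0 = C0)
actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# Confusion matrix: rows = actual class, columns = predicted class
matrix = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1
print("confusion matrix (rows = actual 0/1, cols = predicted 0/1):")
print(matrix)

err = np.mean(actual != predicted)   # misclassification (error) rate
accuracy = 1 - err                   # accuracy = 1 - err
print(f"error rate = {err:.2f}, accuracy = {accuracy:.2f}")
```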
14
Q

Cutoff Table

A

With a cutoff of 0.50, every record with an estimated probability above the cutoff is classified as 1 and every record below it as 0; any record whose actual class differs from this assignment is counted as a misclassification.

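A minimal sketch of applying the 0.50 cutoff to hypothetical propensities and counting the resulting misclassifications:

```python
import numpy as np

# Hypothetical estimated probabilities of class 1 and actual classes
propensity = np.array([0.91, 0.76, 0.62, 0.48, 0.35, 0.22, 0.80, 0.10])
actual     = np.array([1,    1,    0,    1,    0,    0,    0,    0])

cutoff = 0.50
predicted = (propensity >= cutoff).astype(int)   # above the cutoff -> 1, below -> 0

misclassified = np.sum(predicted != actual)      # records classified into the wrong class
print(f"predicted classes: {predicted}")
print(f"misclassified records: {misclassified} of {len(actual)}")
```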
15
Q

When One Class is More Important

A

we are willing to tolerate greater overall error, in return for better identifying the important class for further attention

16
Q

Sensitivity

A

The ability to detect the important class (C1) members correctly: the percentage of actual C1 members classified as C1.

17
Q

Specificity

A

The ability to correctly rule out C0 members: the percentage of actual C0 members classified as C0.

18
Q

False positive

A

% of predicted “C1’s” that were not actually C1’s
A “false alarm”: indicates a given condition exists when it does not.

19
Q

False negative

A

% of predicted “C0’s” that were not actually C0’s

Indicates a given condition does not exist when it really does.
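A sketch of how cards 16–19 translate into formulas, using hypothetical confusion-matrix counts; note that the false positive and false negative rates here follow the deck's definitions (percent of predicted C1's / C0's that were wrong), which differ from the more common actual-based definitions:

```python
# Hypothetical confusion-matrix counts (C1 = important class)
tp = 30   # actual C1, predicted C1
fn = 10   # actual C1, predicted C0
fp = 20   # actual C0, predicted C1
tn = 40   # actual C0, predicted C0

sensitivity = tp / (tp + fn)          # % of actual C1 members detected correctly
specificity = tn / (tn + fp)          # % of actual C0 members ruled out correctly

false_positive_rate = fp / (fp + tp)  # % of predicted C1's that were not C1's
false_negative_rate = fn / (fn + tn)  # % of predicted C0's that were not C0's

print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
print(f"false positive={false_positive_rate:.2f}  false negative={false_negative_rate:.2f}")
```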

20
Q

Lift and Decile Charts: Goal

A
Useful for assessing performance in terms of identifying the most important class.
The goal is to obtain a rank ordering of the records according to their estimated probabilities of class membership.
Compares the performance of the data mining model to “no model, pick randomly”.
21
Q

Decile Chart

A

Example reading: in the “most probable” (top) decile, the model is twice as likely to identify the important class as the average prevalence (a decile lift of 2).

22
Q

Lift vs. Decile Charts

A

Decile chart: shows the results in decile chunks of data; Y axis shows the ratio of the decile mean to the overall mean.

Lift chart: shows continuous cumulative results; Y axis shows the cumulative number of important class records identified.
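A small sketch of both computations on hypothetical propensities and actual class labels: the cumulative count of important-class records gives the lift chart's y values, and each decile's mean divided by the overall mean gives the decile chart's bar heights.

```python
import numpy as np

# Hypothetical propensities and actual classes for 20 validation records
propensity = np.array([0.95, 0.92, 0.88, 0.84, 0.80, 0.74, 0.70, 0.65, 0.60, 0.55,
                       0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05])
actual     = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
                       1, 0, 0, 1, 0, 0, 0, 0, 1, 0])

order = np.argsort(-propensity)              # rank records by estimated probability
sorted_actual = actual[order]
overall_mean = actual.mean()                 # overall prevalence of the important class

# Lift chart y axis: cumulative count of important-class records identified
cumulative_important = np.cumsum(sorted_actual)

# Decile chart y axis: ratio of each decile's mean to the overall mean
decile_lift = [chunk.mean() / overall_mean for chunk in np.array_split(sorted_actual, 10)]

print("cumulative important-class records:", cumulative_important)
print("decile lift ratios:", np.round(decile_lift, 2))
```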