Credibility: Robust Evaluation Models Flashcards

1
Q

Credibility

A

Metrics for Performance Evaluation: how to evaluate?
Methods for Performance Evaluation: how to obtain reliable estimates?
Methods for Comparison: how to compare models?
Model Selection: which model to choose?

2
Q

Metrics for Performance Evaluation: how to evaluate?

A

Residual Sum of Squares: RSS = sum_i (y_i - sum_j w_j h_j(x_i))^2
R^2: TSS = sum_i (y_i - mean(y))^2, R^2 = 1 - RSS/TSS
MSE = (1/N) sum_i (yhat_i - y_i)^2
RMSE = sqrt(MSE)
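A minimal sketch of these metrics in plain Python, using small made-up y_true/y_pred values for illustration:

```python
# Toy regression metrics: RSS, TSS, R^2, MSE, RMSE (hypothetical data)
y_true = [3.0, 5.0, 7.0, 9.0]   # observed y_i
y_pred = [2.8, 5.3, 6.9, 9.4]   # model predictions sum_j w_j h_j(x_i)

n = len(y_true)
y_mean = sum(y_true) / n

rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
tss = sum((yt - y_mean) ** 2 for yt in y_true)               # total sum of squares
r2 = 1 - rss / tss
mse = rss / n
rmse = mse ** 0.5

print(rss, tss, r2, mse, rmse)
```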

3
Q

Metrics for Performance Evaluation: Classification

A

Confusion matrix (rows = actual, columns = predicted):
actual +: TP  FN
actual -: FP  TN

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Cost(X): can be used to evaluate a model, to guide the search, or to build the model (e.g., decision trees).

Precision: fraction of predicted positives that are actually positive. Precision = TP / (TP + FP)
Recall: fraction of actual positives (e.g., relevant documents) that are correctly classified. Recall = TP / (TP + FN)
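A quick sketch of these metrics computed from hypothetical confusion-matrix counts:

```python
# Classification metrics from confusion-matrix counts (hypothetical values)
tp, fn, fp, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # predicted positives that are truly positive
recall    = tp / (tp + fn)   # actual positives that are found

print(accuracy, precision, recall)
```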

4
Q

Methods for Performance Evaluation: how to obtain reliable estimates?

A

Training, Validation and Testing

Holdout: e.g., 2/3 for training, 1/3 for testing.
Random Subsampling: repeated holdout.
Cross-Validation: k disjoint subsets (k-fold). Leave-one-out is the case k = N.
Stratified sampling: each class represented with approximately equal proportions in each subset.

Bootstrap: sampling with replacement.
0.632 bootstrap: the probability that an instance is never drawn, and so ends up in the test data, is (1 - 1/N)^N ≈ 0.368.
error = 0.632 * e_test + 0.368 * e_train
Repeat several times and average the results.
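A rough sketch of a k-fold split over index lists plus the 0.632 bootstrap arithmetic, assuming toy values for N and k:

```python
import random

# k-fold cross-validation: k disjoint subsets of the indices (toy N and k)
N, k = 12, 3
idx = list(range(N))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]
for i in range(k):
    test = folds[i]
    train = [j for m, fold in enumerate(folds) if m != i for j in fold]
    # train on `train`, evaluate on `test`, then average the k scores
    print(len(train), len(test))

# 0.632 bootstrap: with N draws with replacement, an instance is never drawn
# (so it ends up in the test data) with probability (1 - 1/N)^N ~ 0.368
p_in_test = (1 - 1 / N) ** N

def bootstrap_error(e_test, e_train):
    return 0.632 * e_test + 0.368 * e_train

print(p_in_test, bootstrap_error(0.30, 0.10))
```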

5
Q

Methods for Comparison: how to compare models?

A

Paired t-test using k-fold cross-validation.
Generate k folds and, for each fold i, compute the performance of models A and B.
d_i = perf_A,i - perf_B,i
mean = (1/k) sum_i d_i
std = sqrt((1/k) sum_i (d_i - mean)^2)
Two hypotheses: H0: mean = 0, Ha: mean != 0
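A sketch of the computation with hypothetical per-fold accuracies; the final t statistic (compared against a t-distribution with k - 1 degrees of freedom) is a step beyond what the card states:

```python
import math

# Paired t-test on k-fold results of two models (hypothetical per-fold scores)
perf_a = [0.81, 0.79, 0.83, 0.80, 0.82]
perf_b = [0.78, 0.80, 0.79, 0.77, 0.80]
k = len(perf_a)

d = [a - b for a, b in zip(perf_a, perf_b)]               # d_i = perf_A,i - perf_B,i
mean = sum(d) / k
std = math.sqrt(sum((di - mean) ** 2 for di in d) / k)
t = mean / (std / math.sqrt(k))                           # test H0: mean = 0

print(mean, std, t)
```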

6
Q

Methods for Comparison: Multiple Testing

A

With a significance threshold of 0.05 applied to 20 independent tests:
P(no false positive) = 0.95^20 ≈ 0.358
P(at least one false positive) = 1 - 0.358 = 0.642

Bonferroni Correction
	Assumes the tests are independent.
	Divide the significance threshold by the number of tests: 0.05/20 = 0.0025.
	P(at least one false positive) = 1 - 0.9975^20 ≈ 1 - 0.9512 = 0.0488.
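The arithmetic behind these numbers, as a short sketch:

```python
# Family-wise error rate with and without the Bonferroni correction
alpha, m = 0.05, 20

p_any_mistake = 1 - (1 - alpha) ** m            # 1 - 0.95^20   ~ 0.642
alpha_bonf = alpha / m                          # 0.05 / 20 = 0.0025
p_any_mistake_bonf = 1 - (1 - alpha_bonf) ** m  # 1 - 0.9975^20 ~ 0.0488

print(p_any_mistake, alpha_bonf, p_any_mistake_bonf)
```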
7
Q

Methods for Comparison: Probabilistic Classifiers

A

Logistic regression returns a probability P(y_i | x_i); by default the positive class is predicted when it exceeds a threshold of 0.5.
A higher threshold (e.g., 0.75) can be required before assigning a given label.
Higher threshold: more precision, lower recall.
Lower threshold: less precision, more recall.
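A toy illustration of the trade-off, with made-up probabilities and labels:

```python
# Effect of the decision threshold on precision and recall (toy scores)
probs  = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

def precision_recall(threshold):
    pred = [p >= threshold for p in probs]
    tp = sum(1 for y, yh in zip(labels, pred) if y == 1 and yh)
    fp = sum(1 for y, yh in zip(labels, pred) if y == 0 and yh)
    fn = sum(1 for y, yh in zip(labels, pred) if y == 1 and not yh)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.5))    # lower threshold: less precision, more recall
print(precision_recall(0.75))   # higher threshold: more precision, less recall
```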

8
Q

Methods for Comparison: Precision-Recall Curves

A

Plot precision as a function of recall while varying the threshold value.

The best classifier would be the one whose precision stays equal to one at every recall level.
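A sketch that builds the (recall, precision) points by sweeping the threshold over the same kind of toy scores:

```python
# Precision-recall curve: one point per candidate threshold (toy data)
probs  = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1,    1,    1,    0,    1,    0,    0,    0]

for t in sorted(set(probs), reverse=True):
    pred = [p >= t for p in probs]
    tp = sum(1 for y, yh in zip(labels, pred) if y == 1 and yh)
    fp = sum(1 for y, yh in zip(labels, pred) if y == 0 and yh)
    fn = sum(1 for y, yh in zip(labels, pred) if y == 1 and not yh)
    print(f"threshold={t:.2f}  recall={tp / (tp + fn):.2f}  precision={tp / (tp + fp):.2f}")
```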

9
Q

Methods for Comparison: Receiver Operating Characteristic (ROC)

A
TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

(FPR, TPR) = (0,0): classify everything as negative
(FPR, TPR) = (1,1): classify everything as positive
(FPR, TPR) = (0,1): ideal classifier

Often no model consistently outperforms the others across the whole curve.
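A small sketch of the corner points, assuming P positives and N negatives in the data:

```python
# ROC coordinates (FPR, TPR) for the trivial classifiers and the ideal one
def roc_point(tp, fn, fp, tn):
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

P, N = 50, 50                                # positives / negatives in the data
print(roc_point(0, P, 0, N))   # predict everything negative -> (0, 0)
print(roc_point(P, 0, N, 0))   # predict everything positive -> (1, 1)
print(roc_point(P, 0, 0, N))   # ideal classifier            -> (0, 1)
```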

10
Q

Methods for Comparison: Lift Charts

A

A measure of the effectiveness of a predictive model: the ratio between the results obtained with and without the model.
Cumulative gains and lift charts are visual aids for this comparison.
The greater the area between the lift curve and the baseline, the better the model.
Example: mailing 100,000 customers with a 0.4% response rate yields 400 responses.
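A sketch of the lift arithmetic; the top-segment hit count below is a hypothetical number chosen for illustration:

```python
# Lift: response rate in the model-ranked top segment vs. the baseline rate
customers = 100_000
base_rate = 0.004                             # 0.4% -> 400 responders overall
responders = customers * base_rate

top_segment = int(0.10 * customers)           # contact the model's top 10%
hits_in_segment = 200                         # hypothetical responders captured there
segment_rate = hits_in_segment / top_segment  # 2.0%
lift = segment_rate / base_rate               # 5x better than contacting at random

print(responders, segment_rate, lift)
```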

11
Q

Model Selection: which model to choose?

A

Occam’s Razor: the best theory is the smallest one that describes all the facts.
No Free Lunch theorem: averaged over all possible problems, no model is favored over another.
