Credibility: Robust Evaluation of Models (Flashcards)
Credibility
Metrics for Performance Evaluation: How to evaluate performance?
Methods for Performance Evaluation: How to obtain reliable estimates?
Methods for Comparison: How to compare models?
Model Selection: Which model to choose?
Metrics for Performance Evaluation: How to evaluate performance?
Residual Sum of Squares (RSS): RSS = sum_i (yi - sum_j wj hj(xi))^2
R2: TSS = sum_i (yi - ybar)^2, where ybar is the mean of y. R2 = 1 - RSS/TSS
MSE = (1/N) sum_i (yi* - yi)^2, where yi* is the predicted value
RMSE = sqrt(MSE)
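A minimal Python/NumPy sketch of the regression metrics above; the values in y_true and y_pred are made-up examples, not from the source.

```python
# Regression metrics: RSS, R2, MSE, RMSE (example data only).
import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 5.5])   # observed yi
y_pred = np.array([2.8, 2.7, 4.3, 5.0])   # model predictions sum_j wj hj(xi)

rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - rss / tss                            # coefficient of determination
mse = np.mean((y_pred - y_true) ** 2)         # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error
print(rss, r2, mse, rmse)
```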
Metrics for Performance Evaluation: Classification
Confusion matrix (rows: actual class, columns: predicted class):
             Predicted +   Predicted -
Actual +     TP            FN
Actual -     FP            TN
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Cost(X): a cost function can be used to evaluate a model, to guide the search, or to build the model itself (e.g., decision trees).
Precision: fraction of predicted positives that are actually positive. Precision = TP/(TP+FP)
Recall: fraction of actual positives (e.g., relevant documents) that are correctly classified. Recall = TP/(TP+FN)
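A minimal sketch computing accuracy, precision, and recall from confusion-matrix counts; the counts are made-up example values.

```python
# Classification metrics from confusion-matrix counts (example data only).
TP, FN, FP, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)   # predicted positives that are truly positive
recall = TP / (TP + FN)      # actual positives that were found
print(accuracy, precision, recall)
```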
Methods for Performance Evaluation: How to obtain reliable estimates?
Training, Validation and Testing
Holdout: 2/3 of the data for training, 1/3 for testing.
Random Subsampling: repeated holdout.
Cross-Validation: split the data into k disjoint subsets (k-fold); Leave-one-out uses k = n.
Stratified sampling: each class is represented with approximately equal proportions in every subset.
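A minimal sketch of splitting n samples into k disjoint folds; the function and variable names are illustrative (scikit-learn's KFold provides the same behavior).

```python
# k-fold splitting: k disjoint test subsets, rest used for training.
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)          # k disjoint subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(sorted(test_idx))                     # each index appears in exactly one test fold
```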
Bootstrap: sampling with replacement.
0.632 bootstrap: the probability that an instance is never sampled, and so ends up in the test data, is (1 - 1/n)^n ≈ e^-1 ≈ 0.368.
error = 0.632 * e_test + 0.368 * e_train
Repeat the sampling process several times and average the results.
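A minimal sketch of one 0.632-bootstrap round; the error rates e_train and e_test are hypothetical placeholders, and in practice the round is repeated and averaged.

```python
# One 0.632-bootstrap round: sample with replacement, test on the rest.
import numpy as np

n = 1000
rng = np.random.default_rng(0)
train_idx = rng.integers(0, n, size=n)            # sample n instances WITH replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)  # instances never picked (~36.8%)

print(len(test_idx) / n)                          # close to (1 - 1/n)^n ≈ e^-1 ≈ 0.368

e_train, e_test = 0.05, 0.20                      # hypothetical error rates
error = 0.632 * e_test + 0.368 * e_train          # combined bootstrap estimate
print(error)
```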
Methods for Comparison: How to compare models?
Paired t-Test using k-fold Cross-Validation
Generate k folds and, for each fold, compute the performance of models A and B.
di = perf_A,i - perf_B,i
mean = (1/k) sum_i di
std = sqrt((1/k) sum_i (di - mean)^2)
Two hypotheses: H0: mean = 0 (no difference between the models), Ha: mean != 0.
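A minimal sketch of the paired t-test on per-fold performance differences; the per-fold scores are made-up, and SciPy's ttest_rel is used in place of the hand-computed statistic.

```python
# Paired t-test on per-fold scores of two models (example data only).
import numpy as np
from scipy import stats

perf_a = np.array([0.81, 0.79, 0.84, 0.80, 0.82])   # model A, k = 5 folds
perf_b = np.array([0.78, 0.77, 0.80, 0.79, 0.80])   # model B, same folds

d = perf_a - perf_b                                  # per-fold differences di
t_stat, p_value = stats.ttest_rel(perf_a, perf_b)    # tests H0: mean difference = 0
print(d.mean(), t_stat, p_value)
```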
Methods for Comparison: Multiple Testing
With a 0.05 significance threshold applied to 20 different tests:
probability of no mistake in any test = 0.95^20 ≈ 0.358
probability of at least one mistake = 1 - 0.358 = 0.642
Bonferroni Correction: assumes the tests are independent. Divide the significance threshold by the number of tests: 0.05/20 = 0.0025. The probability of at least one mistake becomes 1 - 0.9975^20 = 1 - 0.9512 = 0.0488.
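A minimal sketch of the family-wise error calculation above, with and without the Bonferroni correction (20 independent tests assumed).

```python
# Probability of at least one false positive across m independent tests.
m, alpha = 20, 0.05

p_no_mistake = (1 - alpha) ** m                    # 0.95^20 ≈ 0.358
print(1 - p_no_mistake)                            # ≈ 0.642 without correction

alpha_corrected = alpha / m                        # Bonferroni: 0.05 / 20 = 0.0025
p_no_mistake_corr = (1 - alpha_corrected) ** m     # 0.9975^20 ≈ 0.9512
print(1 - p_no_mistake_corr)                       # ≈ 0.0488 with correction
```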
Methods for Comparison: Probabilistic Classifiers
Logistic Regression returns a probability
Classify as positive when P(yi|xi) exceeds a threshold, typically 0.5.
A stricter threshold (e.g., 0.75) can be required for one label when higher confidence is needed.
Higher threshold: higher precision, lower recall.
Lower threshold: lower precision, higher recall.
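A minimal sketch of how the decision threshold trades precision against recall; the probabilities and labels are made-up example values.

```python
# Precision/recall at two different probability thresholds (example data only).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_pos  = np.array([0.9, 0.6, 0.7, 0.8, 0.4, 0.3, 0.55, 0.2])  # P(y=1|x)

for threshold in (0.5, 0.75):
    y_pred = (p_pos >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    print(threshold, precision, recall)   # higher threshold -> precision up, recall down
```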
Methods for Comparison: Precision-Recall Curves
Precision plotted as a function of recall, obtained by varying the threshold.
The best classifier would be the one with precision always equal to one.
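A minimal sketch of a precision-recall curve using scikit-learn's precision_recall_curve; the labels and scores are illustrative example values.

```python
# Precision-recall curve over all thresholds (example data only).
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.6, 0.7, 0.8, 0.4, 0.3, 0.55, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r in zip(precision, recall):
    print(round(r, 2), round(p, 2))   # precision as a function of recall
```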
Methods for Comparison: Receiver Operating Characteristic (ROC)
TPR = TP/(TP+FN) FPR = FP/(FP+TN)
(0,0): everything classified as negative
(1,1): everything classified as positive
(0,1): ideal point (FPR = 0, TPR = 1)
In general no model consistently outperforms the others; ROC curves of different models may cross.
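A minimal sketch of ROC points and AUC using scikit-learn; the labels and scores are made-up example values.

```python
# ROC curve points (FPR, TPR) and area under the curve (example data only).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.6, 0.7, 0.8, 0.4, 0.3, 0.55, 0.2])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))               # runs from (0,0) to (1,1)
print(roc_auc_score(y_true, scores))     # area under the ROC curve
```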
Methods for Comparison: Lift Charts
A measure of the effectiveness of a predictive model.
Computed as the ratio between the results obtained with and without the predictive model.
Cumulative Gains and Lift Charts are visual aids
The greater the area between the lift curve and the baseline, the better the model.
Example: with 100,000 customers and a response rate of 0.4%, that is 400 responses.
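A minimal sketch of a cumulative-gains / lift computation; the scores and responses are simulated, with only the 0.4% response rate taken from the example above.

```python
# Lift of targeting the top-scored 10% of customers vs. random contact.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
responded = rng.random(n) < 0.004            # ~400 responders overall (0.4%)
score = np.where(responded, rng.random(n) * 0.5 + 0.5, rng.random(n))  # model scores responders higher on average

order = np.argsort(-score)                   # contact best-scored customers first
top_10pct = order[: n // 10]
gain = responded[top_10pct].sum() / responded.sum()   # fraction of responders captured
lift = gain / 0.10                           # ratio vs. contacting 10% at random
print(gain, lift)
```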
Model Selection: Which model to choose?
Occam’s Razor: best theory is the smallest one that describes all the facts.
No Free Lunch theorem: averaged over all possible problems, no model is favored over the others.