model analysis Flashcards

1
Q

reasons predictive models fail

A

PVEOQ

  • inadequate preprocessing of the data
  • inadequate model validation (eg not enough cross-fold validation statistics)
  • unjustified extrapolation (eg the model tries to predict a point outside its training region in predictor phase space)
  • overfitting
  • not considering enough models
2
Q

variance / bias tradeoff (and underfit / overfit)

A
  • bias–how far our model’s fit is from the theoretical optimum (ie the best achievable fit, limited only by irreducible noise)
  • variance–how much the fitted model varies over an ensemble of full resampling and training runs
  • overfitting–tends to give low bias, but the model is overtrained on its specific training data and might not do well on “unseen data”; the variance between runs will be high
  • underfitting–not very accurate (high bias), but it tends to perform about the same on unseen data (eg a straight line fit to an amorphous point cloud)
  • eg in cross-validation
    • very many folds can reduce bias (ie each model is trained on the k-1 folds that make up most of the data set, so it is close to the best model we could expect given the training data), but show high variance between resamples (eg the small held-out fold could contain a lot of outliers)
    • very few folds can produce a poorer fit (high bias, since each model sees less of the data), but low variance (performs about the same on unseen data)
3
Q

decomposition of MSE in model fitting and error analysis

A
  • the MSE, (1/n) sum_{test_samples} (y_i - y-hat_i)^2, can have its expected value decomposed into components
  • E(MSE) = sig^2 + (model bias)^2 + model variance
    • sig^2–the inherent, irreducible noise in the data (sig is the standard deviation of that noise)
    • bias–the misfit between the model in question and the ideal model’s fitting surface
    • variance–the variation between fitted models as the model is trained on different samples from the population
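A short derivation of the decomposition at a single test point may help. It is a sketch under assumptions not stated on the card: the data follow y = f(x) + eps with E[eps] = 0 and Var(eps) = sig^2, and eps is independent of the fitted model f-hat, so the cross terms vanish. The expectation is taken over both the noise and the training samples.

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) + \varepsilon - \hat{f}(x))^2\big] \\
  &= \underbrace{\mathbb{E}[\varepsilon^2]}_{\sigma^2}
     + \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] \\
  &= \sigma^2
     + \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
     + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
\end{aligned}
```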
4
Q

Kappa (Cohen’s)

A
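  • a measure of agreement corrected for the agreement expected by chance; rooted in comparing eg 2 raters’ labels, and for classification models it compares predicted and observed classes
  • considers the confusion matrix
  • kappa = 1 - (1-p_o) / (1-p_e) = (p_o - p_e) / (1-p_e)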
    • p_o is observed agreement
    • p_e is expected agreement
  • for binary confusion matrix
    • p_o is the proportion of samples on the diagonal
    • p_e is the sum over the 2 classes of (row marginal proportion × column marginal proportion) for that class’s diagonal entry (sum of 2 products)
  • Cohen’s kappa already handles > 2 classes; Fleiss’ kappa extends the idea to > 2 raters
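As a worked illustration of the formula, here is a minimal sketch that computes kappa from a confusion matrix of counts. The function name `cohens_kappa` and the example counts are made up for illustration.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix given as rows of counts."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: proportion of samples on the diagonal.
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Marginal proportions for rows and columns.
    row_marg = [sum(row) / n for row in confusion]
    col_marg = [sum(confusion[i][j] for i in range(k)) / n for j in range(k)]
    # Expected agreement: sum over classes of row marginal * column marginal.
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))
    return 1 - (1 - p_o) / (1 - p_e)


# Hypothetical binary confusion matrix (illustrative counts only).
confusion = [[20, 5],
             [10, 15]]
kappa = cohens_kappa(confusion)
```

For these counts p_o = 0.7 and p_e = 0.5, so kappa = 1 - 0.3/0.5 = 0.4.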
5
Q

log loss

A
  • aka logarithmic loss or cross-entropy loss
  • eg for binary classifier
    • form the likelihood–how likely did the model think the actual training observations were?
    • ie for classes {0,1}, with p_j the predicted probability that instance j is class 1: if instance j is labeled 0, use (1-p_j), and if instance j is labeled 1, use p_j–then take the product over all instances
  • the negative of the log of this product is the log loss (often divided by the number of instances to give a mean)
  • penalizes “confidently wrong” answers more than eg Brier score
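The recipe can be sketched in a few lines for the binary case. This is a minimal illustration, not a library implementation: the function name `log_loss`, the clipping constant `eps` (added to avoid log(0)), and the example values are assumptions. It sums log terms, which equals the negative log of the product.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Negative log-likelihood for a binary classifier.

    labels: 0/1 class labels; probs: predicted probability of class 1.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0
        # Use p for instances labeled 1 and (1 - p) for instances labeled 0.
        total += math.log(p) if y == 1 else math.log(1 - p)
    return -total


# Illustrative values only.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
loss = log_loss(labels, probs)  # equals -log(0.9 * 0.8 * 0.6)
```

Dividing the result by `len(labels)` gives the mean form that many libraries report.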
6
Q

Brier score

A
  • for probabilistic classification models, a measure of accuracy
  • k classes are coded as k-tuples, eg (0,1,0) for a training instance of class B of A, B, C
  • then take the squared error between each instance’s class label, coded as a tuple, and the model’s predicted class probabilities–sum over the k classes within each instance, then average over all instances
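A minimal sketch of the multi-class form described above: squared differences are summed over classes within each instance and then averaged over instances. The function name `brier_score` and the example data are illustrative assumptions.

```python
def brier_score(one_hot_labels, predicted_probs):
    """Multi-class Brier score: per-instance sum over classes, mean over instances."""
    n = len(one_hot_labels)
    total = 0.0
    for label, probs in zip(one_hot_labels, predicted_probs):
        # Sum of squared differences across the k classes for this instance.
        total += sum((p - t) ** 2 for t, p in zip(label, probs))
    return total / n


# Two instances over classes (A, B, C); illustrative values only.
labels = [(0, 1, 0), (1, 0, 0)]
probs = [(0.2, 0.7, 0.1), (0.5, 0.3, 0.2)]
score = brier_score(labels, probs)
```

Here the per-instance sums are 0.14 and 0.38, giving a score of 0.26. Lower is better, and a perfect prediction scores 0.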
7
Q

odds ratio

A
  • can be used to assess informativeness of binary predictors in the case of binary classification; literally a ratio of odds (ie p/(1-p))
  • compute the “probability” of an event (say, the binary outcome variable is “positive”) from the instance class labels, for both levels of the (binary) predictor–then,
    • OR = (odds of “positive” outcome at predictor level A) / (odds of “positive” outcome at predictor level B) = [p1/(1-p1)] / [p2/(1-p2)] = p1(1-p2) / (p2(1-p1))
    • represents the factor by which the odds of the “event” change when going from level B of the predictor (ie related to p2) to level A (ie related to p1)
  • can also be used for control/treatment group comparisons, where it is effectively independent of the class priors of the sample dataset, so it can be applied to the general population without adjustment (Kaplan)
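The arithmetic can be sketched as follows, computing the odds at level A divided by the odds at level B. The function names and counts are hypothetical, chosen only for illustration.

```python
def odds(p):
    """Odds corresponding to a probability p (0 < p < 1)."""
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Odds of the event at predictor level A (p1) over level B (p2)."""
    return odds(p1) / odds(p2)


def odds_ratio_from_counts(pos_a, neg_a, pos_b, neg_b):
    """Same ratio from the counts of a 2x2 table (cross-product form)."""
    return (pos_a * neg_b) / (neg_a * pos_b)


# Hypothetical counts: level A has 40 positive / 10 negative,
# level B has 25 positive / 25 negative, so p1 = 0.8 and p2 = 0.5.
or_from_probs = odds_ratio(0.8, 0.5)
or_from_counts = odds_ratio_from_counts(40, 10, 25, 25)
```

Both forms give 4 for this table. A value above 1 means the event has higher odds at level A than at level B, and a value of 1 means the predictor carries no information about the outcome.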
8
Q

global min/max search routines

A
  • eg have created a predictive model, now want to find the point (predictor values) corresponding to max/min outcome values
  • methods:
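    • Nelder-Mead simplex method (derivative-free direct search; it converges to a local optimum, so it is commonly restarted from several starting points)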
    • simulated annealing
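A minimal simulated annealing sketch, to show the shape of such a search. Everything here is an assumption for illustration: `surface` is a hypothetical stand-in for a fitted model's prediction function, and the step size, cooling schedule, and iteration count are arbitrary choices, not recommended settings. To maximize instead, minimize the negated function.

```python
import math
import random


def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.995,
                        n_iter=5000, seed=0):
    """Minimize f starting at x0; returns the best point seen and its value."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    best_x, best_f = list(x), fx
    t = t0
    for _ in range(n_iter):
        # Propose a random Gaussian perturbation of the current point.
        cand = [xi + rng.gauss(0.0, step) for xi in x]
        fc = f(cand)
        # Always accept improvements; accept a worse point with
        # probability exp(-(fc - fx) / t), which shrinks as t cools.
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = list(x), fx
        t *= cooling  # geometric cooling schedule
    return best_x, best_f


def surface(x):
    """Hypothetical stand-in for a model's predicted outcome (multimodal)."""
    return (x[0] - 1) ** 2 + (x[1] + 2) ** 2 + math.sin(3 * x[0])


start = [4.0, 4.0]
best_x, best_f = simulated_annealing(surface, start)
```

Because the routine tracks the best point seen, the returned value is never worse than the starting value; it is not guaranteed to be the global optimum.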