graphs Flashcards

1
Q

receiver operating characteristic (ROC) graph

A

usually for ranking classifiers (usually binary); for each cutoff n, accept the n most likely instances as “positive” out of a test set of size N and build the confusion matrix; plot the false positive rate on the x axis (false positives among the top n, divided by total negatives in the whole set) against the true positive rate on the y axis (true positives among the top n, divided by total positives in the whole set); repeat over all n from 0 to N

features:

  • ROC graphs remove class priors (eg class proportion imbalances)–they allow looking at the model’s predictive power (“if there are many negative examples, even a moderate false alarm rate can be unmanageable”)
  • do not factor in costs/benefits
  • for ranking classifiers, the area under the ROC curve is of significance (above and to left of diagonal); this statistic is equivalent to the Mann-Whitney-Wilcoxon measure; it’s also equivalent to the Gini coefficient (with a “minor algebraic transformation”)

ROC space details:

  • a classifier near the lower-left corner (left side, near the x-axis, above the main diagonal) is interpreted as “conservative”: it makes in-class predictions only with strong evidence, so it makes few false positive errors (but sacrifices true positives in the process)
  • a classifier near the upper-right corner (above the main diagonal, but on the right-hand side, with y close to 1) is interpreted as “permissive”: it makes positive classifications even with “weak evidence”, so it catches most positives at the cost of many false positives
  • diagonal line from (0,0) to (1,1): the policy of “guessing a class” (in a Bernoulli sense); eg a classifier that guesses the positive class half the time (coin-flip-wise) converges to (0.5,0.5); one that guesses positive 90% of the time converges to (0.9,0.9)
  • any performance in the square half below and to the right of the (0,0) to (1,1) diagonal would be “worse than random guessing”
  • a ranking model (usually) starts with everything classified as “N” (ie we select the top “zero” entries of the test set in the ranking order)–so in the LL corner of the ROC space (0,0) / nothing is ranked as positive, so both true and false positive rates are 0 (highly conservative)
  • at the other extreme, n = N, the ranking model accepts everything as positive regardless of evidence, which puts the point in the UR corner of ROC space (1,1) (highly permissive)
  • for optimal ranking classifiers, we would expect the curve getting close to ideal–UL corner in ROC space (0,1), where all true positives in the test set have been accurately classified, with no false positives
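
A minimal sketch of the construction described above, assuming binary labels (1 = positive, 0 = negative) and scores where higher means "more likely positive". The function names `roc_points` and `auc` and the toy data are hypothetical, chosen for illustration only; ties in score are broken by input order, which is a simplification.

```python
def roc_points(scores, labels):
    """Return ROC points (fpr, tpr) for every cutoff n = 0..N."""
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    # Rank instances from most to least likely positive.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]  # n = 0: nothing accepted as positive
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
gini = 2 * area - 1  # commonly cited form of the AUC-to-Gini transformation
```

The curve starts at (0,0) for n = 0 and ends at (1,1) for n = N, matching the two corner cases in the list above.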
2
Q

profit curve

A

with a ranking classifier, create the confusion matrix for accepting the n most likely instances as positive; compute profit/loss from the confusion matrix (using a cost or benefit value for each cell); plot profit/loss as a function of n
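
A rough sketch of the idea, under two stated assumptions: each accepted true positive earns a fixed benefit and each accepted false positive incurs a fixed cost, while rejected instances contribute nothing. The function name `profit_curve`, the benefit/cost values, and the toy data are all hypothetical.

```python
def profit_curve(scores, labels, benefit_tp, cost_fp):
    """Profit for accepting the top-n ranked instances, for n = 0..N."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    profits = [0.0]  # n = 0: nothing accepted, no profit or loss
    running = 0.0
    for _, label in ranked:
        # Accepting one more instance adds a benefit (true positive)
        # or a cost (false positive).
        running += benefit_tp if label == 1 else -cost_fp
        profits.append(running)
    return profits


# Toy data and made-up cost/benefit values, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
profits = profit_curve(scores, labels, benefit_tp=10.0, cost_fp=4.0)
best_n = max(range(len(profits)), key=lambda n: profits[n])
```

Plotting `profits` against n gives the profit curve; `best_n` is the cutoff with the highest profit on this data.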

3
Q

2x2 classification table

A

frequency matrix for binary classification problems

usually,
predictions are on rows: (1) positive, (2) negative
true classes are on columns: (1) positive, (2) negative

rates are column-based:

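  • sensitivity aka recall, true positive rate: proportion of actual positive outcomes predicted positive
  • specificity aka true negative rate: proportion of actual negative outcomes predicted negative (not the same as precision, which is row-based: the proportion of positive predictions that are actually positive)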
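
A small worked example of the column-based rates, with precision included for contrast since it is computed along the predicted-positive row rather than a column. The function name `rates` and the cell counts are hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Rates from a 2x2 table (rows = predictions, columns = true classes)."""
    return {
        # Column-based: condition on the true class.
        "sensitivity": tp / (tp + fn),  # true positive rate, recall
        "specificity": tn / (tn + fp),  # true negative rate
        # Row-based: condition on the prediction.
        "precision": tp / (tp + fp),
    }


# Made-up counts for illustration.
r = rates(tp=40, fp=10, fn=20, tn=30)
```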
4
Q

confusion matrix

A

a frequency matrix for classification problems; each row a model (class) prediction and each column the actual class; the closer to diagonal the matrix is, the better the model; useful for imbalanced classes, giving more information re accuracy
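
A minimal sketch for more than two classes, following the row = prediction, column = actual class layout described above. The function name `confusion_matrix` and the toy labels are hypothetical.

```python
def confusion_matrix(predicted, actual, classes):
    """Count matrix with one row per predicted class, one column per actual class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1
    return matrix


# Toy labels, invented for illustration.
classes = ["a", "b", "c"]
actual = ["a", "a", "b", "b", "c", "c", "c"]
predicted = ["a", "b", "b", "b", "c", "a", "c"]
cm = confusion_matrix(predicted, actual, classes)
# Accuracy is the share of counts that sit on the diagonal.
accuracy = sum(cm[i][i] for i in range(len(classes))) / len(actual)
```

The off-diagonal cells show which classes get confused with which, detail that the single accuracy figure hides.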

5
Q

learning curve

A

for a given model and a fixed holdout set size, plot the model accuracy as a function of training set data size; typically plateaus as marginal gain of more data goes to 0

6
Q

gini coefficient

A

a general measure of dispersion: the area between the Lorenz curve and the diagonal line, as a fraction of the total area under the diagonal; eg plot the cumulative holdings of wealth by the population, with population ordered in increasing order of wealthiness; if everyone had the same wealth, the Lorenz curve is the diagonal and g.c.=0
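
A sketch of the calculation, assuming non-negative values and a piecewise-linear Lorenz curve. Since the area under the diagonal is 0.5, the ratio "area between curve and diagonal over area under diagonal" reduces to 1 minus twice the area under the Lorenz curve. The function name `gini` and the sample values are hypothetical.

```python
def gini(values):
    """Gini coefficient from a list of non-negative holdings."""
    xs = sorted(values)  # population in increasing order of wealth
    n = len(xs)
    total = sum(xs)
    # Lorenz curve: cumulative share of wealth at each population step.
    lorenz = [0.0]
    running = 0.0
    for x in xs:
        running += x
        lorenz.append(running / total)
    # Trapezoid-rule area under the Lorenz curve (each step is 1/n wide).
    area_under = sum((lorenz[i] + lorenz[i + 1]) / 2 for i in range(n)) / n
    return 1 - 2 * area_under


equal = gini([1, 1, 1, 1])     # everyone holds the same amount
unequal = gini([0, 0, 0, 1])   # one person holds everything
```

With a finite population of size n, the maximum under this discrete calculation is (n - 1) / n rather than exactly 1.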

7
Q

fitting graph

A

typically x axis is “model complexity” and y axis is model accuracy on (a) training data and (b) holdout data; “sweet spot” is where training data and holdout data plots are about to diverge away from each other–where training data starts to get increasingly accurate (overfitting), and holdout accuracy starts to plunge

8
Q

cumulative response curve and lift curve

A

for a ranking classifier at cutoff n with a test set of size N, plots the true positive rate on the y-axis (true positives among the top n, divided by the total number of positives in the test set) against the proportion of the population that is targeted, ie classified into the class of relevance (n/N)

features:

  • similar to ROC curve, the greater the “lift” (rise above the main diagonal), the better the performance
  • in a true lift curve, the performance at any x value registers as the ratio between the curve’s value and the diagonal
  • cumulative response curves are not entirely independent of class priors–class priors determine potential rate of increase of the curve (unlike with ROC)
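
A minimal sketch of both curves, assuming binary 0/1 labels and scores where higher means "more likely positive". The function names `cumulative_response` and `lift` and the toy data are hypothetical.

```python
def cumulative_response(scores, labels):
    """Points (n/N, share of all positives captured in the top n), n = 0..N."""
    total_pos = sum(labels)
    n_total = len(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    curve = [(0.0, 0.0)]
    tp = 0
    for n, (_, label) in enumerate(ranked, start=1):
        tp += label  # labels are 0/1, so this counts true positives
        curve.append((n / n_total, tp / total_pos))
    return curve


def lift(curve):
    """Lift at each x > 0: curve value divided by the diagonal's value (which is x)."""
    return [(x, y / x) for x, y in curve if x > 0]


# Toy data, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = cumulative_response(scores, labels)
lift_curve = lift(curve)
```

Lift always ends at 1 when the whole population is targeted, since at that point the classifier does no better than selecting everyone.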
9
Q

dendrogram

A

a 2-D visualization for progressive clustering; instances are on the x-axis, and the degree of clustering (low to high) is on the y-axis; the instances are ordered so that initial clusters are immediate neighbors, recursing on this ordering scheme as clustering is increased (i.e. at a given height / level on the dendrogram, the ordering scheme applies to subgroups of instances)

10
Q

entropy graph

A

re segmentation and information gain–a visualization of the weighted-sum-of-entropies resulting from any given segmentation scheme–each segment occupies a proportion (0 to 1) on the x axis, the segment’s height is the classification entropy (so a kind of bar plot); low height means low entropy (so “good” classification for that segment)
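
A sketch of the quantities behind the graph: each bar's width is the segment's share of instances and its height is the segment's entropy, so the total bar area is the weighted sum of entropies. The function names `entropy` and `entropy_bars` and the toy segmentation are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * log2(p)
    return result


def entropy_bars(segments):
    """One (width, height) pair per segment of the entropy graph."""
    total = sum(len(s) for s in segments)
    return [(len(s) / total, entropy(s)) for s in segments]


# Toy data: a parent set split into two segments, invented for illustration.
parent = ["y", "y", "y", "y", "n", "n", "n", "n"]
segments = [["y", "y", "y", "n"], ["y", "n", "n", "n"]]
bars = entropy_bars(segments)
weighted = sum(w * h for w, h in bars)  # total area of the bars
info_gain = entropy(parent) - weighted  # reduction relative to the unsplit set
```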

11
Q

scree plot

A

used (at least) in context of PCA, showing the percent of total variance as a function of the number of (leading) PCA components retained; so it allows figuring out how many PCA components to retain for modeling purposes
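
A small sketch of the numbers a scree plot displays, assuming the PCA eigenvalues (component variances) are already available. The function names, the eigenvalues, and the 85% threshold are all hypothetical choices for illustration.

```python
def explained_variance(eigenvalues):
    """Per-component and cumulative share of total variance, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    per_component = [ev / total for ev in ordered]
    cumulative = []
    running = 0.0
    for share in per_component:
        running += share
        cumulative.append(running)
    return per_component, cumulative


def components_needed(cumulative, threshold):
    """Smallest number of leading components reaching the variance threshold."""
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k
    return len(cumulative)


# Made-up eigenvalues for illustration.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
per_component, cumulative = explained_variance(eigenvalues)
k = components_needed(cumulative, threshold=0.85)
```

A fixed threshold is only one way to read the plot; looking for the "elbow" where the per-component share levels off is another.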

12
Q

calibration plot / reliability diagram

A

for checking performance of probabilistic classification models

for k classes, pick the class of interest, C (one plot per class)

define a bin as a probability range [p-low,p-high]

group all instances in the test set with class C predicted probability in [p-low,p-high] into set S

count the number, n, of instances in S that are actually of class C

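n / |S| should fall approximately within [p-low,p-high] for a well-calibrated model; plotting n / |S| against the bin midpoint for every bin, the points should lie near the diagonal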
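
The binning steps above can be sketched as follows, assuming equal-width bins over [0, 1] and a 0/1 indicator of whether each instance is actually of class C. The function name `calibration_bins`, the bin count, and the toy predictions are hypothetical.

```python
def calibration_bins(probs, is_class_c, n_bins=5):
    """Per non-empty bin: (p_low, p_high, |S|, observed fraction n / |S|)."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(probs, is_class_c):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        hits[b] += y
    rows = []
    for b in range(n_bins):
        if counts[b]:  # skip empty bins to avoid dividing by zero
            rows.append((b / n_bins, (b + 1) / n_bins,
                         counts[b], hits[b] / counts[b]))
    return rows


# Toy predictions and labels, invented for illustration.
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.50, 0.55, 0.70, 0.75, 0.90, 0.95, 1.00]
is_c = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1]
rows = calibration_bins(probs, is_c)
```

Each row gives one point of the plot: the bin on the x-axis and the observed fraction on the y-axis. With only two or three instances per bin, as here, the fractions are too noisy to judge calibration; real use needs far more data per bin.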

13
Q

calibration histograms / heat maps

A

for checking performance of probabilistic classification models

for 2 classes
* group the test set by actual outcome: positive instances and negative instances
* for each group, plot a histogram of the predicted probabilities for (say) the negative outcome
* the histogram for actual positives should be skewed toward 0 (little probability of a negative outcome), and the histogram for actual negatives should be skewed toward 1

for > 2 classes
* construct a per-instance heat map, usually with eg rows grouped by true class
* for each instance, each of k categories gets a color/intensity, reflecting probability
* for each instance true class group, probabilities should be clustered around the given class

14
Q

scatterplot matrix

A

shows pairwise correlations between all numeric predictors; (note a feature plot may include scatterplots, but is more general)

15
Q

predictor plot

A

plots each predictor against target variable (varies depending on categoric / numeric types)

16
Q

classifier probability plot

A

for categorical outcomes, a trace-plane bar plot, by predictor, of the frequency of each outcome factor level

17
Q

mosaic plot

A

a categorical predictor vs categorical outcome trace plane plot; shows instance frequencies over the discrete 2D space

18
Q

volcano plot

A
  • visualize statistical significance against a related variable
  • negative log of p-value (from t-test or ANOVA eg; typically -log10(p)) on y-axis, so more significant results plot higher
  • eg
    • look over several one-hot-encoded predictors, comparing means of outcome variable between the presence / absence of a given (binary) predictor
    • plot on x-axis the difference between the means, and the t-test result -log10(p) on the y-axis
19
Q

scree plot

A

fundamental to PCA, showing total variance accounted for as a function of (ordered) PCA components

20
Q

added variable plot

A
  • show partial relationship between single predictor and outcome, by controlling for other predictors (attempts to address the loss of information in single predictor trace-plane plots)
  • eg, given Y ~ X1 + X2 + X3
    • do 2 linear sub-regressions:
      • X1 ~ X2 + X3
      • Y ~ X2 + X3
    • plot residual coordinate pairs, (r1,r2), arising from each of the 2 sub-regressions
  • the question being asked: can “the part of X1 not contained in X2, X3 explain the part of Y not contained in X2, X3”?
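
A simplified sketch of the two sub-regressions, assuming a single control variable X2 instead of X2 + X3 so that plain one-variable least squares suffices. The helper names `fit_line` and `residuals` and the noise-free toy data are hypothetical. The slope of the residual-on-residual fit equals the coefficient of X1 in the full regression (the Frisch-Waugh-Lovell result).

```python
def fit_line(x, y):
    """Ordinary least squares for y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def residuals(x, y):
    """Residuals of y after regressing it on x."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Toy data, invented for illustration; the outcome has no noise.
x1 = [1.0, 2.0, 4.0, 3.0, 6.0, 5.0]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]

r1 = residuals(x2, x1)  # part of X1 not explained by X2
r2 = residuals(x2, y)   # part of Y not explained by X2
points = list(zip(r1, r2))  # coordinates of the added variable plot
_, slope = fit_line(r1, r2)
```

On this data the slope recovers the coefficient 2 used to build `y`. With several control variables, each sub-regression becomes a multiple regression, but the plotted residual pairs are built the same way.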
21
Q

calibration plots

A
  • used for classifiers that produce class probabilities
  • eg if we group together all held-out (eg test set) predictions with ca 20% probability of class A, then about 20% of those samples should be in class A (checking labels)
  • can be used to compare model performance, and/or to create post-model-fitting steps to “correct” the probability scores (eg with another model (like Bayes rule))