graphs Flashcards
receiver operating characteristic (ROC) graph
usually for ranking classifiers (usually binary); for each cutoff n (accepting the n most likely instances as "positive"), over a test set of size N, create the confusion matrix; plot the false positive rate on the x axis (false positives among the top n, divided by the total number of negatives in the test set) and the true positive rate on the y axis (true positives among the top n, divided by the total number of positives in the test set); plot a point for every cutoff n (a computation sketch follows the ROC space details below)
features:
- ROC graphs remove class priors (eg class proportion imbalances), isolating the model's predictive power; note the caveat that priors still matter in practice ("if there are many negative examples, even a moderate false alarm rate can be unmanageable")
- do not factor in costs/benefits
- for ranking classifiers, the area under the ROC curve (AUC) is a useful summary statistic; it is equivalent to the Mann-Whitney-Wilcoxon statistic (the probability that a randomly chosen positive is ranked above a randomly chosen negative), and it relates to the Gini coefficient by a "minor algebraic transformation": Gini = 2 * AUC - 1, i.e. twice the area between the curve and the main diagonal
ROC space details:
- a classifier near the lower-left corner (near the x-axis on the left side, above the main diagonal) is interpreted as "conservative": it makes positive predictions only with strong evidence, so it makes few false positive errors (but sacrifices true positives in the process)
- a classifier near the upper-right corner (above the main diagonal, on the right-hand side, with y close to 1) is interpreted as "permissive": it makes positive classifications on "weak evidence," so it catches most positives but also commits many false positives
- the diagonal line from (0,0) to (1,1) represents the policy of randomly guessing a class (in a Bernoulli sense); a classifier that guesses positive half the time (coin-flip-wise) converges to (0.5,0.5); one that guesses positive 90% of the time converges to (0.9,0.9)
- any performance in the triangular half below and to the right of the (0,0) to (1,1) diagonal would be "worse than random guessing"
- a ranking model (usually) starts with everything classified as negative (ie we accept the top "zero" entries of the test set in ranking order), so it starts at the lower-left corner of ROC space, (0,0): nothing is ranked as positive, so both the true and false positive rates are 0 (highly conservative)
- at the other extreme (n = N), the ranking model classifies everything as positive, putting the point at the upper-right corner of ROC space, (1,1) (highly permissive)
- for a good ranking classifier, we expect the curve to pass close to the ideal point, the upper-left corner of ROC space (0,1), where every positive in the test set is ranked above every negative: a 100% true positive rate with no false positives
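A minimal sketch of the computation described above, assuming a list of classifier scores and true 0/1 labels (the function name and toy data are illustrative):

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (FPR, TPR) points by sweeping the cutoff n over a ranking."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # most likely positive first
    ranked = np.asarray(labels, dtype=int)[order]
    total_pos = ranked.sum()
    total_neg = len(ranked) - total_pos
    points = [(0.0, 0.0)]      # n = 0: nothing accepted as positive (lower-left corner)
    tp = fp = 0
    for y in ranked:           # accept the top n instances as positive, n = 1..N
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points              # the last point is (1, 1): everything accepted as positive

print(roc_points(scores=[0.9, 0.8, 0.7, 0.4, 0.2], labels=[1, 1, 0, 1, 0]))
```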
profit curve
with a ranking classifier, create the confusion matrix for accepting the n most likely instances as positive, at each cutoff n; compute the profit/loss from the confusion matrix using a cost/benefit matrix; plot profit/loss as a function of n (or of n/N)
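A minimal sketch assuming a simple cost/benefit structure; the benefit and cost values below are made up for illustration:

```python
import numpy as np

BENEFIT_TP = 99.0   # assumed profit from targeting a true responder
COST_FP = -1.0      # assumed cost of targeting a non-responder
# true negatives and false negatives are assumed to contribute nothing here

def profit_curve(scores, labels):
    """Cumulative profit as a function of n, the number of top-ranked instances targeted."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(labels, dtype=int)[order]
    per_instance = np.where(ranked == 1, BENEFIT_TP, COST_FP)
    return np.concatenate([[0.0], np.cumsum(per_instance)])   # profit for n = 0..N

print(profit_curve(scores=[0.9, 0.8, 0.7, 0.4, 0.2], labels=[1, 0, 1, 0, 0]))
# plot these values against n and pick the cutoff that maximizes profit
```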
2x2 classification table
frequency matrix for binary classification problems
usually,
predictions are on rows: (1) positive, (2) negative
true classes are on columns: (1) positive, (2) negative
rates are column-based:
- sensitivity aka recall, true positive rate, proportion of positive outcomes predicted positive
- specificity aka true negative rate, proportion of negative outcomes predicted negative (not the same as precision, which is row-based: the proportion of predicted positives that are actually positive)
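A tiny worked example of the column-based rates, using made-up counts, with row-based precision included for contrast:

```python
# predictions on rows, true classes on columns (counts are illustrative)
tp, fn = 40, 10   # the 50 actual positives, split into predicted positive / predicted negative
fp, tn = 5, 45    # the 50 actual negatives, split into predicted positive / predicted negative

sensitivity = tp / (tp + fn)   # recall / true positive rate (actual-positive column)
specificity = tn / (tn + fp)   # true negative rate (actual-negative column)
precision   = tp / (tp + fp)   # row-based: predicted positives that are actually positive

print(sensitivity, specificity, precision)   # 0.8  0.9  ~0.89
```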
confusion matrix
a frequency matrix for classification problems; each row is a model (class) prediction and each column the actual class; the more the counts concentrate on the diagonal, the better the model; especially useful for imbalanced classes, where it gives more information than a single accuracy number
learning curve
for a given model and a fixed holdout set, plot the model's accuracy on the holdout set as a function of training set size; typically plateaus as the marginal gain from more data goes to 0
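A minimal sketch with scikit-learn, assuming a classifier and a fixed holdout split (the dataset and model choice are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

for frac in np.linspace(0.1, 1.0, 10):            # increasing training set sizes
    n = int(frac * len(X_train))
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_hold, model.predict(X_hold))   # accuracy on the fixed holdout set
    print(f"{n:5d} training instances -> holdout accuracy {acc:.3f}")
```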
gini coefficient
a general measure of dispersion: twice the area between the Lorenz curve and the diagonal line of equality; eg plot the cumulative share of wealth held by the population, with the population ordered from poorest to richest; if everyone had the same wealth, the Gini coefficient = 0
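A minimal sketch of computing the Gini coefficient from a vector of holdings via the Lorenz curve (the data are illustrative):

```python
import numpy as np

def gini(values):
    """Gini coefficient: twice the area between the Lorenz curve and the line of equality."""
    v = np.sort(np.asarray(values, dtype=float))              # order the population by increasing wealth
    lorenz = np.concatenate([[0.0], np.cumsum(v)]) / v.sum()  # cumulative share of wealth (y)
    pop = np.linspace(0.0, 1.0, len(v) + 1)                   # cumulative share of population (x)
    area_under_lorenz = np.sum((lorenz[1:] + lorenz[:-1]) / 2 * np.diff(pop))  # trapezoid rule
    return 2 * (0.5 - area_under_lorenz)

print(gini([1, 1, 1, 1]))     # 0.0: everyone holds the same wealth
print(gini([0, 0, 0, 100]))   # 0.75 here; approaches 1 for a large population with one person holding all
```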
fitting graph
typically the x axis is "model complexity" and the y axis is model accuracy on (a) the training data and (b) a holdout set; the "sweet spot" is where the two plots are about to diverge from each other: past it, training accuracy keeps increasing (overfitting) while holdout accuracy starts to fall
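A minimal sketch using decision-tree depth as an illustrative "model complexity" axis (dataset and model are assumptions, not from the source):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

for depth in range(1, 15):   # tree depth plays the role of "model complexity"
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)   # keeps rising with complexity (overfitting)
    hold_acc = model.score(X_hold, y_hold)      # peaks near the sweet spot, then declines
    print(f"depth={depth:2d}  train={train_acc:.3f}  holdout={hold_acc:.3f}")
```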
cumulative response curve and lift curve
for a ranked classifier at cutoff n with a test set of size N, plots the true positive rate on the y-axis (true positives among the top n, divided by the total number of positives in the test set) against the proportion of the population targeted as the class of relevance on the x-axis (i.e. n/N); a computation sketch follows the features below
features:
- similar to the ROC curve, the greater the "lift" (rise above the main diagonal), the better the performance
- a true lift curve plots, at each x value, the ratio of the cumulative response curve's value to the diagonal (i.e. the improvement over random targeting)
- cumulative response curves are not entirely independent of class priors: the class priors determine the potential rate of increase of the curve (unlike with ROC curves)
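A minimal sketch of the cumulative response and lift computations, assuming classifier scores and 0/1 labels (names and toy data are illustrative):

```python
import numpy as np

def cumulative_response_and_lift(scores, labels):
    """For each cutoff n, return (fraction of population targeted, true positive rate, lift)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(labels, dtype=int)[order]
    total_pos, N = ranked.sum(), len(ranked)
    rows = []
    for n in range(1, N + 1):
        targeted = n / N                        # x-axis: proportion of the population targeted
        tpr = ranked[:n].sum() / total_pos      # y-axis of the cumulative response curve
        rows.append((targeted, tpr, tpr / targeted))   # lift: ratio of the curve to the diagonal
    return rows

for x, tpr, lift in cumulative_response_and_lift([0.9, 0.8, 0.6, 0.4, 0.1], [1, 1, 0, 0, 1]):
    print(f"targeted {x:.0%}: cumulative response {tpr:.2f}, lift {lift:.2f}")
```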
dendrogram
a 2-D visualization of hierarchical (progressive) clustering; instances are on the x-axis, and the distance at which clusters merge (low to high) is on the y-axis; the instances are ordered so that members of the initial clusters are immediate neighbors, recursing on this ordering scheme as clustering proceeds (i.e. at a given height / level on the dendrogram, the ordering scheme applies to subgroups of instances)
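A minimal sketch using SciPy's hierarchical clustering utilities (the two-blob data set is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])   # two loose groups

Z = linkage(X, method="ward")   # merge history: each row records one merge and its distance
dendrogram(Z)                   # instances on the x-axis, merge distance on the y-axis
plt.show()
```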
entropy graph
re segmentation and information gain: a visualization of the weighted sum of entropies resulting from a given segmentation scheme; each segment's width on the x axis is the proportion (0 to 1) of instances it contains, and its height is the entropy of the class labels within that segment (so a kind of bar plot); low height means low entropy (a "pure," well-classified segment), and the total shaded area is the weighted average entropy after the split
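A minimal sketch of the weighted-sum-of-entropies computation (and the resulting information gain) for a made-up parent set and segmentation:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def weighted_entropy(segments):
    """Weighted sum of per-segment entropies; the weights are the segment proportions (bar widths)."""
    total = sum(len(s) for s in segments)
    return sum(len(s) / total * entropy(s) for s in segments)

parent = ["y"] * 6 + ["n"] * 6
segments = [["y"] * 5 + ["n"], ["y"] + ["n"] * 5]        # a candidate split of the parent
print(entropy(parent))                                    # 1.0
print(weighted_entropy(segments))                         # ~0.65 (the shaded area of the graph)
print(entropy(parent) - weighted_entropy(segments))       # information gain ~0.35
```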
scree plot
used (at least) in the context of PCA; shows the variance explained by each principal component in decreasing order (or, cumulatively, the percent of total variance retained as a function of the number of leading components kept); it helps decide how many PCA components to retain for modeling purposes, e.g. by looking for an "elbow"
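A minimal sketch with scikit-learn's PCA, plotting both per-component and cumulative variance (the synthetic correlated data are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # correlated features

var = PCA().fit(X).explained_variance_ratio_   # per-component share of total variance, decreasing

components = range(1, len(var) + 1)
plt.plot(components, var, marker="o", label="per component")
plt.plot(components, np.cumsum(var), marker="s", label="cumulative")
plt.xlabel("number of leading components retained")
plt.ylabel("share of total variance")
plt.legend()
plt.show()   # look for an elbow, or a cumulative threshold (e.g. 90%), to pick how many to keep
```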
calibration plot / reliability diagram
for checking performance of probabilistic classification models
for k classes, pick the class of interest, C (one plot per class)
define a bin as a probability range [p-low,p-high]
group all instances in the test set with class C predicted probability in [p-low,p-high] into set S
count the number, n, of instances in S that are actually of class C
n / |S| should be approximately within [p-low,p-high]; plot n / |S| against the bin midpoint for each bin; a well-calibrated model's points lie close to the diagonal
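A minimal sketch of the binning procedure above for a single class of interest, assuming a vector of predicted probabilities and 0/1 indicators of class C (the bin count and toy data are assumptions):

```python
import numpy as np

def reliability_points(pred_prob, is_class_c, n_bins=10):
    """For each probability bin [lo, hi), return (bin midpoint, observed fraction of class C, |S|)."""
    pred_prob = np.asarray(pred_prob, dtype=float)
    is_class_c = np.asarray(is_class_c, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        S = (pred_prob >= lo) & ((pred_prob < hi) | (hi == 1.0))   # last bin is closed at 1.0
        if S.sum() == 0:
            continue                      # skip empty bins
        observed = is_class_c[S].mean()   # n / |S|: should fall roughly inside [lo, hi]
        points.append(((lo + hi) / 2, observed, int(S.sum())))
    return points

# toy example: outcomes drawn with exactly the predicted probabilities, so calibration is perfect
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)                     # predicted probability of class C
y = (rng.uniform(size=1000) < p).astype(int)   # actual class C membership
for mid, obs, count in reliability_points(p, y):
    print(f"bin center {mid:.2f}: observed {obs:.2f} over {count} instances")
```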
calibration histograms / heat maps
for checking performance of probabilistic classification models
for 2 classes
* group test set into true positive and true negative outcomes
* for each group, plot a histogram of the predicted probability of (say) the negative class
* the true positive histogram should be skewed toward 0 (little predicted probability of a negative outcome), and the true negative histogram should be skewed toward 1 (see the sketch after this card)
for > 2 classes
* construct a per-instance heat map, usually with eg rows grouped by true class
* for each instance, each of k categories gets a color/intensity, reflecting probability
* within each true-class group, the predicted probability mass should be concentrated on that group's class
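A minimal sketch of the two-class histogram version, using made-up predicted probabilities for the negative class (the beta-distributed samples are purely illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# illustrative predicted probabilities of the NEGATIVE class, grouped by actual outcome
p_neg_for_true_pos = rng.beta(2, 8, size=500)   # true positives: should pile up near 0
p_neg_for_true_neg = rng.beta(8, 2, size=500)   # true negatives: should pile up near 1

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].hist(p_neg_for_true_pos, bins=20)
axes[0].set_title("true positives")
axes[1].hist(p_neg_for_true_neg, bins=20)
axes[1].set_title("true negatives")
for ax in axes:
    ax.set_xlabel("predicted probability of the negative class")
plt.show()
```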
scatterplot matrix
a grid of pairwise scatterplots between all numeric predictors, used to assess pairwise relationships and correlations; (note a feature plot may include scatterplots, but is more general)
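A minimal sketch with pandas' scatter_matrix (the data frame is illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 0.8 + rng.normal(scale=0.5, size=200)   # correlated with a
df["c"] = rng.normal(size=200)                               # independent noise

scatter_matrix(df, diagonal="hist")   # one scatterplot per pair of numeric predictors
plt.show()
```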
predictor plot
plots each predictor against the target variable (the plot type varies depending on whether the predictor and target are categorical or numeric)