Interpreting/Visualising Data & Models Flashcards
error analysis
why is it that a given model has misclassified an
instance in the way it has?
model interpretability
why is it that a given model has classified an instance in
the way it has?
How to do error analysis?
- identifying different “classes” of error that the system makes
- hypothesising as to what has caused the different errors, and testing those hypotheses against the actual data
- often feeding those hypotheses back into feature/model engineering to see if the model can be improved (performing error analysis over the dev data only)
- quantifying whether (for different classes) it is a question of data quantity/sparsity, or something more fundamental
Remember to:
- test hypotheses against your data
- where possible, use the model to guide the error analysis
- useful starting points: (a) a confusion matrix; (b) a random subsample of misclassified instances
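A minimal sketch of these two starting points, assuming scikit-learn and a trained classifier clf with held-out dev data X_dev/y_dev (all illustrative names):

```python
# Sketch of a basic error-analysis starting point (assumes scikit-learn and
# a trained classifier `clf`; X_dev/y_dev are illustrative placeholder names).
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_dev)

# (a) confusion matrix: which classes get confused with which?
print(confusion_matrix(y_dev, y_pred))

# (b) a random subsample of misclassified instances to inspect by hand
wrong = np.where(np.asarray(y_pred) != np.asarray(y_dev))[0]
rng = np.random.default_rng(0)
for i in rng.choice(wrong, size=min(10, len(wrong)), replace=False):
    print(X_dev[i], "gold:", y_dev[i], "predicted:", y_pred[i])
```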
hyperparameters
parameters which define/bias/constrain the learning process
parameters
what is learned when a
given learner with a given set of hyperparameters is applied
to a particular training dataset
model, parameters & hyperparameters
A model (trained with a given set of hyperparameters) can then be interpreted relative to the parameters associated with a given test instance
NN (Nearest Neighbour)
Hyperparameters:
• k (neighbourhood size)
• distance/similarity metric
• feature weighting/selection …
Parameters:
• there are none, as the model is “lazy” and doesn’t abstract
away from the training instances in any way
Interpretation:
• relative to the training instances that give rise to a given classification, and their geometric distribution
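A minimal k-NN sketch, assuming scikit-learn; X_train, y_train and x_test are illustrative placeholders:

```python
# Sketch: interpreting a k-NN prediction via the training instances behind it.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # hyperparameters: k, metric
knn.fit(X_train, y_train)   # "lazy": no parameters abstracted from the training data

# Interpretation: the neighbours (and their distances) that drive the prediction
dist, idx = knn.kneighbors([x_test])
print(knn.predict([x_test]))
print(list(zip(idx[0], dist[0], np.asarray(y_train)[idx[0]])))  # index, distance, label
```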
NP (Nearest Prototype)
Hyperparameters:
• distance/similarity metric used to calculate the prototype,
and distance to each prototype in classification
• feature weighting/selection …
Parameters:
• the prototype for each class
• size = O(|C||F|) (C = set of classes, F = set of features)
Interpretation:
• relative to the geometric distribution of the prototypes, and distance to each for a given test instance
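A minimal nearest-prototype sketch using scikit-learn's NearestCentroid (X_train, y_train and x_test are illustrative names):

```python
# Sketch: nearest-prototype classification, with one prototype (centroid) per class.
import numpy as np
from sklearn.neighbors import NearestCentroid

np_clf = NearestCentroid(metric="euclidean")   # hyperparameter: distance metric
np_clf.fit(X_train, y_train)

# Parameters: one prototype per class, i.e. O(|C||F|) values
print(np_clf.centroids_.shape)

# Interpretation: distance from the test instance to each prototype
print(np.linalg.norm(np_clf.centroids_ - np.asarray(x_test), axis=1))
print(np_clf.predict([x_test]))
```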
NB (Naive Bayes)
Hyperparameters:
• the choice of smoothing method
• optionally the choice of distribution used to model the
features (e.g. binomial per feature, or multinomial over all
features)
Parameters:
• the class priors and the conditional probability for each feature–value–class combination
• size = O(|C| + |C||FV|) (C = set of classes, FV = set of feature–value pairs)
Interpretation:
• usually based on the most positively-weighted features
associated with a given instance
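A minimal sketch of this kind of interpretation for a multinomial NB over count features, assuming scikit-learn; X_train, y_train, x_test and feature_names are illustrative:

```python
# Sketch: finding the most positively-weighted features for an NB prediction.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)   # hyperparameter: (Laplace) smoothing
nb.fit(X_train, y_train)

# Parameters: class log-priors and per-class conditional log-probabilities
print(nb.class_log_prior_.shape, nb.feature_log_prob_.shape)

# Interpretation: features present in x_test that contribute most to the
# predicted class's score (x_test assumed to be a 1-D dense count vector)
x = np.asarray(x_test)
pred = nb.predict([x])[0]
c = list(nb.classes_).index(pred)
contrib = nb.feature_log_prob_[c] * x
top = np.argsort(contrib)[::-1][:10]
print([(feature_names[i], contrib[i]) for i in top if x[i] > 0])
```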
Decision Tree
Hyperparameters:
• the choice of function used for attribute selection
• the convergence criterion
Parameters:
• the decision tree itself
• worst-case size = O(V|Tr|), typical size = O(|FV|) (V = average branching factor, Tr = set of training instances, FV = set of feature–value pairs)
Interpretation:
• based directly on the path through the decision tree
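A minimal sketch of reading off the decision path for one instance, assuming scikit-learn (X_train, y_train, x_test and feature_names are illustrative):

```python
# Sketch: interpreting a decision tree prediction via its decision path.
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(criterion="entropy",   # attribute-selection function
                            min_samples_leaf=5)    # one possible convergence criterion
dt.fit(X_train, y_train)

# Parameters: the tree itself
print(export_text(dt, feature_names=list(feature_names)))

# Interpretation: the path through the tree taken by x_test
node_indicator = dt.decision_path([x_test])
print(node_indicator.indices)       # node ids visited on the way to the leaf
print(dt.predict([x_test]))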
SVM
Hyperparameters:
• the penalty term C for soft-margin SVMs
• feature value scaling
• the choice of kernel (and any hyperparameters associated
with it)
Parameters:
• vector of feature weights + bias term
• size = O(|C||F|), assuming a one-vs-rest SVM (C = set of classes, F = set of features)
Interpretation:
• the absolute value of the weight associated with each
non-zero feature in a given instance provides an indication
of its relative importance in classification (modulo kernel
projection)
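A minimal sketch with a linear one-vs-rest SVM, assuming scikit-learn; X_train, y_train, x_test and feature_names are illustrative names:

```python
# Sketch: using the learned weight vector of a linear SVM to see which
# features of an instance mattered most for its classification.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

scaler = StandardScaler().fit(X_train)   # hyperparameter: feature value scaling
svm = LinearSVC(C=1.0)                   # hyperparameter: penalty term C
svm.fit(scaler.transform(X_train), y_train)

# Parameters: one weight vector + bias per class (one-vs-rest), i.e. O(|C||F|)
print(svm.coef_.shape, svm.intercept_.shape)

# Interpretation: |weight| of the non-zero features in this instance
x = scaler.transform([x_test])[0]
pred = svm.predict([x])[0]
c = list(svm.classes_).index(pred) if svm.coef_.shape[0] > 1 else 0
importance = np.abs(svm.coef_[c] * x)
top = np.argsort(importance)[::-1][:10]
print([(feature_names[i], importance[i]) for i in top if x[i] != 0])
```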
Random Forest
Hyperparameters:
• number of trees B (can be tuned, e.g. based on “out-of-bag” error rate)
• feature sub-sample size (e.g. log2 |F| + 1)
Interpretation:
• logic behind predictions on individual instances can be
tediously followed through the various trees
• logic behind overall model: impossible to interpret
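A minimal sketch of tuning the number of trees via the out-of-bag error rate, assuming scikit-learn (X_train and y_train are illustrative):

```python
# Sketch: Random Forest hyperparameters, with OOB error used to pick B.
from sklearn.ensemble import RandomForestClassifier

for B in (10, 50, 100, 500):                        # number of trees B
    rf = RandomForestClassifier(n_estimators=B,
                                max_features="log2",  # feature sub-sample size
                                oob_score=True,
                                random_state=0)
    rf.fit(X_train, y_train)
    print(B, "OOB error:", 1 - rf.oob_score_)
```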
What if there are more than two attributes?
Use dimensionality reduction to reduce the feature space
Cons of dimensionality reduction
Dimensionality reduction cannot faithfully reproduce the original data.
Feature selection is one form of dimensionality reduction, but it typically reproduces the original data less faithfully than PCA (described below).
PCA (Principal Component Analysis):
transforms the inter-related attributes into a new set of variables while retaining as much of the variation in the data as possible. After transformation, the new variables (principal components, PCs) are uncorrelated and ordered, so that the first few retain most of the variation in the original data.
Before applying PCA to a dataset, normalise the data so that each attribute has mean 0.
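A minimal PCA sketch, assuming scikit-learn; X is an illustrative data matrix:

```python
# Sketch: centre/normalise the data, then project onto the first two PCs.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_centred = StandardScaler().fit_transform(X)   # mean 0 (and unit variance) per attribute
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_centred)

# The PCs are uncorrelated and ordered by the variance they explain
print(pca.explained_variance_ratio_)
```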