Interpreting/Visualising Data & Models Flashcards

1
Q

error analysis

A

why is it that a given model has misclassified an instance in the way it has?

2
Q

model interpretability

A

why is it that a given model has classified an instance in the way it has?

3
Q

How to do error analysis?

A
  • identifying different “classes” of error that the system makes
  • hypothesising as to what has caused the different errors, and testing those hypotheses against the actual data
  • often feeding those hypotheses back into feature/model engineering to see whether the model can be improved (performing error analysis over the dev data only)
  • quantifying whether (for the different classes of error) it is a question of data quantity/sparsity, or something more fundamental

Remember to:

  1. test hypotheses against your data
  2. where possible, use the model to guide the error analysis

Useful starting points: (a) a confusion matrix; (b) a random subsample of misclassified instances.
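A minimal sketch of both starting points in Python (the toy y_true/y_pred arrays and the label set are assumptions for illustration):

```python
# (a) a confusion matrix, and (b) a random subsample of misclassified
# instances to inspect by hand.
import numpy as np
from sklearn.metrics import confusion_matrix

# toy dev-set gold labels and predictions (invented for illustration)
y_true = np.array(["pos", "pos", "neg", "neg", "pos", "neg"])
y_pred = np.array(["pos", "neg", "neg", "pos", "pos", "neg"])

# (a) rows = gold classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred, labels=["pos", "neg"]))

# (b) a small random sample of the errors, for manual inspection
rng = np.random.default_rng(0)
wrong = np.flatnonzero(y_true != y_pred)
for i in rng.choice(wrong, size=min(2, len(wrong)), replace=False):
    print(f"instance {i}: gold={y_true[i]}, predicted={y_pred[i]}")
```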

4
Q

hyperparameter

A

parameters which define/bias/constrain the learning process

5
Q

parameters

A

what is learned when a given learner with a given set of hyperparameters is applied to a particular training dataset

6
Q

model, parameters & hyperparameters

A
A model (trained with a given set of hyperparameters) can then be interpreted relative to the parameters associated with a given test instance

7
Q

NN (nearest neighbour)

A

Hyperparameters:
• k (neighbourhood size)
• distance/similarity metric
• feature weighting/selection …

Parameters:
• there are none, as the model is “lazy” and doesn’t abstract away from the training instances in any way

Interpretation:
• relative to the training instances that give rise to a given classification, and their geometric distribution
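A minimal sketch of the laziness, using scikit-learn on invented toy data: fit() merely stores the training instances, and kneighbors() recovers the neighbours behind a given prediction:

```python
# k-NN is "lazy": fit() just stores the instances; a prediction is
# interpreted via the neighbours that produced it.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

# hyperparameters: neighbourhood size k and the distance metric
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

x_test = np.array([[0.8, 0.9]])
print(knn.predict(x_test))          # -> [1]
dist, idx = knn.kneighbors(x_test)  # the k training instances behind it
print(idx[0], y_train[idx[0]], dist[0])
```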

8
Q

NP (nearest prototype)

A

Hyperparameters:
• distance/similarity metric used to calculate the prototype, and the distance to each prototype in classification
• feature weighting/selection …

Parameters:
• the prototype for each class
• size = O(|C||F|)
• C = set of classes
• F = set of features

Interpretation:
• relative to the geometric distribution of the prototypes, and distance to each for a given test instance
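A minimal sketch assuming the prototype is the class centroid and Euclidean distance is used for both steps (the toy data is invented):

```python
# Nearest Prototype: the only parameters are one prototype per class,
# i.e. a |C| x |F| matrix -- size O(|C||F|).
import numpy as np

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.8, 1.2]])
y_train = np.array([0, 0, 1, 1])

# parameters: the centroid of each class
classes = np.unique(y_train)
prototypes = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

# interpretation: a test instance's distance to each prototype
x = np.array([0.9, 0.9])
dists = np.linalg.norm(prototypes - x, axis=1)
print(classes[np.argmin(dists)], dists)  # -> class 1, since it is nearest
```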

9
Q

NB (naive Bayes)

A

Hyperparameters:
• the choice of smoothing method
• optionally, the choice of distribution used to model the features (e.g. binomial per feature, or multinomial over all features)

Parameters:
• the class priors and the conditional probability for each feature–value–class combination
• size = O(|C| + |C||FV|)
• C = set of classes
• FV = set of feature–value pairs

Interpretation:
• usually based on the most positively-weighted features associated with a given instance
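A minimal sketch with scikit-learn's BernoulliNB on invented binary features: the fitted parameters are exactly the class priors plus one conditional probability per feature–value–class combination, and per-feature log-probability contributions give the feature "weights" for a given instance:

```python
# Naive Bayes: parameters = class priors + P(feature value | class),
# i.e. size O(|C| + |C||FV|).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X_train = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y_train = np.array([0, 0, 1, 1])

nb = BernoulliNB(alpha=1.0)  # hyperparameter: Laplace (add-one) smoothing
nb.fit(X_train, y_train)

print(np.exp(nb.class_log_prior_))   # the class priors
print(np.exp(nb.feature_log_prob_))  # P(feature = 1 | class), per pair

# interpretation: each feature's log-probability contribution per class;
# the most positively-weighted features explain the classification
x = np.array([1, 0, 1])
p = np.exp(nb.feature_log_prob_)
print(x * np.log(p) + (1 - x) * np.log(1 - p))
```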

10
Q

Decision Tree

A

Hyperparameters:
• the choice of function used for attribute selection
• the convergence criterion

Parameters:
• the decision tree itself
• worst-case size = O(V|Tr|)
• typical size = O(|FV|)
• V = average branching factor
• Tr = set of training instances
• FV = set of feature–value pairs

Interpretation:
• based directly on the path through the decision tree
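A minimal sketch in scikit-learn (the toy data, a Boolean AND, is invented): the fitted tree can be printed in full, and decision_path() recovers the root-to-leaf path for a given instance:

```python
# Decision tree: interpretation = the path of feature tests an instance
# satisfies on its way from the root to a leaf.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([0, 0, 0, 1])  # AND of the two features

# hyperparameters: the attribute-selection function and a stopping rule
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=["f0", "f1"]))  # the whole tree
path = tree.decision_path(np.array([[1, 1]]))
print(path.indices)  # the node ids on this instance's path
```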

11
Q

SVM

A

Hyperparameters:
• the penalty term C for soft-margin SVMs
• feature value scaling
• the choice of kernel (and any hyperparameters associated with it)

Parameters:
• vector of feature weights + bias term
• size = O(|C||F|) (assuming one-vs-rest SVM)
• C = set of classes
• F = set of features

Interpretation:
• the absolute value of the weight associated with each non-zero feature in a given instance provides an indication of its relative importance in classification (modulo kernel projection)
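A minimal sketch with a linear one-vs-rest SVM in scikit-learn, on invented toy data: the learned parameters are just a weight vector plus a bias term, and |w_i * x_i| ranks an instance's non-zero features by importance:

```python
# Linear SVM: parameters = feature weights + bias; interpret a
# classification by the weights on the instance's non-zero features.
import numpy as np
from sklearn.svm import LinearSVC

X_train = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]])
y_train = np.array([0, 0, 1, 1])

svm = LinearSVC(C=1.0)  # hyperparameter: the soft-margin penalty C
svm.fit(X_train, y_train)

w, b = svm.coef_[0], svm.intercept_[0]  # parameters: weights + bias
x = np.array([0.9, 0.1])
for i in np.argsort(-np.abs(w * x)):    # most important features first
    if x[i] != 0:
        print(f"feature {i}: weight {w[i]:+.2f}, contribution {w[i]*x[i]:+.2f}")
```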

12
Q

Random Forest

A

Hyperparameters:
• number of trees B (can be tuned, e.g. based on the “out-of-bag” error rate; see the sketch below)
• feature sub-sample size (e.g. log2 |F| + 1)

Interpretation:
• the logic behind predictions on individual instances can be followed through the various trees, though only tediously
• the logic behind the overall model is impossible to interpret
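A minimal sketch of tuning B against the out-of-bag error in scikit-learn (the synthetic dataset and the values of B tried are assumptions):

```python
# Each tree is evaluated on the training instances left out of its
# bootstrap sample, so B can be tuned without a separate dev set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for B in (25, 50, 100):
    rf = RandomForestClassifier(n_estimators=B, oob_score=True,
                                random_state=0)
    rf.fit(X, y)
    print(B, 1 - rf.oob_score_)  # "out-of-bag" error rate
```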

13
Q

What if there are more than two attributes?

A

Use dimensionality reduction to reduce the feature space

14
Q

Cons of dimensionality reduction

A

Dimensionality reduction cannot faithfully reproduce the original data.

Feature selection is one form of dimensionality reduction, but it reproduces the original data less faithfully than transformation-based methods such as PCA (next card).

15
Q

PCA (Principal Component Analysis)

A

Eliminate the inter-related attributes while retaining as much of the variation in the data as possible. After transformation, the new set of variables (the principal components, or PCs) is uncorrelated and ordered, so that the first few retain most of the variation in the original data.

Before applying PCA to a dataset, normalise it first so that each attribute has mean 0.
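A minimal sketch with scikit-learn (the iris data and the choice of two components are assumptions): normalise first, then project and check how much of the variation the first PCs retain:

```python
# PCA after normalisation: project 4 attributes down to 2 uncorrelated,
# ordered principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 4 attributes

X_norm = StandardScaler().fit_transform(X)  # mean 0 (and unit variance)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_norm)

# PCs are ordered by the variation they retain; the first few dominate
print(pca.explained_variance_ratio_)
print(X_2d[:3])  # now low-dimensional enough to scatter-plot
```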

16
Q

Cons of PCA

A
  • captures linear relationships only
  • relies on an orthogonal transformation
  • assumes large variance = important (though this is good for noise removal); counter-example = the “pancake”, where we may discard a useful attribute simply because it has low variance
  • sensitive to the scale of the attributes
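A minimal sketch of the “pancake” counter-example (the toy data is invented): the high-variance attribute carries no signal, the low-variance one determines the class, and the first PC keeps the wrong one:

```python
# A "pancake" dataset: wide along an uninformative axis, thin along the
# axis that actually separates the classes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x_wide = rng.normal(scale=10.0, size=200)           # high variance, no signal
x_thin = np.where(np.arange(200) < 100, -0.5, 0.5)  # low variance, the class
X = np.column_stack([x_wide, x_thin])

pca = PCA(n_components=1)
pca.fit(X)
print(pca.components_)  # ~[[±1, 0]]: keeps x_wide, discards x_thin
```
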
17
Q

Why Visualization

A
  • Get to know the data
  • Detect outliers