Interpreting/Visualising Data & Models Flashcards
error analysis
why is it that a given model has misclassified an
instance in the way it has?
model interpretability
why is it that a given model has classified an instance in
the way it has?
How to do error analysis?
- identifying different “classes” of error that the system makes
- hypothesising as to what has caused the different errors, and testing those hypotheses against the actual data
- often feeding those hypotheses back into feature/model engineering to see if the model can be improved (performing error analysis over the dev data only)
- quantifying whether (for different classes) it is a question of data quantity/sparsity, or something more fundamental
Remember to:
- test hypotheses against your data
- where possible, use the model to guide the error analysis
- useful starting points: (a) a confusion matrix; (b) a random subsample of misclassified instances
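A minimal sketch of these two starting points, assuming scikit-learn and a trained classifier clf with held-out dev data X_dev/y_dev (all illustrative names):

```python
# Sketch of a basic error-analysis starting point (assumes scikit-learn and
# a trained classifier `clf`; X_dev/y_dev are illustrative placeholder names).
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_dev)

# (a) confusion matrix: which classes get confused with which?
print(confusion_matrix(y_dev, y_pred))

# (b) a random subsample of misclassified instances to inspect by hand
wrong = np.where(np.asarray(y_pred) != np.asarray(y_dev))[0]
rng = np.random.default_rng(0)
for i in rng.choice(wrong, size=min(10, len(wrong)), replace=False):
    print(X_dev[i], "gold:", y_dev[i], "predicted:", y_pred[i])
```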
hyperparameters
parameters which define/bias/constrain the learning process
parameters
what is learned when a
given learner with a given set of hyperparameters is applied
to a particular training dataset
model, parameters & hyperparameters
A model (trained with a given set of hyperparameters) can then be interpreted relative to the parameters associated with a given test instance
NN (Nearest Neighbour)
Hyperparameters:
• k (neighbourhood size)
• distance/similarity metric
• feature weighting/selection …
Parameters:
• there are none, as the model is “lazy” and doesn’t abstract
away from the training instances in any way
Interpretation:
• relative to the training instances that give rise to a given classification, and their geometric distribution
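A minimal k-NN sketch, assuming scikit-learn; X_train, y_train and x_test are illustrative placeholders:

```python
# Sketch: interpreting a k-NN prediction via the training instances behind it.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # hyperparameters: k, metric
knn.fit(X_train, y_train)   # "lazy": no parameters abstracted from the training data

# Interpretation: the neighbours (and their distances) that drive the prediction
dist, idx = knn.kneighbors([x_test])
print(knn.predict([x_test]))
print(list(zip(idx[0], dist[0], np.asarray(y_train)[idx[0]])))  # index, distance, label
```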
NP (Nearest Prototype)
Hyperparameters:
• distance/similarity metric used to calculate the prototype,
and distance to each prototype in classification
• feature weighting/selection …
Parameters:
• the prototype for each class
• size = O(|C||F|) (C = set of classes, F = set of features)
Interpretation:
• relative to the geometric distribution of the prototypes, and distance to each for a given test instance
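A minimal nearest-prototype sketch using scikit-learn's NearestCentroid (X_train, y_train and x_test are illustrative names):

```python
# Sketch: nearest-prototype classification, with one prototype (centroid) per class.
import numpy as np
from sklearn.neighbors import NearestCentroid

np_clf = NearestCentroid(metric="euclidean")   # hyperparameter: distance metric
np_clf.fit(X_train, y_train)

# Parameters: one prototype per class, i.e. O(|C||F|) values
print(np_clf.centroids_.shape)

# Interpretation: distance from the test instance to each prototype
print(np.linalg.norm(np_clf.centroids_ - np.asarray(x_test), axis=1))
print(np_clf.predict([x_test]))
```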
NB (Naive Bayes)
Hyperparameters:
• the choice of smoothing method
• optionally the choice of distribution used to model the
features (e.g. binomial per feature, or multinomial over all
features)
Parameters:
• the class priors and the conditional probability for each feature–value–class combination
• size = O(|C| + |C||FV|) (C = set of classes, FV = set of feature–value pairs)
Interpretation:
• usually based on the most positively-weighted features
associated with a given instance
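A minimal sketch of this kind of interpretation for a multinomial NB over count features, assuming scikit-learn; X_train, y_train, x_test and feature_names are illustrative:

```python
# Sketch: finding the most positively-weighted features for an NB prediction.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)   # hyperparameter: (Laplace) smoothing
nb.fit(X_train, y_train)

# Parameters: class log-priors and per-class conditional log-probabilities
print(nb.class_log_prior_.shape, nb.feature_log_prob_.shape)

# Interpretation: features present in x_test that contribute most to the
# predicted class's score (x_test assumed to be a 1-D dense count vector)
x = np.asarray(x_test)
pred = nb.predict([x])[0]
c = list(nb.classes_).index(pred)
contrib = nb.feature_log_prob_[c] * x
top = np.argsort(contrib)[::-1][:10]
print([(feature_names[i], contrib[i]) for i in top if x[i] > 0])
```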
Decision Tree
Hyperparameters:
• the choice of function used for attribute selection
• the convergence criterion
Parameters:
• the decision tree itself
• worst-case size = O(V|Tr|), typical size = O(|FV|) (V = average branching factor, Tr = set of training instances, FV = set of feature–value pairs)
Interpretation:
• based directly on the path through the decision tree
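A minimal sketch of reading off the decision path for one instance, assuming scikit-learn (X_train, y_train, x_test and feature_names are illustrative):

```python
# Sketch: interpreting a decision tree prediction via its decision path.
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(criterion="entropy",   # attribute-selection function
                            min_samples_leaf=5)    # one possible convergence criterion
dt.fit(X_train, y_train)

# Parameters: the tree itself
print(export_text(dt, feature_names=list(feature_names)))

# Interpretation: the path through the tree taken by x_test
node_indicator = dt.decision_path([x_test])
print(node_indicator.indices)       # node ids visited on the way to the leaf
print(dt.predict([x_test]))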
SVM
Hyperparameters:
• the penalty term C for soft-margin SVMs
• feature value scaling
• the choice of kernel (and any hyperparameters associated
with it)
Parameters:
• vector of feature weights + bias term
• size = O(|C||F|), assuming a one-vs-rest SVM (C = set of classes, F = set of features)
Interpretation:
• the absolute value of the weight associated with each
non-zero feature in a given instance provides an indication
of its relative importance in classification (modulo kernel
projection)
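A minimal sketch with a linear one-vs-rest SVM, assuming scikit-learn; X_train, y_train, x_test and feature_names are illustrative names:

```python
# Sketch: using the learned weight vector of a linear SVM to see which
# features of an instance mattered most for its classification.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

scaler = StandardScaler().fit(X_train)   # hyperparameter: feature value scaling
svm = LinearSVC(C=1.0)                   # hyperparameter: penalty term C
svm.fit(scaler.transform(X_train), y_train)

# Parameters: one weight vector + bias per class (one-vs-rest), i.e. O(|C||F|)
print(svm.coef_.shape, svm.intercept_.shape)

# Interpretation: |weight| of the non-zero features in this instance
x = scaler.transform([x_test])[0]
pred = svm.predict([x])[0]
c = list(svm.classes_).index(pred) if svm.coef_.shape[0] > 1 else 0
importance = np.abs(svm.coef_[c] * x)
top = np.argsort(importance)[::-1][:10]
print([(feature_names[i], importance[i]) for i in top if x[i] != 0])
```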
Random Forest
Hyperparameters:
• number of trees B (can be tuned, e.g. based on “out-of-bag” error rate)
• feature sub-sample size (e.g. log2 |F| + 1)
Interpretation:
• logic behind predictions on individual instances can be
tediously followed through the various trees
• logic behind overall model: impossible to interpret
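A minimal sketch of tuning the number of trees via the out-of-bag error rate, assuming scikit-learn (X_train and y_train are illustrative):

```python
# Sketch: Random Forest hyperparameters, with OOB error used to pick B.
from sklearn.ensemble import RandomForestClassifier

for B in (10, 50, 100, 500):                        # number of trees B
    rf = RandomForestClassifier(n_estimators=B,
                                max_features="log2",  # feature sub-sample size
                                oob_score=True,
                                random_state=0)
    rf.fit(X_train, y_train)
    print(B, "OOB error:", 1 - rf.oob_score_)
```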
What if there are more than two attributes?
Use dimensionality reduction to reduce the feature space
Cons of dimensionality reduction
Dimensionality reduction cannot faithfully reproduce the original data.
Feature selection is one form of dimensionality reduction, but it typically reproduces the original data less faithfully than PCA (described below).
PCA (Principal Component Analysis):
transforms the inter-related attributes into a new set of variables while retaining as much of the variation in the data as possible. After transformation, the new variables (principal components, PCs) are uncorrelated and ordered, so that the first few retain most of the variation in the original data.
Before applying PCA to a dataset, normalise the data so that each attribute has mean 0.
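A minimal PCA sketch, assuming scikit-learn; X is an illustrative data matrix:

```python
# Sketch: centre/normalise the data, then project onto the first two PCs.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_centred = StandardScaler().fit_transform(X)   # mean 0 (and unit variance) per attribute
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_centred)

# The PCs are uncorrelated and ordered by the variance they explain
print(pca.explained_variance_ratio_)
```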