Week 8 Flashcards

1
Q

What ML interpretation method separates the explanations from the machine learning model?

A

Model-agnostic interpretation methods

2
Q

What is the advantage of using model-agnostic interpretation methods over model-specific ones?

A

Their flexibility. The same method can be used for any type of model.

3
Q

What is the disadvantage of using only interpretable models instead of using model-agnostic interpretation methods?

A

Predictive performance is lost compared to other ML models, and you limit yourself to one type of model.

4
Q

What are two alternatives to using model-agnostic interpretation methods?

A
  1. Use only interpretable models.
  2. Use model-specific interpretation methods.
5
Q

What is the disadvantage of using model-specific interpretation methods compared to model-agnostic ones?

A

It binds you to one model type and it’s difficult to switch to something else.

6
Q

Name three flexibilities that are desirable aspects of a model-agnostic explanation system:

A
  1. Model flexibility
  2. Explanation flexibility
  3. Representation flexibility
7
Q

Model flexibility (as an aspect of a model-agnostic explanation system)

A

It can work with any ML model, such as random forests and deep neural networks.

8
Q

Explanation flexibility (as an aspect of a model-agnostic explanation system)

A

It’s not limited to a certain form of explanation. For example, a linear formula and a graphic of feature importances are both valid options.

9
Q

Representation flexibility (as an aspect of a model-agnostic explanation system)

A

It’s able to use a different feature representation than the model being explained.

10
Q

How can we further distinguish model-agnostic interpretation methods?

A

Into local and global methods.

11
Q

What do global model-agnostic interpretation methods describe?

A

How features affect the prediction on average.

12
Q

What do local model-agnostic interpretation methods describe?

A

An individual prediction.

13
Q

How are global model-agnostic methods often expressed?

A

As expected values based on the distribution of the data.

14
Q

What is the partial dependence plot?

A

A feature effect plot: the expected prediction when all other features are marginalized out.

15
Q

When are global interpretation methods particularly useful?

A

When you want to understand the general mechanisms in the data or debug a model (since they describe average behavior).

16
Q

PDP (abbreviation)

A

Partial dependence plot

17
Q

PD plot (abbreviation)

A

Partial dependence plot

18
Q

What does the PDP show?

A

The marginal effect one or two features have on the predicted outcome of an ML model. It can show whether the relationship between the target and a feature is linear, monotonic, or more complex.

19
Q

What does xS denote in the PD function for regression?

A

The features for which the PD function should be plotted.

20
Q

What does XC denote in the PD function for regression?

A

The other features (the non-xS features) used in the ML model ^f.

21
Q

How does PD work?

A

By marginalizing the ML model output over the distribution of the features in set C, so that the function shows the relationship between the features in set S we are interested in and the predicted outcome.

22
Q

Give the PD function for regression in the form of an expectation:

A

^fS(xS) = E_XC[ ^f(xS, XC) ].

23
Q

Give the PD function for regression in the form of an integral:

A

^fS(xS) = ∫ ^f(xS, XC) dP(XC).

24
Q

How is the partial function ^fS estimated?

A

By calculating averages in the training data, using the Monte Carlo method.

25
Q

Give the partial function ^fS that is used in the PD function for regression:

A

^fS(xS) = (1/n) * sum(i=1 to n) ^f(xS, x(i)C).

26
Q

What does the partial function ^fS in the PD function for regression tell us?

A

For given values of the features in S, it tells us what the average marginal effect on the prediction is.

27
Q

What does x(i)C denote in the partial function ^fS?

A

Actual feature values from the dataset for the features in which we are not interested.

28
Q

What is n in the partial function ^fS?

A

The number of instances in the dataset.
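The Monte Carlo estimate on the cards above can be written out directly. A minimal sketch, assuming `model` is any fitted object with a `.predict` method and `X` is a 2-D NumPy array (both names are illustrative):

```python
import numpy as np

def partial_dependence_1d(model, X, feature_idx, grid):
    """Estimate ^fS(xS) = (1/n) * sum(i=1 to n) ^f(xS, x(i)C) on a grid."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # fix xS at the grid value
        # average over the actual x(i)C values in the data
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)
```

Note that each grid value is combined with every row of `X`, which is exactly why correlated features in C and S produce unlikely or impossible data points (card 30).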
29
Q

What is the assumption of the PDP about the relationship between C and S?

A

That the features in C are not correlated with the features in S.

30
Q

What happens if the assumption that the features in C are not correlated with the features in S is violated in the PDP?

A

The averages calculated for the PDP will include data points that are unlikely or even impossible.

31
Q

What does the PDP display for classification, where the ML model outputs probabilities?

A

The probability for a certain class given different values for the features in S.

32
Q

What kind of model-agnostic method is the PDP?

A

A global method.
33
Q

How do you calculate the partial dependence for categorical features?

A

Replace the feature value of all data instances with one category value and average the predictions; repeat this for each category.

34
Q

What does a flat PDP indicate?

A

That the feature is not important.

35
Q

How is the importance of a feature defined in the PDP for numerical features?

A

As the deviation of each unique feature value from the average curve.

36
Q

What is the symbol for the importance of a numerical feature?

A

I(xS)
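The importance measure from cards 35-36 can be sketched as the sample deviation of the PD curve around its own average, so a flat curve (card 34) gets importance 0. One common formulation (the function name is illustrative):

```python
import numpy as np

def pdp_importance(pd_values):
    """I(xS): deviation of the PD curve values around their average.

    A flat curve returns 0, matching 'flat PDP => feature not important'.
    """
    pd_values = np.asarray(pd_values, dtype=float)
    k = len(pd_values)
    return np.sqrt(((pd_values - pd_values.mean()) ** 2).sum() / (k - 1))
```

As card 38 warns, this only reflects the main effect of the feature, not its interactions.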
37
Q

Range rule

A

A way of calculating the deviation when you only want a rough estimate and only know the range: deviation ≈ range / 4.

38
Q

Why should the PDP-based feature importance be interpreted with care?

A

It captures only the main effect of the feature and ignores possible feature interactions.

39
Q

Name three disadvantages of the PDP:

A
  1. It doesn't show the feature distribution, so you might overinterpret regions with almost no data.
  2. The assumption of independence between the features in C and S.
  3. Heterogeneous effects might be hidden (averaged out by marginalizing).
40
Q

What does permutation feature importance measure?

A

The increase in the prediction error of the model after we permute the feature's values, which breaks the relationship between the feature and the true outcome.

41
Q

When is a feature important when using permutation feature importance?

A

If shuffling its values increases the model error, because then the model relied on the feature for the prediction.

42
Q

Should you compute permutation feature importance on training or test data?

A

Since permutation feature importance relies on measurements of the model error, you should use unseen test data, so that the error reflects generalization rather than overfitting.
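Permutation feature importance as described on these cards can be sketched in a few lines. Assumptions: `model` has a `.predict` method, the data are NumPy arrays, and mean squared error stands in for whatever error measure you use:

```python
import numpy as np

def permutation_importance_1f(model, X_test, y_test, feature_idx, seed=0):
    """Error increase after shuffling one feature column (higher = more important)."""
    rng = np.random.default_rng(seed)
    base_error = np.mean((model.predict(X_test) - y_test) ** 2)
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, feature_idx])  # breaks the feature-outcome relation
    perm_error = np.mean((model.predict(X_perm) - y_test) ** 2)
    return perm_error - base_error
```

In practice the shuffle is repeated several times and the error increases are averaged, and per card 42 `X_test`/`y_test` should be unseen data.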
43
Q

Global surrogate model

A

An interpretable model that is trained to approximate the predictions of a black-box model.
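A global surrogate can be sketched with scikit-learn; the key point is that the interpretable model is trained on the black box's predictions, not on the true labels (the specific model and data choices below are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
black_box = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Train the surrogate to mimic the black box, not the ground truth.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# R^2 against the black-box predictions measures the surrogate's fidelity:
# how faithfully the shallow tree mimics the complex model.
fidelity = surrogate.score(X, black_box.predict(X))
```

A high fidelity means the interpretable tree can be read as an approximate global explanation of the black box.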
44