ML Flashcards

1
Q

Which of the following best describes the output of the
Metropolis-Hastings algorithm?

A

A random sample from the posterior
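
A minimal sketch (illustrative names, symmetric random-walk proposal assumed) of how the algorithm turns an unnormalized log-posterior into such samples:

```python
# Minimal random-walk Metropolis-Hastings sketch: the chain states are (correlated)
# random samples from the target posterior, known only up to a normalizing constant.
import numpy as np

def metropolis_hastings(log_post, x0, n_steps=5000, step=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal()    # symmetric proposal
        log_alpha = log_post(proposal) - log_post(x)   # acceptance log-ratio
        if np.log(rng.uniform()) < log_alpha:          # accept with prob min(1, alpha)
            x = proposal
        samples.append(x)                              # kept states ~ posterior
    return np.array(samples)

# Example: draw samples from a standard-normal "posterior"
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
```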

2
Q

After the clustering has converged, which output (i.e., parameter of K-Means)
needs to be recalculated outside the loop before returning the outputs such
that both outputs are consistent with each other?

A

Cluster assignments: the last step inside the loop recalculates the means and then checks
convergence. Thus, we should recalculate the cluster assignments based on the final
set of means.
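
A minimal K-Means sketch (hypothetical implementation, empty clusters ignored) showing the extra recomputation after the loop:

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    means = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iter):
        assign = np.linalg.norm(X[:, None] - means[None], axis=2).argmin(axis=1)
        new_means = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        converged = np.allclose(new_means, means)           # means updated last, then checked
        means = new_means
        if converged:
            break
    # the loop ends right after updating the means, so the assignments must be
    # recomputed from the *final* means to keep both outputs consistent
    assign = np.linalg.norm(X[:, None] - means[None], axis=2).argmin(axis=1)
    return means, assign
```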

3
Q

In practice, it is common to apply PCA prior to K-Means. What is the main
motivation for this pre-processing?

A

Decorrelation: K-Means does not use covariance information and implicitly assumes
the features are uncorrelated. PCA is therefore primarily used to decorrelate the
data. Reducing dimensionality may be a secondary motivation.
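
A hedged sklearn-style sketch of this pre-processing; the scaler, component count, and cluster count are illustrative choices:

```python
# Hypothetical pipeline: PCA decorrelates the features (and can optionally reduce
# dimensionality) before K-Means, whose distance computation ignores covariance.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pipeline = make_pipeline(
    StandardScaler(),             # put features on comparable scales
    PCA(),                        # rotate onto uncorrelated principal components
    KMeans(n_clusters=3, n_init=10),
)
# labels = pipeline.fit_predict(X)   # X: (n_samples, n_features) array
```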

4
Q

What is the difference between interpretability and explainability?

A

Interpretability concerns
models that are self-explanatory.
Explainability is used to provide
individual explanations to understand black-box models.

5
Q

Which three properties do you think are the most important properties of
individual explanations?

A

Fidelity, Plausibility, Confidence

6
Q

Which categories in the interpretable ML taxonomy given by Molnar apply to
(interpretation of) a linear regression model?

A

Modular, intrinsic, model-specific

7
Q

In Generalized Additive Models, how can we model quadratic feature interactions in component functions?

A

By using decision trees with limited depth to model the component functions, e.g. max
depth 2 if the features along a path are different, or deeper as long as at most 2 distinct
features are used in the tree.
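
A hedged sketch with synthetic data: a single pairwise component function modelled by a depth-2 regression tree on the two chosen features:

```python
# A depth-2 tree can split on at most two (different) features along any root-to-leaf
# path, so it captures up to pairwise interactions for one GAM component.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 4))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(500)

# component function f_01(x_0, x_1) restricted to the chosen feature pair
f_01 = DecisionTreeRegressor(max_depth=2).fit(X[:, [0, 1]], y)
contribution_01 = f_01.predict(X[:, [0, 1]])   # this component's additive contribution
```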

8
Q

Under what constraints (imposed during training) can a GAM generate a scoring model?

A

We restrict the GAM component functions to use only a weighted sum of indicator functions, where the weights are constrained to be integers.
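
A hypothetical worked example of such a scoring model (thresholds and integer weights are invented for illustration):

```python
# Hypothetical scoring model: each GAM component is an integer-weighted sum of
# indicator functions, so a prediction is just adding up points.
def risk_score(age, systolic_bp, smoker):
    score = 0
    score += 2 * (age > 60)            # indicator with integer weight 2
    score += 1 * (systolic_bp > 140)   # indicator with integer weight 1
    score += 3 * int(smoker)           # indicator with integer weight 3
    return score                       # e.g. flag "high risk" if score >= 4

print(risk_score(age=65, systolic_bp=150, smoker=False))   # -> 3
```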

9
Q

Given the illustration of the trained decision tree model for the ‘Play Tennis’
task, provide an explanation of the model's decision for the following example
using one of the explainability methods discussed in the lecture: [Outlook = Sunny,
Temperature = Mild, Humidity = Normal, Windy = True].
This example follows the Outlook: Sunny → Humidity: Normal → Yes path.

A

IF (Outlook = ‘Sunny’) AND (Humidity = ‘Normal’) THEN Output = Yes

10
Q

Assume that two variables, namely xj and xk, interact and that xj is a
causal ancestor of xk.
Why does PDP fall short in visualizing the univariate causal effect of xj on
the model?

A

Because if xk depends on xj, the expected value of xk is determined by xj.
Hence, when we iterate over xj independently, the dependent variable would have to be
adjusted accordingly.
When we pass over the dataset with an arbitrary fixed value of xj, the generated samples contain unrealistic combinations of these features, and thus the resulting statistics are not correct.
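
A minimal sketch of the PDP computation that makes the issue explicit: only xj is overridden on the grid, while xk keeps its observed values (`model` stands for any fitted estimator with a `predict` method):

```python
import numpy as np

def partial_dependence(model, X, j, grid):
    # classic PDP: force x_j to each grid value, leave every other column (incl. x_k)
    # at its observed value, then average the predictions
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v                       # unrealistic (x_j, x_k) pairs can appear here
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)
```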

11
Q

Assume that two variables, namely xj and xk, interact and that xj is a
causal ancestor of xk.
How can we handle this issue using PDP?

A

A solution would be to analyze the PDP for these interacting variables jointly.
Even if a grid search is conducted to generate the value pairs, the univariate histograms shown along the axes can reveal which combinations are realistic.
Moreover, if the causal relation is known, the dependent variable of the two can be drawn from the conditional distribution p(xk|xj).
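
A hedged sketch of the second idea, resampling xk from a crude estimate of p(xk|xj) so that the evaluated points stay realistic (the bandwidth and resampling scheme are illustrative assumptions):

```python
import numpy as np

def conditional_dependence(model, X, j, k, grid, bandwidth=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v
        # crude estimate of p(x_k | x_j ~ v): resample x_k from nearby observations
        near = np.abs(X[:, j] - v) < bandwidth
        if near.any():
            X_mod[:, k] = rng.choice(X[near, k], size=len(X))
        values.append(model.predict(X_mod).mean())
    return np.array(values)
```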

12
Q

In permutation feature importance, the assumption is that when perturbed,
more influential features cause ………… error.

A

higher
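
A minimal sketch of permutation feature importance under exactly this assumption (`error_fn` and `model` are placeholders for any metric and fitted estimator):

```python
import numpy as np

def permutation_importance(model, X, y, error_fn, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    base_error = error_fn(y, model.predict(X))
    importances = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the feature's link to y
        # assumption: the more influential the feature, the higher the error increase
        importances.append(error_fn(y, model.predict(X_perm)) - base_error)
    return np.array(importances)
```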

13
Q

What are the drawbacks of global surrogates?

A

They may not model the global complexity of the original model.
They may not model feature interactions.
Results may reflect their own structural bias.
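
For context, a hedged sketch of fitting a global surrogate, whose shallow-tree simplicity is also the source of these drawbacks:

```python
# A shallow decision tree is trained to mimic the black-box predictions; being simple
# and axis-aligned, it may miss global complexity and feature interactions, and its
# explanations inherit the tree's own structural bias.
from sklearn.tree import DecisionTreeRegressor

def fit_global_surrogate(black_box, X, max_depth=3):
    y_hat = black_box.predict(X)                      # surrogate targets = model outputs
    surrogate = DecisionTreeRegressor(max_depth=max_depth).fit(X, y_hat)
    fidelity_r2 = surrogate.score(X, y_hat)           # how faithfully it imitates the model
    return surrogate, fidelity_r2
```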

14
Q

Which term in the optimization objective of LIME aims to ensure ‘local fidelity’
with respect to an instance of interest?

A

The proximity term π_k(z), which weights each perturbed sample z by its closeness to the instance of interest.
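
As a hedged illustration (the exact kernel and width used in the lecture may differ), this proximity term is commonly an exponential kernel over a distance measure:

```python
import numpy as np

def proximity_kernel(x, z, width=0.75):
    distance = np.linalg.norm(x - z)                  # D(x, z), e.g. Euclidean or cosine
    return np.exp(-(distance ** 2) / (width ** 2))    # weight of perturbed sample z near x
```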

15
Q

What are the open problems concerning LIME?

A

measuring similarity / proximity
instance sampling
choosing the interpretable version depending on task

16
Q

Assume that you work as a machine learning expert in a company and
there are a range of critical tasks on which black-box models are running. You are
tasked with building an explanation system for each of these models using ‘LIME’.
What would your choice of an ‘interpretable representation’ for the models with the
following original input representations be?
Binary tabular input: x in {0,1}^d, where d is feature dimensionality.

A

This is the simple, interpretable representation that we would like to get. Therefore,
no transformation is needed.

17
Q

Assume that you work as a machine learning expert in a company and
there are a range of critical tasks on which black-box models are running. You are
tasked with building an explanation system for each of these models using ‘LIME’.
What would your choice of an ‘interpretable representation’ for the models with the
following original input representations be?
Continuous tabular input: x in R^d

A

We can binarize the continuous features using a suitable threshold. The threshold can
be the training-set mean or median, i.e. a statistic measuring central tendency. This is
also very application dependent: for example, in a medical setting, a value within the
normal range can be represented as 0 and any value outside it as 1.
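
A minimal sketch of such a binarization using training-set medians as the thresholds (purely illustrative):

```python
import numpy as np

def binarize(X_train, X):
    thresholds = np.median(X_train, axis=0)     # per-feature central tendency
    return (X > thresholds).astype(int)         # 1 = above the typical/"normal" value
```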

18
Q

Assume that you work as a machine learning expert in a company and
there are a range of critical tasks on which black-box models are running. You are
tasked with building an explanation system for each of these models using ‘LIME’.
What would your choice of an ‘interpretable representation’ for the models with the
following original input representations be?
Free text input

A

We can get a binary bag-of-words (BoW) representation, where a word is represented
with a 1 if it appears in the corresponding document/input to explain, and a 0 otherwise.
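
A hedged sklearn sketch of this binary bag-of-words representation (the example sentences are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)   # 1 if a word occurs in the document, else 0
Z = vectorizer.fit_transform(["the loan was denied", "the loan was approved quickly"])
print(vectorizer.get_feature_names_out())
print(Z.toarray())
```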

19
Q

Assume you work in an HR consultancy company and you are tasked with
developing an interpretable applicant classification model for a given position. There
are two binary and three continuous features and the target variable (invite to
interview) is binary. What would you do? Which model family / classification
method would you use? Explain your answer.

A

First, the answer should be an intrinsically interpretable model, such as logistic regression (LR), an SVM with a linear kernel, a GAM, or a decision tree (DT).
Among these, a GAM is the most suitable, as it can also handle feature interactions while
remaining interpretable without any preprocessing.
GAMs, SVMs and LR can process the binary/continuous input without any issues and can yield interpretable models.
However, for the DT to yield an interpretable model, we need to discretize the continuous features.
This can be done via binarization using the training-set mean as the threshold.
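
One hypothetical concrete realization, assuming the interpret package is available: an Explainable Boosting Machine (a tree-based GAM) fitted on toy data with two binary and three continuous features:

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, size=(200, 2)),     # two binary features
    rng.standard_normal((200, 3)),         # three continuous features
])
y = rng.integers(0, 2, size=200)           # invite to interview (toy labels)

model = ExplainableBoostingClassifier().fit(X, y)
# model.explain_global() exposes the per-feature shape functions for inspection
```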