LIME Flashcards

1
Q

What does LIME stand for?

A

Local (for each observation; approximates the model locally)

Interpretable (simple for a human to understand)

Model-agnostic (works for any model)

Explanations

2
Q

What kind of models does LIME work for?

A

ANY classifier or regressor

Text explainer

Image explainer

3
Q

How does LIME work?

A
  1. Permute data
  2. Calculate the distance between the permutations and the original observation
  3. Make predictions on the new data using the original model
  4. Pick the m features that best describe the original model's outcome from the permuted data
  5. Fit a simple linear model to the permuted data with the m features, using the similarity scores as weights
  6. The feature weights from the simple model are the explanation of the complex model's local behaviour
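A minimal from-scratch sketch of these steps for tabular data (a fitted model and numpy arrays X_train and instance are assumed; a real implementation such as the lime package handles sampling and feature selection far more carefully):

import numpy as np
from sklearn.linear_model import Lasso

def lime_explain(model, X_train, instance, n_samples=5000, kernel_width=0.75):
    # 1. Permute data: sample around the instance on the training data's scale
    scale = X_train.std(axis=0)
    perturbed = instance + np.random.normal(size=(n_samples, len(instance))) * scale

    # 2. Distance between permutations and the original observation -> similarity weights
    distances = np.sqrt((((perturbed - instance) / scale) ** 2).sum(axis=1))
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))

    # 3. Predictions on the new data using the original (complex) model
    #    (for a classifier, use model.predict_proba(perturbed)[:, 1] instead)
    preds = model.predict(perturbed)

    # 4 & 5. Fit a simple sparse linear model weighted by similarity;
    #        the L1 penalty keeps only the few most descriptive features
    local_model = Lasso(alpha=0.01)
    local_model.fit(perturbed, preds, sample_weight=weights)

    # 6. The coefficients explain the complex model's local behaviour
    return local_model.coef_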
4
Q

Why is interpretable ML important?

(LIME talk)

A

1) Trust: How can we trust the predictions are correct?
2) How can we understand and predict the behaviour?
3) How do we improve the model to prevent potential mistakes? Feature engineering.
4) GDPR: one aspect is that the customer has a right to an explanation in automated decision processes
5) Choosing between competing models
6) Detect and improve untrustworthy models

5
Q

What is the idea of a “pick-step” in the model evaluation process?

A

In model evaluation, certain representative predictions are selected to be explained to the human by an “explainer” like LIME

6
Q

How does LIME work for image classification?

A

1) Take a single image
2) Divide it into components (superpixels)
3) Make perturbed instances by turning components off (i.e., making them gray)
4) Get predictions on these perturbed instances from the original model
5) Learn a simple linear model on these perturbed images
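A hedged sketch with the lime package's image explainer (image and classifier_fn, a function returning class probabilities for a batch of images, are assumed names; parameter values are illustrative):

from lime import lime_image
from skimage.segmentation import mark_boundaries

# Divide the image into components (superpixels) and perturb them
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,
    classifier_fn,
    top_labels=5,
    hide_color=0,       # "turn components off" by filling them with this value
    num_samples=1000,   # number of perturbed instances
)

# Highlight the superpixels that most support the top predicted class
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False
)
overlay = mark_boundaries(img / 255.0, mask)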

7
Q

What is the LIME paper?

A

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier

Ribeiro, Singh, Guestrin

University of Washington

August 2016

8
Q

What are the three contributions of the LIME paper?

A
  • LIME, an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.
  • SP-LIME, a method that selects a set of representative instances with explanations to address the “trusting the model” problem, via submodular optimization.
  • Comprehensive evaluation with simulated and human subjects, where we measure the impact of explanations on trust and associated tasks. In our experiments, non-experts using LIME are able to pick which classifier from a pair generalizes better in the real world. Further, they are able to greatly improve an untrustworthy classifier trained on 20 newsgroups, by doing feature engineering using LIME. We also show how understanding the predictions of a neural network on images helps practitioners know when and why they should not trust a model.
9
Q

What are three natural requirements for the interpretation model?

A
  1. Local accuracy: the prediction of the explainer, g(x'), must match the prediction of the base model, f(x)
  2. Missingness: a simplified input of 0 in the explainer corresponds to a feature that is toggled off
  3. Consistency: if toggling a feature off always makes a bigger difference to the prediction in one model than in another, then that feature's importance should be greater in the first model than in the second
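These properties are stated for additive feature attribution methods, where the explainer g is linear in simplified on/off inputs z' in {0,1}^M (a sketch of the standard form; the phi_i are the feature attributions):

g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i

Local accuracy then means f(x) = g(x') when all of the instance's simplified features are "on", and missingness means x'_i = 0 implies \phi_i = 0.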
10
Q

What can a 2-D projection of the data tell you?

A

1) clusters
2) sparsity
3) outliers
4) hierarchy

A model should learn this structure if it does a good job.

Get an understanding of the data so you can later check that the model understands it too.
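One common way to get such a projection (a sketch; t-SNE is just one option, and X is an assumed feature matrix):

from sklearn.manifold import TSNE
from matplotlib import pyplot as plt

# Project the feature matrix to 2-D; clusters, sparsity, and outliers become visible
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], s=3)
plt.show()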

11
Q

What does a correlation graph do for you?

A

1) Understand relationships that a model should learn
2) See high-dimensional relationships (relationships between variables).

12
Q

What is a Decision Tree Surrogate Model?

A

Take the inputs to a complex model, X, and its outputs, y-hat, and train a single decision tree on them
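A minimal sketch, assuming a trained complex model complex_model and a feature DataFrame X (both names are assumptions):

from sklearn.tree import DecisionTreeRegressor, export_text

# y-hat: the complex model's predictions, not the true labels
y_hat = complex_model.predict(X)

# Train a single shallow decision tree to mimic the complex model
surrogate = DecisionTreeRegressor(max_depth=4)
surrogate.fit(X, y_hat)

# The surrogate's splits serve as an approximate global explanation
print(export_text(surrogate, feature_names=list(X.columns)))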

13
Q

Why should you compare PDP and ICE lines?

A

If you see ICE lines criss-crossing with the PDP line then the PDP line may be misleading; interactions may be at play.

The PDP shows the average.

Look at it side-by-side with a surrogate decision tree.
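One hedged way to overlay ICE lines and the PDP with scikit-learn (assuming a fitted model and a validation frame val_X; the feature name is illustrative):

from sklearn.inspection import PartialDependenceDisplay
from matplotlib import pyplot as plt

# kind="both" draws the individual ICE lines and the averaged PDP line together
PartialDependenceDisplay.from_estimator(model, val_X, features=["pickup_longitude"], kind="both")
plt.show()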

14
Q

What are the characteristics of LIME, TreeInterpreter, and Shapley?

A

LIME can be used on any model (model agnostic; even deep learning)

TreeInterpreter must be used on trees

Shapley is best for trees; takes a row of data and follows its path through the tree; game-theory approach; SHAP is built into XGBoost

In regulated industry: use Shapley

15
Q

What contributes to Gini importance?

A

How high in the tree the variable appears and how often it appears contribute to Gini importance.

16
Q

What is the importance of sensitivity analysis?

A

To test extrapolation… test the model on inputs outside the domain of the training inputs
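A minimal sketch of such a check, assuming a fitted model and a training frame X_train (the column name "age" is hypothetical):

# Push one feature well outside its training range and watch the predictions
probe = X_train.copy()
probe["age"] = X_train["age"].max() * 10

print("mean prediction, in-domain:    ", model.predict(X_train).mean())
print("mean prediction, extrapolated: ", model.predict(probe).mean())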

17
Q

Some important things about LIME:

A

LIME has a fit statistic: it tells you how much to trust the LIME explanation itself

LIME can fail in the case of extreme non-linearity or high-degree interactions

Reason codes are offsets from a local intercept

Try LIME on binned features and manually constructed interactions

Use cross-validation to construct standard deviations or confidence intervals for reason codes.

18
Q

What are the three tools for global tree understanding?

What is the tool for single-instance tree interpretation?

A
  1. Gini importance (mean decrease in impurity across all splits)
  2. Permutation Importance (any model)
  3. Partial Dependence Plot (any model)
  4. TreeInterpreter (single-instance interpretation)
19
Q

What is a global surrogate model?

A

Fit a simple model (e.g., a single CART) from the model inputs to the model output

20
Q

How does LIME work?

A

1) Take an individual instance
2) Make a bunch of new data by perturbing your instance (e.g., turning features on and off)
3) Ask the classifier to make new predictions on the perturbed data
4) Run a locally weighted regression (e.g., LASSO or ElasticNet)
5) The coefficients of that regression tell you the important features
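The same recipe via the lime package for tabular data (a sketch; model, X_train, and row are assumed names):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=X_train.columns.tolist(),
    mode="classification",
)

# LIME perturbs the row, queries the model, and fits a locally weighted sparse linear model
exp = explainer.explain_instance(row.values, model.predict_proba, num_features=5)
print(exp.as_list())  # (feature, weight) pairs = the important local features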

21
Q

How does Shapley Value Explanations work?

A

Per instance, it explains the difference from the mean prediction.

How much does each feature contribute to the prediction?
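A hedged sketch with the shap package for a tree model (tree_model and X are assumed names; for multiclass models shap_values is a list per class):

import shap

# TreeExplainer is the fast, exact variant for tree ensembles (e.g., XGBoost)
explainer = shap.TreeExplainer(tree_model)
shap_values = explainer.shap_values(X)

# expected_value is the mean prediction; the SHAP values explain the difference
# between it and this row's prediction
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :])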

22
Q

What is pandas_profiling?

A

An automated exploratory data analysis report generated from a DataFrame (per-column statistics, distributions, correlations, missing values).

Do this first!!

From Ian Ozsvald
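A sketch of typical usage, assuming a DataFrame df (the package is named ydata-profiling in newer releases, so the import may differ):

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="EDA report")
profile.to_file("report.html")  # per-column stats, distributions, correlations, missing values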

23
Q

What is sklearn DummyClassifier?

A

Just predicts the majority class!

Acts as a baseline model.
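A minimal sketch (train_X/train_y/val_X/val_y are assumed names):

from sklearn.dummy import DummyClassifier

# strategy="most_frequent" always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train_X, train_y)
print("baseline accuracy:", baseline.score(val_X, val_y))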

24
Q

What does scikit-learn RepeatedKFold do for you?

A

Does a loop x times with different splits. You can get a variance of the model prediction in this case: a distribution of model outputs. You should compare these distributions when doing model selection.
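A sketch of getting that distribution of scores (model, X, and y are assumed names):

from sklearn.model_selection import RepeatedKFold, cross_val_score

# 5 folds repeated 10 times with different shuffles -> 50 scores
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())  # compare these distributions across candidate models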

25
Q

What are Ian’s steps for building a classifier?

A

1) Look at the pandas_profiling report
2) Make a baseline
3) Make an improved model
4) Look at the RepeatedKFold prediction distributions of both
5) Look at the confusion matrix; look at the confusion distribution
6) Look at the worst errors by row; look at the X matrix for all these errors; this surfaces data errors; mis-codings of the data come out
7) t-SNE: find regions that cluster together with bad predictions
8) SHAP

26
Q

How do you do Permutation Importance in eli5?

A

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)

eli5.show_weights(perm, feature_names=val_X.columns.tolist())

27
Q

What is the difference between

  • feature importance,
  • partial dependence,
  • SHAP values?
A

While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions.

This is useful to answer questions like:

Controlling for all other house features, what impact do longitude and latitude have on home prices? To restate this, how would similarly sized houses be priced in different areas?

Are predicted health differences between two groups due to differences in their diets, or due to some other factor?

If you are familiar with linear or logistic regression models, partial dependence plots can be interpreted similarly to the coefficients in those models. Though, partial dependence plots on sophisticated models can capture more complex patterns than coefficients from simple models.

SHAP Values (an acronym from SHapley Additive exPlanations) break down an individual prediction to show the impact of each feature! (can be used for model-level insights too)

28
Q

What is the code to run a PDP plot?

A

from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

feat_name = 'pickup_longitude'
pdp_dist = pdp.pdp_isolate(model=first_model, dataset=val_X, model_features=base_features, feature=feat_name)

pdp.pdp_plot(pdp_dist, feat_name)
plt.show()

29
Q

What is the code to run a PDP interaction plot?

A

from matplotlib import pyplot as plt
from pdpbox import pdp

features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1 = pdp.pdp_interact(model=tree_model, dataset=val_X, model_features=feature_names, features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

30
Q

What does the y-axis represent in the PDP plot?

A

It is the change in the prediction relative to the baseline, i.e., the prediction at the leftmost value of the feature in question

31
Q

Would you expect the PDP plot to change for featureX if you introduce a new important featureY?

A

YES! PDP is always with respect to a model. If you control for another important variable, the PDP will change.

32
Q

Consider a scenario where you have only 2 predictive features, which we will call feat_A and feat_B. Both features have minimum values of -1 and maximum values of 1. The partial dependence plot for feat_A increases steeply over its whole range, whereas the partial dependence plot for feat_B increases at a slower rate (less steeply) over its whole range. Does this guarantee that feat_A will have a higher permutation importance than feat_B? Why or why not?

A

No. This doesn’t guarantee feat_A is more important. For example, feat_A could have a big effect in the cases where it varies, but could have a single value 99% of the time. In that case, permuting feat_A wouldn’t matter much, since most values would be unchanged.

33
Q

What will the PDP and Permutation Importance look like for X1:

n_samples = 20000

X1 = 4 * rand(n_samples) - 2
X2 = 4 * rand(n_samples) - 2
y = X1 * X2
A

The PDP for X1 is flat.

X1 will be very important (high permutation importance).

This is because the effect of X1 on y comes only from the interaction with X2. X1 does affect the predictions, so permuting it changes them and it gets high permutation importance. But averaged over X2, the effect of X1 is 0, so the PDP is flat.
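A hedged sketch that demonstrates this with a random forest and scikit-learn's inspection tools (the exact return keys of partial_dependence vary by scikit-learn version):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, partial_dependence

n_samples = 20000
rng = np.random.RandomState(0)
X1 = 4 * rng.rand(n_samples) - 2
X2 = 4 * rng.rand(n_samples) - 2
X = np.column_stack([X1, X2])
y = X1 * X2

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Permutation importance: shuffling X1 destroys the interaction, so it is large
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print("permutation importances:", perm.importances_mean)

# Partial dependence of X1: averaging over X2 cancels the effect, so it is ~flat
pd_result = partial_dependence(model, X, features=[0], grid_resolution=20)
print("PDP values for X1:", pd_result["average"].round(2))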
