Reliable AI Flashcards
1
Q
data shift (concept shift/drift, fracture point)
A
- the training and test distributions differ
- the model's predictions can no longer be trusted
- causes:
- sample selection bias
- non-stationary (spatial or temporal) shifts in the data
- formula: Ptest(y, x) ≠ Ptrain(y, x)
2
Q
covariate shift
A
- change in the distribution of the covariates (independent variables)
- causes:
- temporal or spatial changes
- data sparsity
- biased feature selection
- class shift
- formula: Ptest(y|x) = Ptrain(y|x) and Ptrain(x) ≠ Ptest(x)
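A covariate shift in a single feature can be checked by comparing the empirical distributions of x in the training and test sets. A minimal sketch using the two-sample Kolmogorov-Smirnov statistic (function name is illustrative):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        # Empirical CDF value of each sample at x.
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Identical samples -> statistic 0; fully disjoint samples -> statistic 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]))    # 0.0
print(ks_statistic([1, 2, 3], [10, 20, 30])) # 1.0
```

A large statistic on a feature suggests Ptrain(x) ≠ Ptest(x) for that feature.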
3
Q
prior probability shift
A
- change in the distribution of the class variable
- causes:
- class imbalance
- formula: Ptest(x|y) = Ptrain(x|y) and Ptrain(y) ≠ Ptest(y)
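Prior probability shift can be surfaced by comparing empirical class frequencies directly. A minimal sketch (the total variation distance is one illustrative choice of comparison):

```python
from collections import Counter

def class_priors(labels):
    """Empirical class prior P(y) from a list of labels."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}

train_y = ["spam"] * 10 + ["ham"] * 90
test_y = ["spam"] * 40 + ["ham"] * 60

# Total variation distance between the two priors.
p, q = class_priors(train_y), class_priors(test_y)
classes = set(p) | set(q)
tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)
print(round(tvd, 3))  # 0.3 -> the label distribution has shifted noticeably
```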
4
Q
concept shift
A
- change in the relationship between the input and output distributions
- e.g. the mapping from inputs to outputs differs before and after a financial crisis
- formula:
- Ptrain(y|x) ≠ Ptest(y|x) and Ptrain(x) = Ptest(x), for X→Y problems
- Ptrain(x|y) ≠ Ptest(x|y) and Ptrain(y) = Ptest(y), for Y→X problems
5
Q
internal covariate shift
A
- shift in the distribution of layer activations during training
- batch normalization should hasten learning (it also regularizes the network by adding noise)
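The normalization step that batch norm applies per feature can be sketched as follows (a simplified forward pass; the full layer also applies a learned scale gamma and shift beta, omitted here):

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) of a batch to zero mean and
    unit variance -- the core of a batch-normalization forward pass.
    The learned scale (gamma) and shift (beta) are omitted."""
    n_features = len(batch[0])
    normalized = [row[:] for row in batch]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        std = math.sqrt(var + eps)  # eps avoids division by zero
        for i, row in enumerate(normalized):
            row[j] = (batch[i][j] - mean) / std
    return normalized

out = batch_norm([[1.0, 100.0], [3.0, 300.0]])
# Each column now has mean ~0 and unit scale, regardless of its
# original range -- this is what keeps activation distributions stable.
```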
6
Q
statistical similarity
A
- non-parametric (univariate): KL divergence, Jensen-Shannon divergence, population stability index, Wasserstein distance
- multivariate: for high-dimensional and unstructured datasets
- isolation forest
- k-d trees
- variational autoencoder
- normalizing flow
7
Q
statistical distance
A
- compare histograms of the training data over time
- population stability index: measures how much a variable's binned distribution has drifted between two samples
- Kolmogorov-Smirnov statistic: maximum distance between two empirical CDFs
- Kullback-Leibler divergence: expected log-ratio between two distributions
- problems:
- not good for high-dimensional features
- not good for sparse features
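The population stability index from the list above can be computed from binned histograms of the same feature at two points in time. A minimal sketch (the bin fractions and eps guard are illustrative choices):

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population stability index between two binned distributions.
    Inputs are per-bin fractions that each sum to 1; eps guards
    against empty bins inside the log."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Identical histograms -> PSI of 0 (no shift).
print(psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]))  # 0.0
# Probability mass moving between bins raises the PSI.
print(round(psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40]), 3))
```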
8
Q
novelty detection
A
- fit a model of the training distribution and flag points that fall outside it
- e.g. a one-class support vector machine
- good: handles complex feature interactions
- bad: cannot tell you explicitly what has changed
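As a minimal stand-in for the one-class SVM idea (fit a model of the training distribution, flag low-density points), here is a univariate Gaussian novelty detector; this is a simplified substitute, not the SVM itself, and the three-standard-deviation threshold is an illustrative choice:

```python
import statistics

def fit_gaussian(train_values):
    """Model the training distribution as a single Gaussian."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def is_novel(x, mean, std, n_sigmas=3.0):
    """Flag points farther than n_sigmas standard deviations
    from the training mean as out-of-distribution."""
    return abs(x - mean) > n_sigmas * std

mean, std = fit_gaussian([9.8, 10.1, 10.0, 9.9, 10.2])
print(is_novel(10.05, mean, std))  # False: inside the training distribution
print(is_novel(25.0, mean, std))   # True: far outside it
```

Note the card's caveat holds here too: the detector says a point is novel, but not which aspect of the data changed.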
9
Q
discriminative distance
A
- less common method
- train a classifier to detect whether a point comes from the source (training) domain or the target (test) domain
- use the classifier's training error as a proxy for the distance between the distributions (high error means the distributions are close)
- pros:
- may be the only feasible option for some deep learning settings (e.g. NLP)
- good for sparse data
- good for high dimensions
- problems:
- can only be done offline
- more complicated than other methods
10
Q
how to handle dataset shift
A
- remove features: set a boundary of acceptable shift per feature; if a feature shifts too much, remove it and retrain
- remove a feature only if it is not important
- reweight importance: upweight training instances that are more similar to the test instances
- adversarial search: use an adversarial model that is robust to feature deletion
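The reweighting strategy above is often implemented as importance weighting: weight each training point by an estimate of the density ratio Ptest(x)/Ptrain(x). A minimal sketch estimating that ratio from histograms (the bin edges are an illustrative choice):

```python
def importance_weights(train_x, test_x, edges):
    """Weight each training point by an estimate of Ptest(x)/Ptrain(x),
    computed from histogram bin fractions."""
    def bin_index(x):
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                return i
        return len(edges) - 2  # clamp values at/above the last edge

    n_bins = len(edges) - 1
    train_counts = [0] * n_bins
    test_counts = [0] * n_bins
    for x in train_x:
        train_counts[bin_index(x)] += 1
    for x in test_x:
        test_counts[bin_index(x)] += 1

    weights = []
    for x in train_x:
        i = bin_index(x)
        p_train = train_counts[i] / len(train_x)
        p_test = test_counts[i] / len(test_x)
        weights.append(p_test / p_train if p_train > 0 else 0.0)
    return weights

# Training data is mostly small values, test data mostly large ones,
# so the large training value gets upweighted (~3x) and the small
# ones downweighted (~1/3x).
w = importance_weights([1, 1, 1, 9], [9, 9, 9, 1], edges=[0, 5, 10])
print(w)
```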