ML 2 VO - Deck 1 Flashcards

1
Q

Expectation

A

the average value of a random variable, weighted by the probability of each outcome
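
A minimal numpy sketch of the discrete case E[X] = sum_x x*p(x) (the values and probabilities are made up for illustration):

import numpy as np

# discrete random variable: values and their probabilities (must sum to 1)
values = np.array([1.0, 2.0, 3.0])
probs = np.array([0.5, 0.3, 0.2])

# expectation: probability-weighted average of the values
expectation = np.sum(values * probs)
print(expectation)  # 1*0.5 + 2*0.3 + 3*0.2 = 1.7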

2
Q

decision theory

A

studies optimal decisions: what is the best achievable performance of a machine learning algorithm given perfect knowledge of the underlying data distribution?

3
Q

entropy

A

H[X]
the average information content, i.e. the expected amount of information from a random variable X:
H[X] = E[h(X)] = −∑ p(x) log p(x)

h(x) = −log p(x)
information content of a single outcome
rare events carry more information than common events
the negation ensures that the information content is non-negative
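
A minimal numpy sketch (the distribution is illustrative):

import numpy as np

# entropy H[X] = -sum_x p(x) * log p(x)
p = np.array([0.5, 0.25, 0.25])

# information content of each outcome: rare events carry more information
h = -np.log2(p)    # in bits; use np.log for nats
H = np.sum(p * h)  # expected information content
print(H)           # 1.5 bits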

4
Q

loss functions

A

assess the performance of a learning algorithm
compute the loss between the predicted and the true label/value

the user defines the loss function
binary classification: 0-1 loss
multiclass classification: 0-1 loss
regression: square loss / absolute loss
structured prediction: Hamming loss
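
Hedged sketches of the losses listed above (function names are mine, not from the course):

import numpy as np

def zero_one_loss(y_true, y_pred):  # classification
    return float(y_true != y_pred)

def square_loss(y_true, y_pred):    # regression
    return (y_true - y_pred) ** 2

def absolute_loss(y_true, y_pred):  # regression
    return abs(y_true - y_pred)

def hamming_loss(y_true, y_pred):   # structured prediction:
    # fraction of positions where the predicted structure differs
    return np.mean(np.array(y_true) != np.array(y_pred))

print(zero_one_loss(1, 0))                 # 1.0
print(hamming_loss([1, 0, 1], [1, 1, 1]))  # ~0.333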

5
Q

risks

A

aka generalization performance / testing error

expected value of the loss function (average of the loss over all possible outcomes, weighted by their probabilities)

expected risk: expected loss between output and prediction; a deterministic function of the predictor (expectation over the data distribution)

empirical risk: training error; a random function (it depends on the sampled training data)
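
A minimal sketch of the empirical risk as an average of losses over training pairs (names illustrative):

import numpy as np

# the empirical risk is the average loss on the training set; it is a random
# quantity because it depends on the sampled data
def empirical_risk(loss, y_true, y_pred):
    return np.mean([loss(t, p) for t, p in zip(y_true, y_pred)])

square = lambda t, p: (t - p) ** 2
print(empirical_risk(square, [1.0, 2.0], [1.5, 2.0]))  # 0.125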

6
Q

bayes risk

A

minimum possible expected loss (risk)

What is the best prediction function f(x), regardless of the training data?

minimize R(f) over the predictor f by setting f(x’) equal to a z ∈ Y that minimizes the conditional risk r(z|x’) independently for all x’

r(z|x’) = E_y[ l(y,z) | x=x’ ]
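
A hedged sketch on a toy discrete problem with an assumed known conditional p(y|x), minimizing the conditional risk pointwise:

import numpy as np

# the Bayes predictor minimizes r(z|x') = E_y[l(y, z) | x = x']
# independently for every x'
p_y_given_x = {0: [0.9, 0.1],  # P(y=0|x), P(y=1|x) for x = 0
               1: [0.3, 0.7]}  # ... for x = 1

def loss(y, z):  # 0-1 loss
    return float(y != z)

def bayes_predict(x):
    cond = p_y_given_x[x]
    # conditional risk of each candidate decision z
    r = [sum(cond[y] * loss(y, z) for y in (0, 1)) for z in (0, 1)]
    return int(np.argmin(r))

print(bayes_predict(0), bayes_predict(1))  # 0 1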

7
Q

bayes predictors

A

the expected risk is minimized at a Bayes predictor; the Bayes predictor achieves the Bayes risk

the Bayes predictor is not always unique, but the value of the Bayes risk is the same for all Bayes predictors

the Bayes risk usually cannot be driven to zero, due to noise in the data

8
Q

Excess Risk

A

the difference between the risk of a given decision rule or predictor f and the Bayes risk: R(f) − R*

9
Q

Expected “0-1” loss for classification

A

Bayes predictor for Y = {0, 1}
for the 0-1 loss, the Bayes predictor is the maximum a posteriori (MAP) classifier:
the decision rule that selects the class with the highest posterior probability P(y|x)
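
A minimal sketch: in the binary case with 0-1 loss, the MAP rule reduces to thresholding the posterior at 1/2:

def map_classifier(p_y1_given_x):
    # select the class with the larger posterior probability
    return 1 if p_y1_given_x > 0.5 else 0

print(map_classifier(0.8))  # 1
print(map_classifier(0.2))  # 0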

10
Q

Empirical Risk Minimization (ERM):

A

find the best model or predictor by minimizing the empirical risk (= the average loss over a given set of training data)
by minimizing this risk, the model should also perform well on unseen data (generalization)
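
A minimal sketch of ERM over a tiny, assumed finite class of threshold classifiers with the 0-1 loss:

import numpy as np

X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0, 0, 1, 1])

def emp_risk(threshold):
    preds = (X > threshold).astype(int)
    return np.mean(preds != y)  # empirical 0-1 risk

# ERM: pick the hypothesis with the smallest average training loss
candidates = [0.2, 0.5, 0.8]
best = min(candidates, key=emp_risk)
print(best, emp_risk(best))     # 0.5 0.0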

11
Q

Risk decomposition

A

the risk is decomposed into an estimation error and an approximation error:

R(f̂) − R* = [R(f̂) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R*]
(estimation error: due to the finite training sample; approximation error: due to restricting predictors to the class F)

12
Q

Capacity Control

A

deals with controlling the complexity or expressive power of the model or hypothesis space

balances the trade-off between fitting the training data and ensuring good generalization; avoids overfitting, underfitting, or a wrong model caused by high bias

restrict the number of parameters or the norm of the weights

techniques: model selection, regularization, early stopping, cross-validation
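
A minimal sketch of norm-based capacity control, here an assumed ridge-style penalty added to the empirical risk:

import numpy as np

# lam trades data fit against model complexity (the norm of the weights)
def regularized_objective(w, X, y, lam):
    data_fit = np.mean((X @ w - y) ** 2)  # empirical risk (square loss)
    penalty = lam * np.sum(w ** 2)        # norm restriction on the weights
    return data_fit + penalty

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
print(regularized_objective(np.array([1.0, 2.0]), X, y, lam=0.1))  # 0.5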

13
Q

Statistical Learning Theory

A

provides guarantees for the performance of a predictor on unseen data and helps understand the factors affecting its generalization ability

key concepts: empirical risk minimization, capacity control, risk decomposition, bias-variance trade-off

14
Q

Fixed design vs random design

A

fixed design:
the input data is assumed not to be random; the goal is a small prediction error on that data only; the predictor variables are fixed

random design:
both input and output are random; the goal is good generalization; the predictor variables are random variables, assumed to be drawn from a distribution

15
Q

Statistical properties of the OLS estimator

A

unbiasedness
among all linear unbiased estimators, it has the smallest variance (Gauss-Markov theorem)
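
A minimal numpy sketch of the OLS estimator via the normal equations (the data is simulated):

import numpy as np

# w_hat = (X^T X)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=100)  # noisy linear model

w_hat = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
print(w_hat)                                 # close to [1, -2]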

16
Q

difference between dimension-dependent and dimension-independent excess RLS risk

A

improved generalization: dimension-independent bounds adapt better to different dimensionalities, since they do not degrade as the dimension grows

17
Q

explain kernels, popular kernels

A

no explicit feature transform -> instead use a (kernel) function that returns the inner product directly

allows us to apply linear classifiers to non-linear problems by mapping non-linear data into a higher-dimensional space

kernels measure the similarity or inner product between two data points in a high-dimensional space

popular kernels: uniform, linear, polynomial/quadratic/Epanechnikov, Gaussian
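
Hedged sketches of some of the kernels named above, as functions of two points x and z (parameter defaults are mine):

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):  # a.k.a. RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
# 0.0 1.0 0.367...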

18
Q

SVM

A

find the separating hyperplane that maximizes the margin, i.e. the smallest perpendicular distance from the hyperplane to the closest data points (the support vectors)
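
A minimal sketch computing the geometric margin of a given hyperplane (SVM training would maximize this quantity over w and b; the data here is illustrative):

import numpy as np

# margin of the hyperplane w.x + b = 0: smallest perpendicular distance
# from the hyperplane to any data point
def margin(w, b, X, y):
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])
print(margin(np.array([1.0, 0.0]), 0.0, X, y))  # 2.0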

19
Q

rademacher complexity

A

quantifies the ability of a function class to fit random noise; often used in statistical learning theory to derive generalization bounds

expectation of the maximal dot product between a and the noise epsilon

can be used to derive data-dependent upper bounds on the learnability of function classes; a function class with smaller Rademacher complexity is easier to learn
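
A hedged Monte-Carlo sketch estimating the empirical Rademacher complexity of an assumed finite set A:

import numpy as np

# E_eps[ max_{a in A} (1/n) <a, eps> ] with eps uniform in {-1, +1}^n
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0]])  # two "functions" on n = 4 points
n = A.shape[1]

samples = []
for _ in range(10000):
    eps = rng.choice([-1.0, 1.0], size=n)  # Rademacher noise
    samples.append(np.max(A @ eps) / n)    # best correlation with the noise
print(np.mean(samples))                    # estimated complexity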

20
Q

Lipschitz constant

A

a measure of how much a function can change with respect to a change in its input:
|f(x) − f(y)| ≤ L · |x − y| for all x, y

provides certain guarantees on the stability and smoothness of the function

21
Q

When are sparse methods useful?

A

when the feature vector phi(x) is large
assume that only a small number of features are relevant

22
Q

Which two types of sparsity inducing regularization terms do you know?

A

l0 penalty: non-convex, counts the number of non-zero elements in the weight vector
l1 penalty: convex, sums the absolute values of the weights
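
A minimal sketch of both penalties on an illustrative weight vector w:

import numpy as np

w = np.array([0.0, 3.0, 0.0, -1.0])
l0 = np.count_nonzero(w)  # number of non-zero weights (non-convex)
l1 = np.sum(np.abs(w))    # sum of absolute values (convex)
print(l0, l1)             # 2 4.0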

23
Q

LASSO

A

least absolute shrinkage and selection operator: l1-regularized least squares,
min_w ||y − Xw||² + λ ||w||_1

24
Q

What is the advantage of the l1 regularized model over the l0 regularized model?

A

convexity -> a global minimum can be found
efficiency: l1 problems can be solved efficiently, while l0-regularized problems are NP-hard
stability: less prone to overfitting

25
Q

FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)

A

accelerated version of Proximal Gradient Descent that uses a momentum term to speed up convergence
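
A hedged sketch of FISTA applied to an assumed lasso objective ||Xw − y||²/2 + λ||w||_1 (step size and data are illustrative):

import numpy as np

# proximal operator of the l1 penalty: soft-thresholding
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(X, b, lam, n_iter=200):
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    z, t = w.copy(), 1.0           # extrapolated point and momentum scalar
    for _ in range(n_iter):
        grad = X.T @ (X @ z - b)                      # gradient of smooth part
        w_new = soft_threshold(z - grad / L, lam / L) # proximal gradient step
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = w_new + ((t - 1) / t_next) * (w_new - w)  # momentum extrapolation
        w, t = w_new, t_next
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1]                 # only two relevant features
print(np.round(fista(X, y, lam=1.0), 2))  # sparse: mass on the first two weights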