ML 2 VO - Deck 1 Flashcards

1
Q

Expectation

A

the average value of a random variable, weighted by the probability of each outcome
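
A minimal numpy sketch of the discrete case E[X] = sum_x x*p(x) (the values and probabilities are made up for illustration):

import numpy as np

# discrete random variable: values and their probabilities (must sum to 1)
values = np.array([1.0, 2.0, 3.0])
probs = np.array([0.5, 0.3, 0.2])

# expectation: probability-weighted average of the values
expectation = np.sum(values * probs)
print(expectation)  # 1*0.5 + 2*0.3 + 3*0.2 = 1.7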

2
Q

decision theory

A

studies optimal decisions: what is the best achievable performance of a machine learning algorithm given perfect knowledge of the underlying data distribution?

3
Q

entropy

A

H[X]
the average information content, i.e. the expected amount of information from a random variable X:
H[X] = E[h(X)] = −∑ p(x) log p(x)

h(x) = −log p(x)
information content of a single outcome
rare events carry more information than common events
the negation ensures that the information content is non-negative
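
A minimal numpy sketch (the distribution is illustrative):

import numpy as np

# entropy H[X] = -sum_x p(x) * log p(x)
p = np.array([0.5, 0.25, 0.25])

# information content of each outcome: rare events carry more information
h = -np.log2(p)    # in bits; use np.log for nats
H = np.sum(p * h)  # expected information content
print(H)           # 1.5 bits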

4
Q

loss functions

A

assess the performance of a learning algorithm
compute the loss between the predicted and the true label/value

the user defines the loss function
binary classification: 0-1 loss
multiclass classification: 0-1 loss
regression: square loss / absolute loss
structured prediction: Hamming loss
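
Hedged sketches of the losses listed above (function names are mine, not from the course):

import numpy as np

def zero_one_loss(y_true, y_pred):  # classification
    return float(y_true != y_pred)

def square_loss(y_true, y_pred):    # regression
    return (y_true - y_pred) ** 2

def absolute_loss(y_true, y_pred):  # regression
    return abs(y_true - y_pred)

def hamming_loss(y_true, y_pred):   # structured prediction:
    # fraction of positions where the predicted structure differs
    return np.mean(np.array(y_true) != np.array(y_pred))

print(zero_one_loss(1, 0))                 # 1.0
print(hamming_loss([1, 0, 1], [1, 1, 1]))  # ~0.333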

5
Q

risks

A

aka generalization performance / testing error

expected value of the loss function (average of the loss over all possible outcomes, weighted by their probabilities)

expected risk: expected loss between output and prediction; a deterministic function of the predictor (expectation over the data distribution)

empirical risk: training error; a random function (it depends on the sampled training data)
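
A minimal sketch of the empirical risk as an average of losses over training pairs (names illustrative):

import numpy as np

# the empirical risk is the average loss on the training set; it is a random
# quantity because it depends on the sampled data
def empirical_risk(loss, y_true, y_pred):
    return np.mean([loss(t, p) for t, p in zip(y_true, y_pred)])

square = lambda t, p: (t - p) ** 2
print(empirical_risk(square, [1.0, 2.0], [1.5, 2.0]))  # 0.125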

6
Q

bayes risk

A

minimum possible expected loss (risk)

What is the best prediction function f(x), regardless of the training data?

minimize R(f) over the predictor f by setting f(x’) equal to a z ∈ Y that minimizes the conditional risk r(z|x’) independently for all x’

r(z|x’) = E_y[ l(y,z) | x=x’ ]
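
A hedged sketch on a toy discrete problem with an assumed known conditional p(y|x), minimizing the conditional risk pointwise:

import numpy as np

# the Bayes predictor minimizes r(z|x') = E_y[l(y, z) | x = x']
# independently for every x'
p_y_given_x = {0: [0.9, 0.1],  # P(y=0|x), P(y=1|x) for x = 0
               1: [0.3, 0.7]}  # ... for x = 1

def loss(y, z):  # 0-1 loss
    return float(y != z)

def bayes_predict(x):
    cond = p_y_given_x[x]
    # conditional risk of each candidate decision z
    r = [sum(cond[y] * loss(y, z) for y in (0, 1)) for z in (0, 1)]
    return int(np.argmin(r))

print(bayes_predict(0), bayes_predict(1))  # 0 1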

7
Q

bayes predictors

A

the expected risk is minimized at a Bayes predictor; the Bayes predictor achieves the Bayes risk

the Bayes predictor is not always unique, but the value of the Bayes risk is the same for all Bayes predictors

the Bayes risk usually cannot be driven to zero, due to noise in the data

8
Q

Excess Risk

A

the difference between the risk of a given decision rule or predictor f and the Bayes risk: R(f) − R*

9
Q

Expected “0-1” loss for classification

A

Bayes predictor for Y = {0, 1}
for the 0-1 loss, the Bayes predictor is the maximum a posteriori (MAP) classifier:
the decision rule that selects the class with the highest posterior probability P(y|x)
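
A minimal sketch: in the binary case with 0-1 loss, the MAP rule reduces to thresholding the posterior at 1/2:

def map_classifier(p_y1_given_x):
    # select the class with the larger posterior probability
    return 1 if p_y1_given_x > 0.5 else 0

print(map_classifier(0.8))  # 1
print(map_classifier(0.2))  # 0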

10
Q

Empirical Risk Minimization (ERM):

A

find the best model or predictor by minimizing the empirical risk (= the average loss over a given set of training data)
by minimizing this risk, the model should also perform well on unseen data (generalization)
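
A minimal sketch of ERM over a tiny, assumed finite class of threshold classifiers with the 0-1 loss:

import numpy as np

X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0, 0, 1, 1])

def emp_risk(threshold):
    preds = (X > threshold).astype(int)
    return np.mean(preds != y)  # empirical 0-1 risk

# ERM: pick the hypothesis with the smallest average training loss
candidates = [0.2, 0.5, 0.8]
best = min(candidates, key=emp_risk)
print(best, emp_risk(best))     # 0.5 0.0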

11
Q

Risk decomposition

A

the risk is decomposed into an estimation error and an approximation error:

R(f̂) − R* = [R(f̂) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R*]
(estimation error: due to the finite training sample; approximation error: due to restricting predictors to the class F)

12
Q

Capacity Control

A

deals with controlling the complexity or expressive power of the model or hypothesis space

balances the trade-off between fitting the training data and ensuring good generalization; avoids overfitting, underfitting, or a wrong model caused by high bias

restrict the number of parameters or the norm of the weights

techniques: model selection, regularization, early stopping, cross-validation
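
A minimal sketch of norm-based capacity control, here an assumed ridge-style penalty added to the empirical risk:

import numpy as np

# lam trades data fit against model complexity (the norm of the weights)
def regularized_objective(w, X, y, lam):
    data_fit = np.mean((X @ w - y) ** 2)  # empirical risk (square loss)
    penalty = lam * np.sum(w ** 2)        # norm restriction on the weights
    return data_fit + penalty

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
print(regularized_objective(np.array([1.0, 2.0]), X, y, lam=0.1))  # 0.5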

13
Q

Statistical Learning Theory

A

provides guarantees for the performance of a predictor on unseen data and helps understand the factors affecting its generalization ability

key concepts: empirical risk minimization, capacity control, risk decomposition, bias-variance trade-off

14
Q

Fixed design vs random design

A

fixed design:
the input data is assumed not to be random; the goal is a small prediction error on that data only; the predictor variables are fixed

random design:
both input and output are random; the goal is good generalization; the predictor variables are random variables, assumed to be drawn from a distribution

15
Q

Statistical properties of the OLS estimator

A

unbiasedness
among all linear unbiased estimators, it has the smallest variance (Gauss-Markov theorem)
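
A minimal numpy sketch of the OLS estimator via the normal equations (the data is simulated):

import numpy as np

# w_hat = (X^T X)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=100)  # noisy linear model

w_hat = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
print(w_hat)                                 # close to [1, -2]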

16
Q

difference between dimension-dependent and dimension-independent excess RLS risk

A

improved generalization: dimension-independent bounds adapt better to different dimensionalities, since they do not degrade as the dimension grows

17
Q

explain kernels, popular kernels

A

no explicit feature transform -> instead use a (kernel) function that returns the inner product directly

allows us to apply linear classifiers to non-linear problems by mapping non-linear data into a higher-dimensional space

kernels measure the similarity or inner product between two data points in a high-dimensional space

popular kernels: uniform, linear, polynomial/quadratic/Epanechnikov, Gaussian
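
Hedged sketches of some of the kernels named above, as functions of two points x and z (parameter defaults are mine):

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):  # a.k.a. RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))
# 0.0 1.0 0.367...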

18
Q

SVM

A

find the separating hyperplane that maximizes the margin, i.e. the smallest perpendicular distance from the hyperplane to the closest data points (the support vectors)
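
A minimal sketch computing the geometric margin of a given hyperplane (SVM training would maximize this quantity over w and b; the data here is illustrative):

import numpy as np

# margin of the hyperplane w.x + b = 0: smallest perpendicular distance
# from the hyperplane to any data point
def margin(w, b, X, y):
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])
print(margin(np.array([1.0, 0.0]), 0.0, X, y))  # 2.0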

19
Q

rademacher complexity

A

quantifies the ability of a function class to fit random noise; often used in statistical learning theory to derive generalization bounds

expectation of the maximal dot product between a and the noise epsilon

can be used to derive data-dependent upper bounds on the learnability of function classes; a function class with smaller Rademacher complexity is easier to learn
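
A hedged Monte-Carlo sketch estimating the empirical Rademacher complexity of an assumed finite set A:

import numpy as np

# E_eps[ max_{a in A} (1/n) <a, eps> ] with eps uniform in {-1, +1}^n
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0, -1.0]])  # two "functions" on n = 4 points
n = A.shape[1]

samples = []
for _ in range(10000):
    eps = rng.choice([-1.0, 1.0], size=n)  # Rademacher noise
    samples.append(np.max(A @ eps) / n)    # best correlation with the noise
print(np.mean(samples))                    # estimated complexity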

20
Q

Lipschitz constant

A

a measure of how much a function can change with respect to a change in its input:
|f(x) − f(y)| ≤ L · |x − y| for all x, y

provides certain guarantees on the stability and smoothness of the function

21
Q

When are sparse methods useful?

A

when the feature vector phi(x) is large
assume that only a small number of features are relevant

22
Q

Which two types of sparsity inducing regularization terms do you know?

A

l0 penalty: non-convex, counts the number of non-zero elements in the weight vector
l1 penalty: convex, sums the absolute values of the weights
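
A minimal sketch of both penalties on an illustrative weight vector w:

import numpy as np

w = np.array([0.0, 3.0, 0.0, -1.0])
l0 = np.count_nonzero(w)  # number of non-zero weights (non-convex)
l1 = np.sum(np.abs(w))    # sum of absolute values (convex)
print(l0, l1)             # 2 4.0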

23
Q

LASSO

A

least absolute shrinkage and selection operator: l1-regularized least squares,
min_w ||y − Xw||² + λ ||w||_1

24
Q

What is the advantage of the l1 regularized model over the l0 regularized model?

A

convexity -> a global minimum can be found
efficiency: l1 problems can be solved efficiently, while l0-regularized problems are NP-hard
stability: less prone to overfitting

25
Q

FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)

A

accelerated version of Proximal Gradient Descent that uses a momentum term to speed up convergence
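
A hedged sketch of FISTA applied to an assumed lasso objective ||Xw − y||²/2 + λ||w||_1 (step size and data are illustrative):

import numpy as np

# proximal operator of the l1 penalty: soft-thresholding
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(X, b, lam, n_iter=200):
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    z, t = w.copy(), 1.0           # extrapolated point and momentum scalar
    for _ in range(n_iter):
        grad = X.T @ (X @ z - b)                      # gradient of smooth part
        w_new = soft_threshold(z - grad / L, lam / L) # proximal gradient step
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = w_new + ((t - 1) / t_next) * (w_new - w)  # momentum extrapolation
        w, t = w_new, t_next
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1]                 # only two relevant features
print(np.round(fista(X, y, lam=1.0), 2))  # sparse: mass on the first two weights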