ML 2 VO - Deck 1 Flashcards
Expectation
the average value of a random variable
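in symbols: E[X] = Σ_x x p(x) for a discrete random variable, E[X] = ∫ x p(x) dx for a continuous one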
decision theory
What is the optimal performance of a machine learning algorithm given perfect knowledge of the underlying data distribution?
entropy
H[X]
the average information content or the expected amount of information from a random variable X
h(x) = −log p(x)
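taking the expectation of h gives, for a discrete X: H[X] = E[h(X)] = −Σ_x p(x) log p(x)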
information content
rare events carry more information than common events
the negation ensures that the information content is non-negative (p(x) ≤ 1, so log p(x) ≤ 0)
loss functions
assess the performance of a learning algorithm
computes the loss between the predicted and the true label/value
the user defines the loss function
binary classification: 0-1 loss
multiclass classification: 0-1 loss
regression: square loss/absolute loss
structured prediction: Hamming loss
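A minimal sketch of two of these losses in Python (NumPy arrays assumed; the function names are illustrative, not from any library):

import numpy as np

def zero_one_loss(y_true, y_pred):
    # 1 where the predicted label differs from the true label, else 0
    return (y_true != y_pred).astype(float)

def square_loss(y_true, y_pred):
    # squared difference between true value and prediction (regression)
    return (y_true - y_pred) ** 2

# average loss over a small batch of predictions
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
print(zero_one_loss(y_true, y_pred).mean())  # 0.25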
risks
aka generalization performance / testing error
expected value of the loss function (the average of the loss over all possible outcomes, weighted by their probabilities)
expected risk: expected loss between the true output and the prediction (a deterministic function of the predictor)
empirical risk: training error (a random function, since it depends on the sampled training data)
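in formulas, with loss l and predictor f: expected risk R(f) = E_(x,y)[ l(y, f(x)) ]; empirical risk R̂(f) = (1/n) Σ_i l(y_i, f(x_i)) over n training samples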
bayes risk
minimum possible expected loss (risk)
What is the best prediction function f(x) regardless of the data?
R(f) is minimized for the predictor f by setting f(x’) equal to a z ∈ Y that minimizes r(z|x’), independently for each x’
r(z|x’) = E_y[ l(y,z) | x=x’]
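so the Bayes predictor can be written pointwise as f*(x’) ∈ argmin_{z ∈ Y} r(z|x’)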
bayes predictors
the expected risk is minimized at a Bayes predictor; the Bayes predictor achieves the Bayes risk
a Bayes predictor is not always unique, but the value of the Bayes risk is the same for all Bayes predictors
the Bayes risk usually cannot be reduced to zero, due to noise in the data (see below)
Excess Risk
the difference between the risk of a given decision rule or predictor and the Bayes risk
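in symbols: excess risk of f = R(f) − R*, where R* is the Bayes risk; it is always ≥ 0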
Expected “0-1” loss for classification
Bayes predictor for Y = {0, 1}
The corresponding Bayes predictor for the 0-1 loss is the Maximum A Posteriori (MAP) classifier:
a decision rule that selects the class with the highest posterior probability P(y|x)
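for Y = {0, 1} and the 0-1 loss this reads: f*(x) = 1 if P(y=1|x) ≥ 1/2, else 0; equivalently f*(x) = argmax_y P(y|x)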
Empirical Risk Minimization (ERM):
find the best model or predictor by minimizing the empirical risk (=average loss over a given set of training data)
by minimizing this risk, the model should also perform well on unseen data (generalization)
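A minimal ERM sketch in Python, assuming a toy hypothesis class of 1-D threshold classifiers (the data and the threshold grid are illustrative):

import numpy as np

def empirical_risk(threshold, x, y):
    # average 0-1 loss of the classifier "predict 1 if x >= threshold"
    preds = (x >= threshold).astype(int)
    return np.mean(preds != y)

# toy 1-D training data
x = np.array([0.1, 0.35, 0.4, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])

# ERM: pick the threshold that minimizes the empirical risk
candidates = np.linspace(0, 1, 101)
risks = [empirical_risk(t, x, y) for t in candidates]
best = candidates[int(np.argmin(risks))]
print(best, min(risks))  # a threshold with zero training error here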
Risk decomposition
the excess risk is decomposed into an estimation error and an approximation error
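for a predictor f̂ chosen from a model class F: R(f̂) − R* = [R(f̂) − inf_{f∈F} R(f)] + [inf_{f∈F} R(f) − R*], i.e. estimation error + approximation error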
Capacity Control
Deals with controlling the complexity or expressive power of the model or hypothesis space
balance the trade-off between fitting the training data and ensuring good generalization; avoid overfitting, underfitting, or a wrong model caused by high bias
restrict the number of parameters or the norm of the parameters
model selection, regularization, early stopping, cross-validation
Statistical Learning Theory
provide guarantees for the performance of a predictor on unseen data and understand the factors affecting its generalization ability
empirical risk minimization, capacity control, risk decomposition, bias-variance trade-off
Fixed design vs random design
fixed design:
assumes the input data is not random; aims for a small prediction error on that data only; the predictor variables are fixed
random design:
both input and output are random; the goal is good generalization; the predictor variables are random variables, assumed to be drawn from a distribution
Statistical properties of the OLS estimator
unbiasedness: the expected value of the OLS estimator equals the true parameter
Gauss–Markov: among all linear unbiased estimators, it has the smallest variance (it is BLUE)
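A minimal sketch of the OLS estimator β̂ = (XᵀX)⁻¹Xᵀy in Python (the data are simulated under the standard linear-model assumptions; the true coefficients are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
beta = np.array([1.0, -2.0, 0.5])  # true coefficients

X = rng.normal(size=(n, d))        # design matrix
y = X @ beta + rng.normal(size=n)  # linear model with Gaussian noise

# OLS estimate: solve the normal equations X^T X beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true beta; unbiased on average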