Loss_Functions_Preference_Levels (MIT paper) Flashcards
1
Q
types of target labels
A
- discrete, ordered labels (current paper)
- binary labels (classification)
- discrete, unordered labels (multi-class classification)
2
Q
problem definition
A
- regression problem with discrete, ordered labels
- treat it as a generalization of binary regression (e.g. logistic regression), which is the special case with only 2 ordered labels: negative, positive
3
Q
solution layout
A
- learn a real-valued predictor z(x), as in binary linear regression
- minimize a loss function on the target labels: loss(z(x); y)
- define threshold-based and probabilistic generalizations of:
- logistic loss
- hinge loss
4
Q
experiment method
A
- use L2-regularized linear prediction that minimizes the trade-off between the overall training loss and the L2 norm of the weights:
J(w) = Sum_i [ loss(w·x_i + w_0; y_i) ] + (lambda/2) * ||w||^2
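A minimal NumPy sketch of this objective (the function name `objective`, the generic `loss` callable, and the data `X`, `y` are illustrative placeholders, not from the paper):

```python
import numpy as np

def objective(w, w0, X, y, loss, lmbd):
    """L2-regularized training objective J(w) for a linear predictor z = w.x + w_0.
    loss(z, y) is any margin-based loss; lmbd is the trade-off parameter."""
    z = X @ w + w0                         # real-valued predictions for all examples
    data_term = np.sum(loss(z, y))         # overall training loss
    reg_term = 0.5 * lmbd * np.dot(w, w)   # (lambda/2) * ||w||^2
    return data_term + reg_term
```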
5
Q
binary regression - zero-one loss
A
- threshold the real-valued predictor with sign(z(x)), for y in {-1, +1}:
loss(z; y) = 0 if yz > 0 (NO error is made)
loss(z; y) = 1 if yz <= 0 - counts the number of errors
- not convex, not continuous => hard to minimize
- insensitive to the magnitude of z and w => regularization ineffective: shrinking w, w_0 leaves the error unchanged while the regularization term goes to zero
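A short NumPy sketch of the zero-one loss on the margin yz (the name and vectorized form are my own, assuming y in {-1, +1}):

```python
import numpy as np

def zero_one_loss(z, y):
    """Zero-one loss: 0 when sign(z) agrees with y (yz > 0), 1 otherwise.
    Flat almost everywhere, so gradient-based minimization gets no signal."""
    return np.where(y * z > 0, 0.0, 1.0)
```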
6
Q
binary regression - margin loss
A
- addresses the magnitude insensitivity:
loss(z; y) = 0 if yz >= 1 (NO error is made)
loss(z; y) = 1 if yz < 1
- requiring y(w x + w_0) >= 1 (and dividing by |w|) means that minimizing both the loss and the regularization term is equivalent to:
maximizing the margin 1 / |w| while minimizing the number of misclassified points
- still not convex, not continuous => hard to minimize
- insensitive to the ERROR magnitude: every margin violation receives the same penalty
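A corresponding sketch of the margin (zero-one-with-margin) loss, again with an illustrative name and y assumed in {-1, +1}:

```python
import numpy as np

def margin_loss(z, y):
    """Margin loss: an error is counted whenever the margin yz falls below 1,
    regardless of how large the violation is."""
    return np.where(y * z >= 1, 0.0, 1.0)
```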
7
Q
binary regression - hinge loss
A
- convex, continuous alternative to the margin loss
- minimize the hinge function:
loss(z; y) = max(0, 1 - yz) =
= 0 if yz >= 1
= 1 - yz if yz < 1
- appears in SVMs as the constraint y(w x + w_0) >= 1 - eta, with eta the margin violation (slack)
- IMPORTANT: it is an upper bound on the zero-one classification error
- introduces a linear dependency on the ERROR magnitude (unavoidable for convex loss func)
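A minimal sketch of the hinge loss (illustrative name, y assumed in {-1, +1}):

```python
import numpy as np

def hinge_loss(z, y):
    """Hinge loss max(0, 1 - yz): zero for margins >= 1, then grows linearly
    with the margin violation; upper-bounds the zero-one error."""
    return np.maximum(0.0, 1.0 - y * z)
```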
8
Q
binary regression - smoothed hinge loss
A
- ‘smoothed’ loss function that is easier to minimize (continuous derivative):
loss(z; y) = 0 if yz >= 1 (NO error is made)
loss(z; y) = [(1 - yz)^2] / 2 if 0 < yz < 1
loss(z; y) = 0.5 - yz if yz <= 0
- introduces a linear dependency on the ERROR magnitude (unavoidable for convex loss func)
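A sketch of the smoothed hinge following the piecewise definition above (illustrative name, y in {-1, +1}):

```python
import numpy as np

def smoothed_hinge_loss(z, y):
    """Smoothed hinge: zero for yz >= 1, quadratic for 0 < yz < 1,
    linear (0.5 - yz) for yz <= 0; the first derivative is continuous."""
    m = y * z
    return np.where(m >= 1, 0.0,
                    np.where(m > 0, 0.5 * (1.0 - m) ** 2, 0.5 - m))
```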
9
Q
binary regression - logistic loss
A
loss(z; y) = log ( 1 + e ^ (-yz) ) = - log P(y | z)
- conditional log-likelihood loss: - log P(y | z) for the
logistic conditional likelihood model (estimator):
P(y | z) ~ e ^ (yz)
- minimizing Sum(loss(z_i; y_i)) ~ maximizing the conditional likelihood Prod P(y_i | z_i)
- with an L2 regularization term => MAP estimator with a Gaussian prior on w
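A sketch of the logistic loss, written with `np.logaddexp` for numerical stability (illustrative name, y in {-1, +1}):

```python
import numpy as np

def logistic_loss(z, y):
    """Logistic loss log(1 + exp(-yz)), the negative conditional log-likelihood
    of the logistic model; logaddexp(0, -yz) avoids overflow for large -yz."""
    return np.logaddexp(0.0, -y * z)
```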
10
Q
generalization loss function
A
- loss(z; y) is a penalty
- applied to the classification margin yz
- using a specific margin penalty function f(.)
- k ordinal levels separated by thresholds l_0 = -inf, l_1, ..., l_{k-1}, l_k = +inf
- the single threshold 0 is replaced with k-1 finite thresholds
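A small sketch of how the k-1 finite thresholds turn a real-valued prediction into an ordinal level (the function name and 1..k level numbering are my own conventions, not the paper's):

```python
import numpy as np

def predict_level(z, thresholds):
    """Map a real-valued prediction z to one of k ordinal levels using the
    sorted finite thresholds l_1 < ... < l_{k-1}; l_0 = -inf and l_k = +inf
    are implicit. Returns a level in 1..k."""
    return int(np.searchsorted(thresholds, z)) + 1
```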
11
Q
generalization - immediate-threshold
A
- for each labeled example (x, y) there is exactly one correct segment (l_{y-1}, l_y)
- penalty: loss(z; y) = f(z - l_{y-1}) + f(l_y - z), i.e. a penalty for crossing either boundary of the correct segment
- all errors are penalized equally, regardless of how far the predicted ordinal value is from the true one
- f(.) is the margin penalty function
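A sketch of the immediate-threshold construction under these assumptions: `thresholds` holds the k-1 finite values l_1..l_{k-1}, labels y are integers 1..k, and `f` is a one-argument margin penalty such as `lambda m: np.maximum(0.0, 1.0 - m)` (hinge):

```python
import numpy as np

def immediate_threshold_loss(z, y, thresholds, f):
    """Immediate-threshold loss: penalize only crossings of the two thresholds
    bounding the correct segment (l_{y-1}, l_y)."""
    l = np.concatenate(([-np.inf], thresholds, [np.inf]))  # l_0 .. l_k
    lower, upper = l[y - 1], l[y]      # boundaries of the correct segment
    return f(z - lower) + f(upper - z)
```

For hinge or logistic penalties the infinite boundary terms evaluate to 0, so the edge levels y = 1 and y = k are only penalized from one side.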
12
Q
generalization - all-threshold
A
- penalize predictions more the farther (in ordinal value) they are from the true label, using:
s(m; y) = -1 if m < y
s(m; y) = +1 if m >= y
loss(z; y) = Sum_{m:1..k-1} f [ s(m; y) (l_m - z) ]
- f(.) is the margin penalty function
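A sketch of the all-threshold loss under the same assumptions as above (`thresholds` = l_1..l_{k-1}, y in 1..k, `f` a one-argument margin penalty):

```python
import numpy as np

def all_threshold_loss(z, y, thresholds, f):
    """All-threshold loss: sum the margin penalty over every threshold l_m with
    sign s(m; y) = -1 for m < y and +1 for m >= y, so a prediction that lands
    farther from the correct segment crosses more thresholds and pays more."""
    m = np.arange(1, len(thresholds) + 1)   # threshold indices 1 .. k-1
    s = np.where(m < y, -1.0, 1.0)          # s(m; y)
    return float(np.sum(f(s * (np.asarray(thresholds) - z))))
```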