Learning Model Flashcards
Introduce the supervised learning problem (6 key points).
Supervised learning problem:
- Domain set X, which is the set of all possible objects to make predictions about; a domain point x in X is called an instance and is usually a vector of features in R^n
- Label set Y, which is the set of all possible labels
- Training set S, a sequence of m labeled points (x_i, y_i) in X x Y
- Learner's output h : X -> Y, which is the prediction rule
- Data-generation model: instances are generated by some probability distribution D and labeled according to a labeling function f : X -> Y (both unknown)
- Measures of success: the error of a classifier is the probability that it does not predict the correct label on a random data point generated by D
The loss is therefore the probability of drawing an example x ~ D for which h(x) != f(x), i.e. L_{D,f}(h) = P_{x~D}[h(x) != f(x)] (estimated by sampling in the sketch below).
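As a quick illustration, here is a minimal Monte Carlo sketch that estimates L_{D,f}(h) by sampling; the distribution, labeling function, and classifier are all made up for the example:

```python
import random

# Toy setup (illustrative, not from the cards):
# D is uniform over {0,...,9}; f labels even numbers 1, odd numbers 0.
def sample_D():
    return random.randint(0, 9)      # draw an instance x ~ D

def f(x):
    return 1 if x % 2 == 0 else 0    # the (normally unknown) labeling function

def h(x):
    return 1 if x < 5 else 0         # some prediction rule to evaluate

# Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]
m = 100_000
errors = sum(h(x) != f(x) for x in (sample_D() for _ in range(m)))
print(errors / m)  # ≈ 0.4: h and f disagree on 4 of the 10 equally likely instances
```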
Discuss one approach to pursuing the main objective of the supervised learning problem, in terms of Generalization Error (= Expected Risk) and Training Error (= Empirical Risk).
One approach to solving a supervised learning problem is Empirical Risk Minimization (ERM). Given:
- Domain set X, which is the set of all possible objects to make predictions about; a domain point x in X is called an instance and is usually a vector of features in R^n
- Label set Y, which is the set of all possible labels
- Training set S, a sequence of m labeled points (x_i, y_i) in X x Y
we need to define a hypothesis class H, which contains the candidate prediction rules h : X -> Y, and a loss function l : H x Z -> R+ with Z = X x Y, which, given a hypothesis h in H and an example (x, y), measures how much we lose by predicting h(x) instead of the correct value y.
The goal is then to find a hypothesis h* in H that minimizes the generalization error L_{D,f}(h),
but since the probability distribution D and the labeling function f are unknown, we instead minimize the training error (empirical risk) L_S(h) = |{i : h(x_i) != y_i}| / m, i.e. the fraction of wrong predictions on the training set (see the ERM sketch below).
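A minimal ERM sketch over a finite hypothesis class of threshold rules; the class and the training set are toy assumptions for illustration:

```python
# ERM over a finite hypothesis class H of threshold classifiers (toy example).
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]   # training set of m labeled points
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]

def empirical_risk(h, S):
    # L_S(h) = (# wrong predictions) / (# training samples)
    return sum(h(x) != y for x, y in S) / len(S)

h_erm = min(H, key=lambda h: empirical_risk(h, S))  # pick a minimizer of L_S over H
print(empirical_risk(h_erm, S))                     # 0.0: the threshold t = 0.5 fits S perfectly
```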
Describe the regression task and classification task and their differences.
Answer as for the ERM approach, but add:
In the regression task we define Y = R: the hypothesis predicts a real value (typically judged by the squared loss).
In the binary classification task we define Y as the set of the two possible labels, e.g. {0, 1}: the hypothesis predicts one of them (typically judged by the 0-1 loss).
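Concretely, the two tasks differ in the type of the hypothesis's output; a tiny sketch (both rules are made up):

```python
# Regression: Y = R, the hypothesis predicts a real value.
h_reg = lambda x: 2.0 * x + 0.5        # h : R -> R
print(h_reg(1.2))                      # 2.9

# Binary classification: Y = {0, 1}, the hypothesis predicts one of two labels.
h_clf = lambda x: int(x >= 0.0)        # h : R -> {0, 1}
print(h_clf(-0.3), h_clf(1.2))         # 0 1
```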
Definition of PAC learnability
A hypothesis class H is PAC learnable if there exist a function m_H : (0,1)^2 -> N and a learning algorithm such that: for every epsilon, delta in (0,1), for every distribution D over X, and for every labeling function f : X -> {0,1}, if the realizability assumption holds w.r.t. H, D, f, then when running the learning algorithm on m >= m_H(epsilon, delta) i.i.d. samples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability >= 1 - delta (over the choice of the examples), L_{D,f}(h) <= epsilon.
epsilon is the accuracy parameter: how far the output classifier can be from the optimal one
delta is the confidence parameter: how likely the classifier is to meet that accuracy requirement
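The same definition, transcribed compactly in symbols (a direct restatement of the card above, nothing new):

```latex
\[
\exists\, m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}\ \text{and an algorithm } A\ \text{s.t.}\
\forall \epsilon, \delta \in (0,1),\ \forall D \text{ over } X,\
\forall f : X \to \{0,1\}\ \text{satisfying realizability w.r.t. } \mathcal{H}, D, f:
\]
\[
m \ge m_{\mathcal{H}}(\epsilon,\delta) \;\Longrightarrow\;
\Pr_{S \sim D^m}\!\big[\, L_{D,f}(A(S)) \le \epsilon \,\big] \ge 1 - \delta .
\]
```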
Sample complexity for PAC learnability of finite hypotheses classes
The sample complexity indicates how many examples are required to guarantee a probably approximately correct (PAC) solution.
Every finite hypothesis class H is PAC learnable with sample complexity m_H(epsilon, delta) <= ceil( log(|H| / delta) / epsilon ), where log is the natural logarithm (numeric check below).
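A quick numeric check of the bound (the function name and the numbers are illustrative):

```python
from math import ceil, log

def sample_complexity(H_size, epsilon, delta):
    # m_H(epsilon, delta) <= ceil( ln(|H| / delta) / epsilon )
    return ceil(log(H_size / delta) / epsilon)

# e.g. |H| = 1000 hypotheses, accuracy epsilon = 0.1, confidence 95% (delta = 0.05):
print(sample_complexity(1000, 0.1, 0.05))   # 100 examples suffice
```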
Definition of Agnostic PAC learnability for general loss functions
H is agnostic PAC learnable w.r.t. a set Z and a loss function l : H x Z -> R+ if there exist a function m_H : (0,1)^2 -> N and a learning algorithm such that: for every epsilon, delta in (0,1) and for every distribution D over Z, when running the learning algorithm on m >= m_H(epsilon, delta) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability >= 1 - delta (over the choice of the m training examples):
L_D(h) <= min_{h' in H} L_D(h') + epsilon
where L_D(h) = E_{z~D}[l(h, z)]
Definitions of general loss function, true and empirical error.
Common loss functions
Prove that the expectation (over the distribution of the training set) of L_S(h) is equal to L_D(h).
General loss function: any function l : H x Z -> R+, where H is any hypothesis class and Z is some domain of examples
True error (risk function): the expected loss of a hypothesis h in H w.r.t. a distribution D over Z: L_D(h) = E_{z~D}[l(h, z)]
Empirical error (empirical risk): the average loss over a given sample S = (z_1, ..., z_m) in Z^m: L_S(h) = (1/m) * sum_{i=1}^{m} l(h, z_i)
Common loss functions:
- 0-1 loss: Z = X x Y, used in binary or multiclass classification:
l_{0-1}(h, (x, y)) = 0 if h(x) = y, 1 if h(x) != y
- Squared loss: Z = X x Y, used in regression:
l_{sq}(h, (x, y)) = (h(x) - y)^2
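The two losses written out as code, following the l(h, (x, y)) signature above (the hypotheses and data points are illustrative):

```python
# 0-1 loss: used in binary or multiclass classification.
def loss_01(h, example):
    x, y = example
    return 0 if h(x) == y else 1

# Squared loss: used in regression.
def loss_sq(h, example):
    x, y = example
    return (h(x) - y) ** 2

print(loss_01(lambda x: int(x > 0), (2.0, 1)))   # 0: correct label
print(loss_sq(lambda x: 2.0 * x, (1.5, 3.2)))    # (3.0 - 3.2)^2 ≈ 0.04
```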
E_S[L_S(h)] = E_S[(1/m) SUM_i l(h, z_i)] = (1/m) SUM_i E_{z_i~D}[l(h, z_i)] = (1/m) * m * L_D(h) = L_D(h),
using linearity of expectation and the fact that each z_i is drawn i.i.d. from D, so E_{z_i~D}[l(h, z_i)] = L_D(h) for every i.
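A numerical check of this identity on a toy setup (D, f, and h are all made-up assumptions): averaging L_S(h) over many independently drawn training sets should approach L_D(h).

```python
import random

f = lambda x: 1 if x % 2 == 0 else 0   # toy labeling function
h = lambda x: 1 if x < 5 else 0        # fixed classifier to evaluate
# D uniform over {0,...,9}: the true error can be computed exactly here.
L_D = sum(h(x) != f(x) for x in range(10)) / 10   # = 0.4

m, trials = 5, 20_000
def L_S():
    S = [random.randint(0, 9) for _ in range(m)]  # S ~ D^m
    return sum(h(x) != f(x) for x in S) / m

avg = sum(L_S() for _ in range(trials)) / trials
print(L_D, avg)   # the average of L_S(h) over many samples is ≈ L_D(h) = 0.4
```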