PAC Learning Flashcards
What is proof by contraposition?
To prove P ⇒ Q, prove the logically equivalent contrapositive ¬Q ⇒ ¬P instead.
What’s the intuition of the VC Dimension?
Even if you have infinitely many hypotheses in your hypothesis class, given some training sample, many of those hypotheses will look functionally identical with respect to that particular sample
What is h? What does it do?
- A hypothesis.
- Applied to some dataset S, generates a labeling of S
What is S?
The training dataset (the sample of labeled examples)
For the realizable setting, what are the relations between the bound and 1) epsilon, 2) abs(H)?
- Bound is inversely linear in epsilon (e.g. halving the error requires double the examples)
- Bound is only logarithmic in |H| (e.g. squaring the hypothesis space only roughly doubles the required examples)
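For concreteness, one common form of the realizable-case sample complexity bound (a sketch; the exact constants may differ from the course's statement):

```latex
% Realizable case: with probability at least 1 - \delta, every h \in H that is
% consistent with the training sample has true error at most \epsilon, provided
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
% Note the 1/\epsilon (inverse linear) and \ln|H| (logarithmic) dependence.
```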
For the agnostic setting, what are the relations between the bound and 1) epsilon, 2) abs(H)?
- Bound is inversely quadratic in epsilon (e.g. halving the error requires 4x the examples)
- Bound is only logarithmic in |H| (i.e. same as Realizable case)
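Likewise, a standard Hoeffding-style form of the agnostic bound (again a sketch; constants may differ from the lecture's version):

```latex
% Agnostic case: with probability at least 1 - \delta, every h \in H has true error
% within \epsilon of its training error, provided
m \;\ge\; \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{2}{\delta}\right).
% Note the 1/\epsilon^2 (inverse quadratic) dependence; the |H| dependence is still logarithmic.
```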
What is shattering?
A hypothesis class H shatters a set S if, for every possible labeling of S, some hypothesis in H classifies S consistently with that labeling (i.e. H can realize all 2^|S| labelings of S)
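A minimal brute-force sketch of this definition (not from the notes; the 1-D threshold class and helper names are illustrative assumptions):

```python
from itertools import product

def threshold_hypotheses(points):
    """1-D threshold classifiers h_t(x) = 1[x >= t], for a few candidate thresholds t."""
    candidates = sorted(points) + [max(points) + 1.0]
    return [lambda x, t=t: int(x >= t) for t in candidates]

def shatters(hypotheses, points):
    """True iff every possible labeling of `points` is realized by some hypothesis."""
    for labeling in product([0, 1], repeat=len(points)):
        if not any(all(h(x) == y for x, y in zip(points, labeling)) for h in hypotheses):
            return False
    return True

# Thresholds shatter any single point, but cannot shatter two points
# (the labeling "left point = 1, right point = 0" is unachievable):
print(shatters(threshold_hypotheses([3.0]), [3.0]))            # True
print(shatters(threshold_hypotheses([1.0, 2.0]), [1.0, 2.0]))  # False
```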
What is the VC-dimension?
Def: The VC-dimension (or Vapnik-Chervonenkis dimension) of ℋ is the cardinality of the largest set S such that ℋ can shatter S.
Equivalently (recitation phrasing): the VC dimension of a hypothesis space H is the maximum number of points such that there exists at least one arrangement of those points for which, for every labeling of that arrangement, some hypothesis h ∈ H is consistent with that labeling
When is the VC dimension infinity?
If ℋ can shatter arbitrarily large finite sets, then the VC-dimension of ℋ is infinity
To prove that VC(H) = some value M, what do you have to do?
- Show VC(H) ≥ M: exhibit a set of M points that H shatters (every labeling of those points is realized by some h ∈ H)
- Show VC(H) < M + 1: argue that no set of M + 1 points can be shattered by H
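As a worked instance (not from the notes), here is the two-step argument for intervals on the real line:

```latex
% Hypothesis class: h_{[a,b]}(x) = \mathbb{1}[a \le x \le b].
% Step 1 (VC >= 2): two points x_1 < x_2 are shattered, since the labelings
% (0,0), (1,0), (0,1), (1,1) are all realized by suitable choices of [a, b].
% Step 2 (VC < 3): no three points x_1 < x_2 < x_3 can be shattered, because the
% labeling (1, 0, 1) would force x_1, x_3 \in [a,b] but x_2 \notin [a,b],
% which no interval can do. Hence
\mathrm{VC}\bigl(\{h_{[a,b]}\}\bigr) = 2.
```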
What’s the VC dimension of linear separators in n dimensions?
n+1
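A quick sanity check for n = 2 (a standard fact, sketched here rather than taken from the notes):

```latex
\mathrm{VC}\bigl(\text{linear separators in } \mathbb{R}^2\bigr) = 2 + 1 = 3
% Three non-collinear points can be shattered (all 2^3 labelings are linearly separable),
% but no four points can be: e.g. the XOR-style labeling of four points in convex
% position (opposite corners sharing a label) is not linearly separable.
```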
What does ∃ mean?
There exists
(high level) What’s the corollary to Theorem 1 of the PAC theorem?
It gives a numerical (computable) bound on the true error
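One standard form of such a numerical bound (a Vapnik-style VC bound; the exact constants may differ from the corollary as stated in lecture):

```latex
% With probability at least 1 - \delta, for all h \in H:
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\mathrm{VC}(H)\left(\ln\frac{2m}{\mathrm{VC}(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}
% where R(h) is the true error, \hat{R}(h) the training error, and m the number of training examples.
```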
What’s the key idea that makes Corollary 4 of Theorem 1 of PAC learning useful?
- We want to trade off between low training error and keeping H simple (i.e. low VC(H))
- We can tune the lambda parameter of the regularizer to hopefully land at the sweet spot of that tradeoff curve
What are the practical ways we can tradeoff between low training error and keeping H simple? (1)
Use a regularizer
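A minimal sketch of what this looks like in practice, assuming L2-regularized logistic regression (the function name and data below are made up for illustration):

```python
import numpy as np

def l2_regularized_logistic_loss(w, X, y, lam):
    """Training log-loss plus an L2 penalty; lam trades off fit vs. hypothesis complexity."""
    logits = X @ w
    # Numerically stable logistic loss: log(1 + exp(z)) - y * z, with y in {0, 1}
    log_loss = np.mean(np.logaddexp(0.0, logits) - y * logits)
    penalty = lam * np.sum(w ** 2)  # larger lam pushes toward simpler (smaller-norm) hypotheses
    return log_loss + penalty

# Hypothetical usage: sweep lam and pick the value with the best validation error.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y = (X @ w_true > 0).astype(float)
for lam in [0.0, 0.01, 0.1, 1.0]:
    print(lam, round(l2_regularized_logistic_loss(w_true, X, y, lam), 3))
```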
What are discriminative models?
Models that directly model the conditional distribution p(y|x) (or the decision boundary itself), rather than the joint distribution p(x, y) as generative models do; e.g. logistic regression, SVMs
What’s a pmf?
p(x): Function giving the probability that a discrete random variable X takes value x, i.e. p(x) = P(X = x)
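A tiny worked example (a fair six-sided die):

```latex
p(x) = P(X = x) = \tfrac{1}{6}, \quad x \in \{1, 2, 3, 4, 5, 6\}, \qquad \sum_{x=1}^{6} p(x) = 1.
```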