Topic 11: Bounding Rademacher Complexity Flashcards
What is Massart’s Finite Class Lemma
It provides a bound on the empirical Rademacher complexity of a finite class of functions
The lemma states that if H is a finite set of functions and M is the maximum 2-norm of the evaluation vectors (h(x_1), …, h(x_n)) over h ∈ H, then the empirical Rademacher complexity of H is at most M√(2 log|H|) / n
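A minimal Monte Carlo sketch of the lemma on a toy finite class (the Gaussian evaluation vectors and all variable names are illustrative assumptions):
    # Toy check of Massart's lemma: Monte Carlo estimate of the empirical
    # Rademacher complexity of a finite class vs the bound M*sqrt(2*ln|H|)/n.
    import numpy as np

    rng = np.random.default_rng(0)
    n, num_funcs, num_trials = 50, 20, 5000

    # Rows are the evaluation vectors (h(x_1), ..., h(x_n)) for each h in H.
    evals = rng.normal(size=(num_funcs, n))
    M = np.linalg.norm(evals, axis=1).max()      # largest 2-norm of an evaluation vector

    sigmas = rng.choice([-1.0, 1.0], size=(num_trials, n))
    # E_sigma[ max_h (1/n) * sum_i sigma_i h(x_i) ]
    estimate = np.mean(np.max(sigmas @ evals.T, axis=1)) / n

    bound = M * np.sqrt(2 * np.log(num_funcs)) / n   # Massart's finite class lemma
    print(f"MC estimate: {estimate:.4f}   Massart bound: {bound:.4f}")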
How can Massart's Lemma be applied to linear regression
The weight vectors are quantized, so the set of linear predictors (and their associated loss functions) becomes a finite class H
Massart's lemma then bounds the empirical Rademacher complexity of this finite class, which in turn bounds the expected risk
This gives a theoretical guarantee on the performance of these models
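A rough illustration of why the quantized class is finite (the grid spacing and all constants below are assumptions, not the course's exact construction): quantizing each of the d weight coordinates to k grid values gives |H| = k^d, so log|H| = d log k enters Massart's bound:
    # Illustrative only: size of a quantized class of linear predictors and the
    # resulting Massart bound (grid spacing delta and bound M are assumed values).
    import numpy as np

    d, B, delta = 10, 1.0, 0.01      # dimension, 2-norm bound on weights, grid spacing
    k = int(2 * B / delta) + 1       # grid points per coordinate in [-B, B]
    log_H = d * np.log(k)            # log|H| for the k^d quantized weight vectors

    n, M = 1000, 5.0                 # sample size and assumed bound on evaluation norms
    print(M * np.sqrt(2 * log_H) / n)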
What is Jensen's lemma (v1)
Let F: R -> R be a convex function and let X be a real-valued random variable
F(E[X]) ≤ E[F(X)]
What is Jensen's lemma (v2)
Let F: R -> R be a convex function, let X be an n-dimensional random vector, and let T: R^n -> R be a given function. Then:
F(E[T(X)]) ≤ E[F(T(X))]
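A quick numerical check of the inequality with the convex function F(t) = t^2 (a toy example, not from the notes):
    # Jensen's inequality: F(E[X]) <= E[F(X)] for convex F, here F(t) = t**2.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(loc=1.0, scale=2.0, size=100_000)

    lhs = np.mean(X) ** 2     # F(E[X])
    rhs = np.mean(X ** 2)     # E[F(X)]
    print(lhs, rhs, lhs <= rhs)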
What does it mean for a weight vector to be bounded by its 2-norm
The 2-norm of w is:
∥w∥2 = √(w1^2 + w2^2 + … + wd^2)
If it is bounded then:
∥w∥2 ≤ B
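A small sketch (values chosen purely for illustration) that checks the bound and rescales w onto the ball {w : ∥w∥2 ≤ B} when it is violated:
    import numpy as np

    B = 1.0
    w = np.array([0.9, -0.8, 0.5])
    norm = np.linalg.norm(w)       # sqrt(w1^2 + ... + wd^2)
    if norm > B:
        w = w * (B / norm)         # rescale so that ||w||_2 = B
    print(norm, np.linalg.norm(w))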
What is important about the Rademacher complexity of 2-norm bounded linear predictors
The bound on the Rademacher complexity does not scale with the dimension of the data, d
What does the Rademacher complexity of 2-norm bounded linear predictors depend on
The norm bound B on the weights (together with the data-norm constant C and the sample size n in the bound below), not the dimension d
What happens to the average Rademacher and empirical Rademacher complexity when n goes to infinity
They vanish
(at different rates)
This decrease implies that as we gather more data, the generalisation gap tends to narrow
What is the general equation for the Rademacher complexity of 2-norm bounded linear predictors
Rn(F) ≤ B · C / √n
B = 2-norm bound on the weights of the linear functions
C = constant that bounds the expected squared norm of the data points (i.e. E[∥x∥2^2] ≤ C^2)
n = number of data points
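A Monte Carlo sketch of this bound on toy Gaussian data (all constants are illustrative). For the class {x ↦ ⟨w, x⟩ : ∥w∥2 ≤ B} the supremum over w can be taken in closed form, so the empirical Rademacher complexity equals (B/n) E_σ∥Σ_i σ_i x_i∥2:
    # Empirical Rademacher complexity of 2-norm bounded linear predictors vs B*C/sqrt(n),
    # with C^2 taken as the average squared norm of the (toy) data points.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d, B, num_trials = 200, 50, 1.0, 2000

    X = rng.normal(size=(n, d))
    C = np.sqrt(np.mean(np.linalg.norm(X, axis=1) ** 2))

    sigmas = rng.choice([-1.0, 1.0], size=(num_trials, n))
    estimate = B * np.mean(np.linalg.norm(sigmas @ X, axis=1)) / n

    print(f"MC estimate: {estimate:.4f}   bound: {B * C / np.sqrt(n):.4f}")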
How can we use the Rademacher complexity of 2-norm bounded linear predictors to show that size is not always what matters
The bound does not depend on the dimension d, so increasing the number of parameters does not by itself widen the generalisation gap
What is the generalisation gap
difference between the performance of the model on the training data and its performance on the test data
small generalisation gap -> the model has successfully learned to generalise from the training data to unseen data
large gap -> overfitting
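In code the gap is just a subtraction (numbers made up for illustration):
    train_loss, test_loss = 0.12, 0.31   # hypothetical measured losses
    gap = test_loss - train_loss         # generalisation gap
    print(gap)                           # a large value suggests overfitting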
What is the generalisation error
Difference between the training error and test error
What does it mean for a neural network to become wider
Its hidden layers have more neurons, and therefore the network has more parameters
What is surprising about increasingly wide nets (property 1)
The (estimated) Lipschitz constant of the trained nets will neither be very small nor be decreasing
– might even be moderately increasing as the number of parameters increases.
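One common way such an estimate is formed (an illustration, not necessarily the estimator used in the lectures): for 1-Lipschitz activations such as ReLU, the product of the layers' spectral norms upper-bounds the network's Lipschitz constant:
    # Upper bound on the Lipschitz constant of a ReLU net via spectral norms
    # (layer widths and random weights below are assumed, purely for illustration).
    import numpy as np

    rng = np.random.default_rng(3)
    widths = [20, 512, 512, 1]           # input, two hidden layers, output
    weights = [rng.normal(size=(widths[i + 1], widths[i])) / np.sqrt(widths[i])
               for i in range(len(widths) - 1)]

    upper_bound = 1.0
    for W in weights:
        upper_bound *= np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
    print(f"Lipschitz upper bound (product of spectral norms): {upper_bound:.2f}")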
What is surprising about increasingly wide nets (property 2)
Such good nets can be found with as many parameters as
data-dimension × number-of-training-data = d × n, or even more
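For a rough sense of scale (numbers chosen purely for illustration): with MNIST-sized inputs, d = 784 and n = 60,000 give d × n ≈ 4.7 × 10^7, so this regime already means tens of millions of parameters.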