Topic 10: Rademacher Complexity: a measure of capacity Flashcards
What is Rademacher Complexity
A concept that helps us measure how well our model will do on new, unseen data without being tied to any specific training method
What can be said about the space of all possible f that an algorithm is effectively searching through
It can be infinite-dimensional space
Whilst the space of its parameters is not
How can we define F in terms of the algorithm
F ∶= the set of all possible models(predictors) mapping Rd → Rk,
that a given training algorithm is searching over
What is ALG
Any chosen training algorithm which takes as input the 3−tuple (ℓ,Sn, F) and searches through the above F for the “best” predictor
What is inf f∈F R(f)
The infimum (greatest lower bound = lowest lower bound) of the loss function
It is precisely want we want the loss function to be - as low as possible
What randomness do we account for
The randomness inherent in the training algorithm and introduced by sampling the training dataset from the entire dataset
There may be other sources of randomness that we’re not taking into account
What is stochasticity
Randomness or unpredictability in a process or system
What is arginf
The argument of the infimum. In mathematical terms, it represents the input value(s) that correspond to the infimum (the greatest lower bound) of a given function
ferm is the arginf of Rˆ(f,Sn)
How to we account for all randomness in ALG
Replacing ferm with ALG and average over all possible random outcomes of the algorithm to get a more comprehensive measure of risk
What do we know about R(ALG(ℓ,Sn, F))
Not much
it is a random variable that we do not know how to control
What is the worst generalisation gap between population and empirical risk
E(xi,yi)∼D [sup (Rˆ(f,Sn) − R(f))]
Trying to control this instead frees us from the details of any particular ML trying to find arginfR(f)
We instead are focused on the broader concept of how choices of predictors F and datasets D interact to find the best predictor f
What is Empirical Rademacher Complexity
R^n(H) =
Eϵ [sup 1/n ∑ϵi h (zi) ]
The empirical Rademacher complexity tells us how well the functions in H can adapt to random fluctuations in the dataset
It checks how much these functions change when we add random noise, and then taking the average of the maximum change across all functions in H (ϵi is the random noise)
What is Average Rademacher Complexity
Rn(H) =
E Sn∼Dn (Eϵ [sup 1/n ∑ϵi h(zi) ] )
A way to measure the complexity of the function class H when considering all possible datasets drawn from the distribution D
note z = (x,y) and h(z) = ℓ(y, f(x))
What does Sn ∼ Dn mean
The elements of Sn are sampled independently and identically from a common distribution D
What is the Expectation of the Worst Generalisation Gap equation
ESn∼Dn [sup 1/n ∑h(zi) − Ez∼D [h(z)] ) ] ≤ 2Rn(H)
The theorem is about understanding how well the functions in H generalize to unseen data compared to how they perform on the data they were trained on