Introduction Flashcards
What is machine learning?
A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.
Is regression with the L1 or the L2 loss more affected by outliers?
The L2 loss, since squaring makes large residuals dominate: a residual of 10 contributes 100 to the L2 loss but only 10 to the L1 loss.
What can we do in the linear regression setting if XX^T isn't invertible (e.g., when we have more features than data points)?
Add a regularization term to the loss function, for example an L2 (ridge) penalty.
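A minimal numpy sketch of the fix (the data and the value of lambda are made up for illustration): with an L2 penalty, the matrix to invert becomes XX^T + lambda*I, which is invertible for any lambda > 0.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 50, 20                              # more features than data points: X X^T is singular
    X = rng.standard_normal((d, n))            # columns are data points, matching the X X^T convention above
    y = rng.standard_normal(n)

    lam = 0.1                                  # regularization strength (hypothetical value)
    w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)   # ridge solution, well defined for lam > 0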
What is the generalization gap?
The difference between the generalization (test) error and the training error.
Why do we usually not use the step (0-1) activation function?
It is hard to optimize with gradient-based methods, since the step function is not continuous and its gradient is zero everywhere it is defined.
What is the approximation error?
The error coming from the choice of function class: if the class is too "simple", it cannot approximate the underlying function perfectly.
What is estimation error (or generalization gap)?
The error from minimizing the empirical mean over the training data instead of the true expectation.
What is optimization error?
The error resulting from finding a local, rather than the global, minimizer (or from stopping the optimizer before convergence).
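These three errors are often combined into a single decomposition of the excess risk (a standard textbook decomposition; the notation is assumed, not from the cards). With f* the best possible predictor, f_F the best predictor in the model class F, f_hat the empirical risk minimizer, and f_tilde the model the optimizer actually returns:

    R(f_tilde) - R(f*) = [R(f_F) - R(f*)]        (approximation error)
                       + [R(f_hat) - R(f_F)]     (estimation error)
                       + [R(f_tilde) - R(f_hat)] (optimization error)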
Why would we use SGD instead of GD optimizer?
The computational and memory cost per update is much smaller for SGD. (The gradient noise in SGD can also lead to more exploration, e.g., escaping poor local minima or saddle points.)
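A minimal sketch of the difference for linear least squares (all names and numbers are hypothetical): GD uses the gradient over the full dataset each step, SGD only over a small random mini-batch.

    import numpy as np

    def grad(w, X, y):
        # gradient of 0.5 * mean((X @ w - y)**2); rows of X are data points
        return X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
    w, lr, batch = np.zeros(5), 0.1, 32

    for step in range(100):
        # GD would use the full data: w -= lr * grad(w, X, y)
        idx = rng.integers(0, len(y), size=batch)   # SGD: sample a mini-batch
        w -= lr * grad(w, X[idx], y[idx])           # much cheaper per update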
What is the advantage of backprop?
Backprop describes how to calculate the gradient in an algorithmic way, applying the chain rule layer by layer, and it avoids recalculating the same intermediate quantities several times.
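A tiny illustration of the idea on a hypothetical two-layer network with squared loss: the forward pass caches intermediate values (z, h, yhat), and the backward pass reuses them via the chain rule instead of deriving every gradient from scratch.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(3), 1.0
    W1, w2 = rng.standard_normal((4, 3)), rng.standard_normal(4)

    # forward pass: cache the intermediate values
    z = W1 @ x                             # pre-activation
    h = np.tanh(z)                         # hidden activation
    yhat = w2 @ h                          # network output
    loss = 0.5 * (yhat - y) ** 2

    # backward pass: the chain rule reuses the cached values
    dyhat = yhat - y                       # dL/dyhat
    dw2 = dyhat * h                        # dL/dw2, reuses h
    dh = dyhat * w2                        # dL/dh
    dz = dh * (1 - h ** 2)                 # dL/dz, tanh'(z) = 1 - tanh(z)^2 reuses h
    dW1 = np.outer(dz, x)                  # dL/dW1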
How can we guarantee that a minimum is a global minimum?
If the function is convex (for a twice-differentiable function, this is equivalent to the Hessian being positive semi-definite everywhere), then every local minimum is a global minimum.
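A small numerical sketch (made-up matrix): for a quadratic f(w) = 0.5 * w^T A w, checking convexity amounts to checking that the Hessian A has no negative eigenvalues.

    import numpy as np

    B = np.random.default_rng(0).standard_normal((4, 4))
    A = B @ B.T                                # B B^T is always positive semi-definite
    eigs = np.linalg.eigvalsh(A)               # Hessian of f(w) = 0.5 * w^T A w is A
    print(np.all(eigs >= -1e-10))              # True: f is convex, so any minimum is global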
Why does gradient descent work in neural networks, even though they are non-convex?
Because the difference in loss between the local minima that gradient descent finds and the global minimum is usually quite small in large networks.
How is the optimization affected by small and large step size?
Small step size: slower initial convergence, but a more accurate end result.
Large step size: faster initial convergence, but a less accurate end result (with a fixed step size the iterates may oscillate around, or even overshoot, the minimum); see the sketch below.
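A minimal sketch of the trade-off on f(x) = x^2 (gradient 2x), with made-up step sizes:

    # gradient descent on f(x) = x^2 with different fixed step sizes
    for lr in (0.05, 0.45, 1.05):
        x = 5.0
        for _ in range(20):
            x -= lr * 2 * x
        print(lr, x)
    # lr = 0.05: slow but steady progress; lr = 0.45: fast convergence;
    # lr = 1.05: each step overshoots the minimum and the iterates diverge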
What is the probabilistic interpretation of L2 regularization?
MAP estimation with a Gaussian (normal) prior on the weights.
What is the probabilistic interpretation of L1 regularization?
MAP estimation with a Laplace prior on the weights.
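A sketch of why (standard derivation, notation assumed): the MAP estimate maximizes log p(y | X, w) + log p(w), and the log-prior becomes the regularization term.

    Gaussian prior  w ~ N(0, tau^2 I):    log p(w) = -||w||_2^2 / (2 tau^2) + const  ->  L2 (ridge) penalty
    Laplace prior   w_i ~ Laplace(0, b):  log p(w) = -||w||_1 / b + const            ->  L1 (lasso) penalty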
What do Dropout, Early stopping, data augmentation and normalization techniques have in common?
All of them have regularizing effects.
What kind of regularization do we get by adding noise to the input?
An L2 (weight decay) penalty: for a linear model with squared loss, adding zero-mean Gaussian noise to the inputs is equivalent in expectation to adding an L2 penalty on the weights.
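For a linear model with squared loss this can be seen directly (a sketch; eps is zero-mean noise with covariance sigma^2 I):

    E_eps[ (w^T (x + eps) - y)^2 ] = (w^T x - y)^2 + sigma^2 ||w||_2^2

i.e. in expectation the noisy loss equals the clean loss plus an L2 (weight decay) penalty.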
What is SGD with momentum?
We add a fraction of the previous update (the velocity) to the current gradient step, so each update is a decaying sum of past gradients.
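A minimal sketch of the update (the hyperparameters and the random "gradients" are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    w, v = np.zeros(3), np.zeros(3)
    beta, lr = 0.9, 0.01                       # hypothetical hyperparameters

    for _ in range(100):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        v = beta * v + g                       # velocity: fraction of the previous step plus the new gradient
        w -= lr * v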
What is the idea behind Adagrad?
If the gradient along dimension i is given by g_i, we divide the update in that dimension by sqrt(sum of the squared gradients g_i^2 accumulated over all previous steps). This results in a different effective step size for each dimension.
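A minimal sketch of the Adagrad update (hyperparameters and "gradients" are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    w, s = np.zeros(3), np.zeros(3)
    lr, eps = 0.1, 1e-8                        # hypothetical hyperparameters

    for _ in range(100):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        s += g ** 2                            # per-dimension sum of squared gradients
        w -= lr * g / (np.sqrt(s) + eps)       # per-dimension effective step size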
What is the idea behind RMSprop (AdaDelta)?
The same as Adagrad, but using an exponentially decaying average of the squared gradients instead of the accumulated sum.
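The only change from the Adagrad sketch above is the accumulator (again a hypothetical sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    w, s = np.zeros(3), np.zeros(3)
    rho, lr, eps = 0.9, 0.01, 1e-8             # hypothetical hyperparameters

    for _ in range(100):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        s = rho * s + (1 - rho) * g ** 2       # decaying average instead of Adagrad's running sum
        w -= lr * g / (np.sqrt(s) + eps)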
What is the idea behind Adam?
Combining RMSprop (per-dimension step sizes from a decaying average of squared gradients) and momentum (a decaying average of the gradients), typically with bias correction.
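A minimal sketch of the Adam update combining the two ideas (the hyperparameter values are the commonly used defaults, assumed here):

    import numpy as np

    rng = np.random.default_rng(0)
    w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
    lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8

    for t in range(1, 101):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        m = b1 * m + (1 - b1) * g              # momentum: decaying average of gradients
        v = b2 * v + (1 - b2) * g ** 2         # RMSprop: decaying average of squared gradients
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)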