Introduction Flashcards
What is machine learning?
A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.
Will regression with the L1 or L2 loss be more affected by outliers?
L2, since squaring the residuals makes large errors dominate the loss.
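A minimal sketch (assuming NumPy) of why: a single outlier's contribution grows linearly under L1 but quadratically under L2.

    import numpy as np

    residuals = np.array([0.1, -0.2, 0.1, 10.0])  # last residual is an outlier
    l1 = np.abs(residuals).sum()   # 10.4   -> outlier is ~96% of the loss
    l2 = (residuals ** 2).sum()    # 100.06 -> outlier is ~99.9% of the loss
    print(l1, l2)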
What can we do in the linear regression setting if XX^T isn’t invertible (we have more features than datapoints)?
Add a regularization term to the loss function; e.g. L2 (ridge) regularization replaces XX^T with XX^T + λI, which is invertible for any λ > 0.
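A small sketch (NumPy, with made-up shapes) of the ridge fix, where X is d×n with columns as datapoints and d > n:

    import numpy as np

    d, n, lam = 5, 3, 0.1            # more features (d) than datapoints (n)
    X = np.random.randn(d, n)        # columns are datapoints
    y = np.random.randn(n)

    # XX^T is d x d with rank at most n < d, so it is singular;
    # adding lam * I makes it positive definite and hence invertible.
    w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)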
What is the generalization gap?
The difference between the generalization (test) error and the training error.
Why do we usually not use the step (0-1) activation function?
It is hard to optimize, since the step function isn’t continuous and its derivative is zero everywhere except at the jump, so gradients carry no useful information.
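A brief illustration (assuming NumPy): the step function gives no gradient signal, which is why smooth surrogates like the sigmoid are used instead.

    import numpy as np

    def step(z):
        return (z > 0).astype(float)     # derivative is 0 almost everywhere

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))  # smooth; derivative sigmoid(z) * (1 - sigmoid(z)) is never 0

    z = np.linspace(-3.0, 3.0, 7)
    print(step(z))
    print(sigmoid(z))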
What is the approximation error?
The error coming from the choice of function class: if the model is too “simple”, it can’t approximate the underlying function perfectly.
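A toy sketch (NumPy, hypothetical data) of approximation error: a linear model can never fit a sine curve, even with unlimited noise-free data.

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 200)
    y = np.sin(x)                        # true function, no noise

    # The best straight line still leaves irreducible error:
    # the model class is too simple for the target.
    coeffs = np.polyfit(x, y, deg=1)
    y_hat = np.polyval(coeffs, x)
    print("approximation error (MSE):", np.mean((y - y_hat) ** 2))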
What is estimation error (or generalization gap)?
The error from using the empirical mean over a finite training sample instead of the true expectation.
What is optimization error?
The error resulting from finding a local, not global minimizer.
Why would we use SGD instead of GD optimizer?
The computational and memory cost per update is much smaller for SGD. (The gradient noise in SGD can also lead to more exploration, helping escape poor local minima and saddle points.)
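A minimal sketch (NumPy, made-up least-squares objective) contrasting the per-step cost: GD touches all n points, SGD only a minibatch.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 20))
    y = rng.standard_normal(10_000)
    w, lr, batch = np.zeros(20), 0.01, 32

    # GD: one step uses the gradient over all 10,000 points.
    grad_full = X.T @ (X @ w - y) / len(y)

    # SGD: one step uses a random minibatch -- far cheaper, but noisy.
    idx = rng.choice(len(y), size=batch, replace=False)
    grad_mini = X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * grad_mini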
What is the advantage of backprop?
Backprop describes how to calculate the gradient in an algorithmic way (reverse-mode differentiation), and by caching intermediate results it avoids calculating the same quantities several times.
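A tiny sketch (plain Python, a hypothetical two-layer scalar network) of the reuse backprop exploits: the cached activation h and the upstream gradient dy_hat are each computed once and shared.

    import math

    def forward_backward(x, w1, w2, y):
        # Forward pass: cache intermediate values.
        h = math.tanh(w1 * x)            # hidden activation
        y_hat = w2 * h                   # network output
        loss = 0.5 * (y_hat - y) ** 2

        # Backward pass: dy_hat is computed once and reused for both gradients.
        dy_hat = y_hat - y
        dw2 = dy_hat * h                         # reuses cached h
        dw1 = dy_hat * w2 * (1 - h ** 2) * x     # reuses dy_hat and h
        return loss, dw1, dw2

    print(forward_backward(x=1.0, w1=0.5, w2=-0.3, y=1.0))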
How can we guarantee that a minimum is a global minimum?
If the function is convex, every local minimum is a global minimum. (For a twice-differentiable function, convexity is equivalent to the Hessian being positive semi-definite everywhere.)
Why does gradient descent work in neural networks, even though they are non-convex?
In practice, the difference in loss between the local minima gradient descent finds and the global minimum is usually quite small.
How is the optimization affected by small and large step sizes? (See the sketch below.)
Small:
slower initial convergence, but a more accurate end result
Large:
faster initial convergence, but a less accurate end result (the iterates may oscillate around or overshoot the minimum; far too large a step size can even diverge)
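A small sketch (plain Python, gradient descent on the hypothetical 1-D function f(w) = w^2) showing both regimes, plus divergence when the step is far too large:

    def gd(lr, steps=20, w=1.0):
        for _ in range(steps):
            w -= lr * 2 * w       # gradient of f(w) = w^2 is 2w
        return w

    print(gd(0.01))   # small step: still far from the minimum after 20 steps
    print(gd(0.45))   # larger step: essentially converged to 0
    print(gd(1.10))   # too large: the iterates oscillate and diverge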
What is the probabilistic interpretation of L2 regularization
MAP estimation with a Gaussian (normal) prior on the weights.
What is the probabilistic interpretation of L1 regularization?
MAP estimation with a Laplace prior on the weights.
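A short derivation (standard result, assuming Gaussian observation noise and an i.i.d. prior with scale τ) connecting both cards: the negative log prior becomes the regularizer.

    \hat{w}_{\mathrm{MAP}} = \arg\max_w p(w \mid \mathcal{D})
                           = \arg\min_w \big[ -\log p(\mathcal{D} \mid w) - \log p(w) \big]

    Gaussian prior: p(w) \propto e^{-\|w\|_2^2 / (2\tau^2)}
        \Rightarrow -\log p(w) = \tfrac{1}{2\tau^2} \|w\|_2^2 + \mathrm{const}   (L2 penalty)

    Laplace prior:  p(w) \propto e^{-\|w\|_1 / \tau}
        \Rightarrow -\log p(w) = \tfrac{1}{\tau} \|w\|_1 + \mathrm{const}        (L1 penalty)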