Introduction Flashcards

1
Q

What is machine learning?

A

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

2
Q

Will regression with the L1 or the L2 loss be more affected by outliers?

A

L2. Squaring the residuals makes large errors dominate the loss (a residual of 10 contributes 100 to the L2 loss but only 10 to the L1 loss), so outliers pull the fit much harder.

3
Q

What can we do in the linear regression setting if XX^T isn't invertible (we have more features than datapoints)?

A

Add a regularization term to the loss function, e.g. ridge regression adds a penalty lambda ||w||^2, which replaces XX^T with XX^T + lambda I and makes it invertible.
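
A minimal numpy sketch of this fix (an illustration, not from the course; it assumes X is a d-by-n matrix with datapoints as columns, d > n, and y is the length-n target vector):

import numpy as np

def ridge_fit(X, y, lam=1e-2):
    # X: (d, n) features with datapoints as columns, y: (n,) targets
    d = X.shape[0]
    # X @ X.T is singular when d > n; adding lam * I makes it invertible
    A = X @ X.T + lam * np.eye(d)
    # Solve (XX^T + lam*I) w = X y instead of forming an explicit inverse
    return np.linalg.solve(A, X @ y)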

4
Q

What is the generalization gap?

A

The difference between the generalization (test) error and the training error.

5
Q

Why do we usually not use the step (0-1) activation function?

A

It is hard to optimize: the step function is discontinuous at 0 and its derivative is zero everywhere else, so gradient-based methods get no useful signal.

6
Q

What is the approximation error?

A

The error coming from the choice of function class: if the model is too “simple”, it cannot approximate the underlying function perfectly, no matter how much data we have.

7
Q

What is estimation error (or generalization gap)?

A

The error from minimizing the empirical risk (an average over a finite sample) instead of the true expectation over the data distribution.

8
Q

What is optimization error?

A

The error resulting from the optimizer finding a local, not global, minimizer (or stopping before full convergence).
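
Taken together with the two previous cards, these terms give the usual (informal) decomposition: total excess error = approximation error + estimation error + optimization error.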

9
Q

Why would we use SGD instead of the GD optimizer?

A

The per-step computational and memory cost is much smaller for SGD, since each step uses only a single example or small mini-batch instead of the full dataset. (The gradient noise can also make SGD explore more and escape poor minima.)

10
Q

What is the advantage of backprop?

A

Backprop gives an algorithmic way to compute the gradient: it applies the chain rule layer by layer, reusing intermediate results from the forward pass, so the same quantities are not computed several times.

11
Q

How can we guarantee that a minimum is a global minimum?

A

If the function is convex (for a twice-differentiable function this is equivalent to the Hessian being positive semi-definite everywhere), then every local minimum is a global minimum.

12
Q

Why does gradient descent work in neural networks, even though they are non-convex?

A

Empirically, in large networks the local minima that gradient descent finds tend to have loss values close to the global minimum, so the difference between local and global minima is usually quite small.

13
Q

How is the optimization affected by small and large step size?

A

Small:
Slower initial convergence, more accurate end result.
Large:
Faster initial convergence, less accurate end result (and risk of oscillation or divergence if too large).

14
Q

What is the probabilistic interpretation of L2 regularization?

A

MAP estimation with a Gaussian (normal) prior on the weights.
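
A hedged sketch of why: with a Gaussian likelihood p(y | X, w) and a Gaussian prior p(w) proportional to exp(-||w||^2 / (2 tau^2)), the MAP estimate maximizes log p(y | X, w) + log p(w), which up to constants is the same as minimizing the squared loss plus lambda ||w||^2, with lambda determined by the noise and prior variances.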

15
Q

What is the probabilistic interpretation of L1 regularization?

A

MAP estimation with a Laplace prior on the weights.
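
Analogous sketch to the previous card: a Laplace prior p(w) proportional to exp(-sum_i |w_i| / b) contributes a negative log-prior proportional to sum_i |w_i|, i.e. an L1 penalty.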

16
Q

What do dropout, early stopping, data augmentation and normalization techniques have in common?

A

All of them have regularizing effects.

17
Q

What kind of regularization do we get by adding noise to the input?

A

Weight decay (L2): for a linear model with squared loss, adding small Gaussian noise to the inputs is equivalent in expectation to an L2 penalty on the weights.

18
Q

What is SGD with momentum?

A

We add a fraction of the previous update (the “velocity”) to the current gradient step, so the update builds up speed along directions where gradients consistently agree and damps oscillations.
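
One common form of the update (a sketch; mu is the momentum coefficient, eta the step size, g_t the gradient at step t):
v_{t+1} = mu * v_t + g_t
w_{t+1} = w_t - eta * v_{t+1}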

19
Q

What is the idea behind Adagrad?

A

If the gradient along dimension i at step t is g_{t,i}, we accumulate the sum of squared gradients G_i = sum_t g_{t,i}^2 and divide the update in that dimension by sqrt(G_i) (plus a small eps). This results in a different, adaptive step size per dimension: dimensions with large accumulated gradients get smaller steps.
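
A sketch of the per-dimension update (eta is the step size, eps a small constant for numerical stability):
G_i = G_i + g_{t,i}^2
w_i = w_i - eta * g_{t,i} / (sqrt(G_i) + eps)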

20
Q

What is the idea behind RMSprop (AdaDelta)?

A

The same idea as Adagrad, but using an exponentially decaying average of the squared gradients instead of the full sum, so the effective step size does not keep shrinking towards zero.
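
A sketch of the corresponding update (rho is the decay rate, often around 0.9):
E[g^2]_i = rho * E[g^2]_i + (1 - rho) * g_{t,i}^2
w_i = w_i - eta * g_{t,i} / (sqrt(E[g^2]_i) + eps)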

21
Q

What is the idea behind Adam?

A

Combining RMSprop and momentum: Adam keeps a decaying average of the gradients (first moment, like momentum) and of the squared gradients (second moment, like RMSprop), with bias correction for both.
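
A sketch of the update (beta1, beta2 are decay rates, commonly 0.9 and 0.999; t is the step count):
m = beta1 * m + (1 - beta1) * g_t
v = beta2 * v + (1 - beta2) * g_t^2
m_hat = m / (1 - beta1^t), v_hat = v / (1 - beta2^t)
w = w - eta * m_hat / (sqrt(v_hat) + eps)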