Introduction Flashcards
What is machine learning?
A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.
Is regression with the L1 or the L2 loss more affected by outliers?
The L2 loss, since squaring makes large residuals dominate: a residual of 10 contributes 100 to the L2 loss but only 10 to the L1 loss.
What can we do in the linear regression setting if XX^T isn't invertible (e.g., when we have more features than data points)?
Add a regularization term to the loss function, for example an L2 (ridge) penalty.
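A minimal numpy sketch of the fix (the data and the value of lambda are made up for illustration): with an L2 penalty, the matrix to invert becomes XX^T + lambda*I, which is invertible for any lambda > 0.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 50, 20                              # more features than data points: X X^T is singular
    X = rng.standard_normal((d, n))            # columns are data points, matching the X X^T convention above
    y = rng.standard_normal(n)

    lam = 0.1                                  # regularization strength (hypothetical value)
    w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)   # ridge solution, well defined for lam > 0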
What is the generalization gap?
The difference between the generalization (test) error and the training error.
Why do we usually not use the step (0-1) activation function?
It is hard to optimize with gradient-based methods, since the step function is not continuous and its gradient is zero everywhere it is defined.
What is the approximation error?
The error coming from the choice of function class: if the class is too "simple", it cannot approximate the underlying function perfectly.
What is estimation error (or generalization gap)?
The error from minimizing the empirical mean over the training data instead of the true expectation.
What is optimization error?
The error resulting from finding a local, rather than the global, minimizer (or from stopping the optimizer before convergence).
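These three errors are often combined into a single decomposition of the excess risk (a standard textbook decomposition; the notation is assumed, not from the cards). With f* the best possible predictor, f_F the best predictor in the model class F, f_hat the empirical risk minimizer, and f_tilde the model the optimizer actually returns:

    R(f_tilde) - R(f*) = [R(f_F) - R(f*)]        (approximation error)
                       + [R(f_hat) - R(f_F)]     (estimation error)
                       + [R(f_tilde) - R(f_hat)] (optimization error)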
Why would we use SGD instead of GD optimizer?
The computational and memory cost per update is much smaller for SGD. (The gradient noise in SGD can also lead to more exploration, e.g., escaping poor local minima or saddle points.)
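A minimal sketch of the difference for linear least squares (all names and numbers are hypothetical): GD uses the gradient over the full dataset each step, SGD only over a small random mini-batch.

    import numpy as np

    def grad(w, X, y):
        # gradient of 0.5 * mean((X @ w - y)**2); rows of X are data points
        return X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
    w, lr, batch = np.zeros(5), 0.1, 32

    for step in range(100):
        # GD would use the full data: w -= lr * grad(w, X, y)
        idx = rng.integers(0, len(y), size=batch)   # SGD: sample a mini-batch
        w -= lr * grad(w, X[idx], y[idx])           # much cheaper per update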
What is the advantage of backprop?
Backprop describes how to calculate the gradient in an algorithmic way, applying the chain rule layer by layer, and it avoids recalculating the same intermediate quantities several times.
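A tiny illustration of the idea on a hypothetical two-layer network with squared loss: the forward pass caches intermediate values (z, h, yhat), and the backward pass reuses them via the chain rule instead of deriving every gradient from scratch.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(3), 1.0
    W1, w2 = rng.standard_normal((4, 3)), rng.standard_normal(4)

    # forward pass: cache the intermediate values
    z = W1 @ x                             # pre-activation
    h = np.tanh(z)                         # hidden activation
    yhat = w2 @ h                          # network output
    loss = 0.5 * (yhat - y) ** 2

    # backward pass: the chain rule reuses the cached values
    dyhat = yhat - y                       # dL/dyhat
    dw2 = dyhat * h                        # dL/dw2, reuses h
    dh = dyhat * w2                        # dL/dh
    dz = dh * (1 - h ** 2)                 # dL/dz, tanh'(z) = 1 - tanh(z)^2 reuses h
    dW1 = np.outer(dz, x)                  # dL/dW1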
How can we guarantee that a minimum is a global minimum?
If the function is convex (for a twice-differentiable function, this is equivalent to the Hessian being positive semi-definite everywhere), then every local minimum is a global minimum.
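A small numerical sketch (made-up matrix): for a quadratic f(w) = 0.5 * w^T A w, checking convexity amounts to checking that the Hessian A has no negative eigenvalues.

    import numpy as np

    B = np.random.default_rng(0).standard_normal((4, 4))
    A = B @ B.T                                # B B^T is always positive semi-definite
    eigs = np.linalg.eigvalsh(A)               # Hessian of f(w) = 0.5 * w^T A w is A
    print(np.all(eigs >= -1e-10))              # True: f is convex, so any minimum is global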
Why does gradient descent work in neural networks, even though they are non-convex?
Because the difference in loss between the local minima that gradient descent finds and the global minimum is usually quite small in large networks.
How is the optimization affected by small and large step size?
Small step size: slower initial convergence, but a more accurate end result.
Large step size: faster initial convergence, but a less accurate end result (with a fixed step size the iterates may oscillate around, or even overshoot, the minimum); see the sketch below.
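A minimal sketch of the trade-off on f(x) = x^2 (gradient 2x), with made-up step sizes:

    # gradient descent on f(x) = x^2 with different fixed step sizes
    for lr in (0.05, 0.45, 1.05):
        x = 5.0
        for _ in range(20):
            x -= lr * 2 * x
        print(lr, x)
    # lr = 0.05: slow but steady progress; lr = 0.45: fast convergence;
    # lr = 1.05: each step overshoots the minimum and the iterates diverge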
What is the probabilistic interpretation of L2 regularization?
MAP estimation with a Gaussian (normal) prior on the weights.
What is the probabilistic interpretation of L1 regularization?
MAP estimation with a Laplace prior on the weights.
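A sketch of why (standard derivation, notation assumed): the MAP estimate maximizes log p(y | X, w) + log p(w), and the log-prior becomes the regularization term.

    Gaussian prior  w ~ N(0, tau^2 I):    log p(w) = -||w||_2^2 / (2 tau^2) + const  ->  L2 (ridge) penalty
    Laplace prior   w_i ~ Laplace(0, b):  log p(w) = -||w||_1 / b + const            ->  L1 (lasso) penalty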
What do Dropout, Early stopping, data augmentation and normalization techniques have in common?
All of them have regularizing effects.
What kind of regularization do we get by adding noise to the input?
An L2 (weight decay) penalty: for a linear model with squared loss, adding zero-mean Gaussian noise to the inputs is equivalent in expectation to adding an L2 penalty on the weights.
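For a linear model with squared loss this can be seen directly (a sketch; eps is zero-mean noise with covariance sigma^2 I):

    E_eps[ (w^T (x + eps) - y)^2 ] = (w^T x - y)^2 + sigma^2 ||w||_2^2

i.e. in expectation the noisy loss equals the clean loss plus an L2 (weight decay) penalty.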
What is SGD with momentum?
We add a fraction of the previous update (the velocity) to the current gradient step, so each update is a decaying sum of past gradients.
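A minimal sketch of the update (the hyperparameters and the random "gradients" are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    w, v = np.zeros(3), np.zeros(3)
    beta, lr = 0.9, 0.01                       # hypothetical hyperparameters

    for _ in range(100):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        v = beta * v + g                       # velocity: fraction of the previous step plus the new gradient
        w -= lr * v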
What is the idea behind Adagrad?
If the gradient along dimension i is given by g_i, we divide the update in that dimension by sqrt(sum of the squared gradients g_i^2 accumulated over all previous steps). This results in a different effective step size for each dimension.
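A minimal sketch of the Adagrad update (hyperparameters and "gradients" are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    w, s = np.zeros(3), np.zeros(3)
    lr, eps = 0.1, 1e-8                        # hypothetical hyperparameters

    for _ in range(100):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        s += g ** 2                            # per-dimension sum of squared gradients
        w -= lr * g / (np.sqrt(s) + eps)       # per-dimension effective step size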
What is the idea behind RMSprop (AdaDelta)?
The same as Adagrad, but using an exponentially decaying average of the squared gradients instead of the accumulated sum.
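The only change from the Adagrad sketch above is the accumulator (again a hypothetical sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    w, s = np.zeros(3), np.zeros(3)
    rho, lr, eps = 0.9, 0.01, 1e-8             # hypothetical hyperparameters

    for _ in range(100):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        s = rho * s + (1 - rho) * g ** 2       # decaying average instead of Adagrad's running sum
        w -= lr * g / (np.sqrt(s) + eps)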
What is the idea behind Adam?
Combining RMSprop (per-dimension step sizes from a decaying average of squared gradients) and momentum (a decaying average of the gradients), typically with bias correction.
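A minimal sketch of the Adam update combining the two ideas (the hyperparameter values are the commonly used defaults, assumed here):

    import numpy as np

    rng = np.random.default_rng(0)
    w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
    lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8

    for t in range(1, 101):
        g = rng.standard_normal(3)             # stands in for a stochastic gradient
        m = b1 * m + (1 - b1) * g              # momentum: decaying average of gradients
        v = b2 * v + (1 - b2) * g ** 2         # RMSprop: decaying average of squared gradients
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)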