DL-01 - Introduction (+ impl) Flashcards

1
Q

DL-01a - Introduction

Who was behind AlexNet? (1 + 2)

A

Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton.

2
Q

DL-01a - Introduction

What’s the formula for the sigmoid function?

A

σ(x) = 1 / (1 + e^(-x))

(See image)
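
A minimal sketch in plain Python/NumPy (my own illustration, not part of the card):

    import numpy as np

    def sigmoid(x):
        # σ(x) = 1 / (1 + e^(-x)); squashes any real input into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ≈ [0.0067, 0.5, 0.9933]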

3
Q

DL-01a - Introduction

What formula is this? (See image)

A

Sigmoid

4
Q

DL-01a - Introduction

Why is the sigmoid activation function less used these days? (2)

A
  • Vanishing gradient problem
  • Non-zero centered output (range 0-1)
5
Q

DL-01a - Introduction

Why is the non-zero centered output of a sigmoid function a problem?

A

Because sigmoid outputs are always positive, the gradients on the next layer’s weights all share the same sign, which causes zig-zagging dynamics during optimization.

6
Q

DL-01a - Introduction

What is one cause of vanishing gradients with the sigmoid function?

A

Saturation at either end of the tails: for large |x| the derivative σ'(x) = σ(x)(1 - σ(x)) is nearly zero, so almost no gradient flows backwards.
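
A quick numeric check of the tails (my own sketch, not part of the card):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # σ'(x) = σ(x) * (1 - σ(x)); at most 0.25 (at x = 0), nearly 0 in the tails
        s = sigmoid(x)
        return s * (1.0 - s)

    print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # ≈ [4.5e-05, 0.25, 4.5e-05]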

7
Q

DL-01a - Introduction

What formula is this? (See image)

A

Tanh.

8
Q

DL-01a - Introduction

What’s the formula for tanh?

A

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

(See image)

9
Q

DL-01a - Introduction

How is tanh related to sigmoid?

A

Tanh is a rescaled and shifted sigmoid: tanh(x) = 2σ(2x) - 1.
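
A small numeric check of that identity (my own sketch):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-4.0, 4.0, 9)
    # tanh(x) = 2 * sigmoid(2x) - 1
    print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
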
10
Q

DL-01a - Introduction

How do tanh and sigmoid compare? (3)

A
  • Tanh solves the “zero centered” problem.
  • Both gradients still saturate.
  • Tanh is generally preferred over sigmoid.
11
Q

DL-01a - Introduction

What are the pros of the ReLU function? (4)

A
  • No vanishing gradient for positive inputs (no saturation)
  • Fast convergence
  • Simple implementation
  • Better convergence performance than sigmoid
12
Q

DL-01a - Introduction

What are the cons of the ReLU function? (1)

A

The “dying ReLU” problem: a large gradient update can push a unit’s pre-activation permanently into the negative region, where the gradient is zero, so the unit stops updating.
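
A tiny illustration of why such a unit stops updating (plain NumPy, my own sketch):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        # gradient of ReLU: 1 for x > 0, exactly 0 for x <= 0
        return (x > 0).astype(float)

    pre_act = np.array([-3.0, -0.5, 2.0])
    print(relu(pre_act))       # [0. 0. 2.]
    print(relu_grad(pre_act))  # [0. 0. 1.] -> the two “dead” units receive no gradient at all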

13
Q

DL-01a - Introduction

What is parametric/leaky ReLU?

A

ReLU with a small non-zero slope in the negative region. (See image)

  • Parametric (PReLU): the slope is a learnable param.
  • Leaky: the slope is a pre-defined param.
14
Q

DL-01a - Introduction

What’s the formula for PReLU? Is alpha a param or a hyperparam?

A

f(x) = x if x > 0, else αx

Alpha is a param and is learned during training.

(See image)

15
Q

DL-01a - Introduction

What’s the formula for leaky ReLU? Is alpha a param or a hyperparam?

A

f(x) = x if x > 0, else αx

Alpha is a hyperparam, typically fixed at 0.01.

(See image)
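
A minimal sketch covering both variants (plain NumPy; treating α as a fixed constant for leaky ReLU and as a value that would be learned for PReLU):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # f(x) = x if x > 0, else alpha * x
        return np.where(x > 0, x, alpha * x)

    x = np.array([-2.0, -0.1, 3.0])
    print(leaky_relu(x))             # leaky: alpha fixed at 0.01
    print(leaky_relu(x, alpha=0.2))  # PReLU-style: alpha would be a learned parameter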

16
Q

DL-01a - Introduction

What activation function is this? (See image)

A

Softmax

17
Q

DL-01a - Introduction

What’s the formula for softmax?

A

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

(See image)
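
A minimal sketch (my own; subtracting the max is a standard numerical-stability trick, not part of the formula itself):

    import numpy as np

    def softmax(z):
        # softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
        z = z - np.max(z)   # stability only; the result is unchanged
        e = np.exp(z)
        return e / np.sum(e)

    print(softmax(np.array([1.0, 2.0, 3.0])))  # ≈ [0.09, 0.24, 0.67], sums to 1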

18
Q

DL-01a - Introduction

How is momentum used in gradient descent?

A

Momentum is used in gradient descent to accelerate convergence by adding a fraction of the previous update to the current update.
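
A sketch of one common formulation (the names v, beta and lr are my own, as is the toy example):

    import numpy as np

    def momentum_step(w, grad, v, lr=0.1, beta=0.9):
        # velocity = a fraction of the previous update plus the new (scaled) gradient
        v = beta * v - lr * grad
        return w + v, v

    # toy quadratic loss L(w) = 0.5 * w^2, so grad = w
    w, v = np.array([5.0]), np.array([0.0])
    for _ in range(50):
        w, v = momentum_step(w, w, v)
    print(w)  # magnitude has shrunk toward the minimum at 0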

19
Q

DL-01a - Introduction

What does momentum help with?

A

Overcoming local minima and oscillations.

20
Q

DL-01a - Introduction

How are momentum and Nesterov different?

A
  • Momentum speeds up gradient descent.
  • Nesterov momentum adds a corrective factor by evaluating the gradient at the look-ahead point (after the momentum step), giving a better approximation (see the sketch below).
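
A sketch of the difference (grad_fn, lr and beta are my own names; both functions assume a gradient function is available):

    def momentum_step(w, v, grad_fn, lr=0.1, beta=0.9):
        # classical momentum: gradient evaluated at the current point w
        v = beta * v - lr * grad_fn(w)
        return w + v, v

    def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
        # Nesterov: gradient evaluated at the look-ahead point w + beta * v,
        # which corrects the step before it is applied
        v = beta * v - lr * grad_fn(w + beta * v)
        return w + v, v

    w, v = 5.0, 0.0
    for _ in range(50):
        w, v = nesterov_step(w, v, lambda p: p)  # gradient of 0.5 * p^2 is p
    print(w)  # much closer to the minimum at 0 than the starting point 5.0
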
21
Q

DL-01a - Introduction

What is RMSprop?

A

RMSprop is an adaptive learning rate optimization algorithm for gradient descent in deep learning models.

22
Q

DL-01a - Introduction

How does RMSprop work?

A

RMSprop works by adapting the learning rate for each weight based on the magnitudes of their gradients, using a running average of squared gradients.
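
A sketch of the per-weight update (decay, lr and eps are conventional names, chosen by me; the toy loop is mine as well):

    import numpy as np

    def rmsprop_step(w, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
        # running average of squared gradients, one value per weight
        avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
        # weights with large recent gradients get proportionally smaller steps
        w = w - lr * grad / (np.sqrt(avg_sq) + eps)
        return w, avg_sq

    w, avg_sq = np.array([5.0]), np.array([0.0])
    for _ in range(2000):
        w, avg_sq = rmsprop_step(w, w, avg_sq, lr=0.01)  # gradient of 0.5 * w^2 is w
    print(w)  # hovers near the minimum at 0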

23
Q

DL-01a - Introduction

How does ADAM work?

A

The ADAM optimizer works by adaptively adjusting the learning rate for each parameter using the first and second moments of the gradients.
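
A sketch of the update rule (beta1, beta2 and eps follow the usual defaults; the function and toy loop are my own illustration):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # first moment: running mean of gradients (momentum-like)
        m = beta1 * m + (1.0 - beta1) * grad
        # second moment: running mean of squared gradients (RMSprop-like)
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        # bias correction for the zero-initialised moments
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # per-parameter adaptive step
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    w, m, v = np.array([5.0]), np.array([0.0]), np.array([0.0])
    for t in range(1, 3001):
        w, m, v = adam_step(w, w, m, v, t, lr=0.01)  # gradient of 0.5 * w^2 is w
    print(w)  # has moved from 5.0 to near the minimum at 0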