Lesson 2 - From shallow to deep neural networks Flashcards
What does an artificial neuron do?
It defines a linear (affine) projection of the data
Why do we choose to add neurons in parallel (and by such creating layers)?
- Highly optimizable
–> algorithmically via smart matrix multiplication
–> hardware-wise via GPUs, TPUs
- Enable powerful compositions
What does it mean when you have two neurons in the same layer that have exactly the same weights? In other words, what does it mean when you have repetition?
Something is not right: the two neurons are redundant, they compute exactly the same output (see the small sketch below)
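A tiny NumPy sketch (my own illustration, not from the lesson) of that redundancy: if two rows of a layer's weight matrix are identical, the corresponding neurons always produce identical outputs, so one of them adds no information:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # 5 samples, 3 input features

W = rng.normal(size=(4, 3))    # weight matrix of a layer with 4 neurons
W[1] = W[0]                    # neuron 1 gets exactly the same weights as neuron 0
b = np.zeros(4)

out = x @ W.T + b              # affine projection of the data
print(np.allclose(out[:, 0], out[:, 1]))   # True: the two neurons always agree
```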
How do you define an epoch?
An epoch is one full pass over the training data: the model has seen the entire dataset once
Why don’t we still use the perceptron learning algorithm?
It had some requirements:
- examples (with labels)
- a way to evaluate the goodness of the model (measure performance)
- stopping criteria
In particular, its stopping criterion is only guaranteed to be met when the data is linearly separable, and the update rule does not extend to networks with hidden layers, which is why training now minimizes a differentiable loss instead
What is the Sigmoid function?
It is an activation function that projects your input to a value in the range of [0, 1]
- Introduces non-linear behavior
- Scaled output [0, 1]
- Simple derivatives
- Saturates: vanishing derivatives
- Applied point-wise
Very smooth from about -3 to 3, then basically flat (saturated)
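A minimal NumPy sketch (my own, not from the slides) of the Sigmoid and its derivative, showing the [0, 1] output, the simple derivative, and the saturation (vanishing derivative) for large |x|:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + exp(-x)), applied point-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # simple derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
print(sigmoid(x))             # squashed into (0, 1)
print(sigmoid_derivative(x))  # near 0 for large |x|: saturation / vanishing derivative
```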
What are some characteristics of the Cross Entropy function?
It is a loss function
- the negated logarithm of the probability of the correct prediction (summed over the entire dataset)
- composable with Sigmoid
- Numerically unstable (but there is a way to address this)
- additive with respect to samples
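A small sketch (mine, with made-up example probabilities) of the cross entropy on predicted probabilities, showing that it is additive over samples and numerically unstable when the correct class gets probability close to 0:

```python
import numpy as np

def cross_entropy(probs, labels):
    # probs: (n_samples, n_classes) predicted probabilities
    # labels: (n_samples,) integer class indices
    # loss = -log(probability assigned to the correct class), summed over samples
    correct = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log(correct))

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))        # additive: -log(0.9) - log(0.8)

# numerically unstable: if the correct class gets probability ~0, log blows up
bad = np.array([[1.0, 0.0]])
print(cross_entropy(bad, np.array([1])))   # inf
```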
What is the basic order of layers in a basic neural network?
first, some number of repetitions of the following combination:
(linear layer, activation function)
then, at the end, you have the loss function
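A minimal forward-pass sketch of that layer order; the layer sizes and random weights are placeholder assumptions, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# assumed sizes: 4 inputs -> 8 hidden units -> 3 classes
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

x = rng.normal(size=(2, 4))        # batch of 2 samples
h = sigmoid(x @ W1 + b1)           # (linear layer, activation function)
logits = h @ W2 + b2               # final linear layer

# at the end: the loss function (here a naive softmax + cross entropy)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = np.array([0, 2])
loss = -np.log(probs[np.arange(2), labels]).sum()
print(loss)
```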
What are some characteristics of the Softmax function?
- generalization of the Sigmoid
- does not work properly with sparse outputs
- does not scale properly with respect to the number of classes (k)
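A small sketch (mine) of the Softmax, including a check that on the logits [x, 0] it reduces exactly to the Sigmoid of x, i.e. it generalizes the Sigmoid:

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - z.max())   # shift by the max for numerical safety
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.7
print(softmax(np.array([x, 0.0]))[0])  # equals sigmoid(x): softmax generalizes it
print(sigmoid(x))
```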
When you look at the characteristics of the Softmax function, it does not seem (that much) better than the Sigmoid. Why do we use it then?
Because combined with cross entropy, you have the following:
- generalization of the sigmoid
- becomes numerically stable
- simple, yet powerful (92% handwritten digit recognition)
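A hedged sketch of why the combination becomes numerically stable: the softmax and the logarithm of the cross entropy are fused via the log-sum-exp shift, so tiny probabilities and huge exponentials are never formed explicitly:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # fused softmax + cross entropy:
    # loss_i = log(sum_j exp(z_ij)) - z_i,label, computed on shifted logits
    shifted = logits - logits.max(axis=1, keepdims=True)      # log-sum-exp shift
    log_sum = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_sum
    return -log_probs[np.arange(len(labels)), labels].sum()

# large logits overflow a naive softmax, but the fused version is fine
logits = np.array([[1000.0, 0.0, -1000.0]])
labels = np.array([0])
print(softmax_cross_entropy(logits, labels))   # ~0.0, no overflow
```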
In the early days of deep neural networks, it was found that adding more and more layers did not lead to better results. At some point, the loss function became flat. What was the reason?
By using the Sigmoid, you can get the vanishing gradient effect (gradient close to 0):
the Sigmoid saturates, so its derivative is near 0, and backpropagation multiplies many of these small derivatives together, driving the gradients of the early layers towards 0
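A small numeric illustration (my own, assuming identity weights between layers) of that effect: the Sigmoid's derivative is at most 0.25, so chaining it over many layers multiplies many small factors and the gradient shrinks towards 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# chain many sigmoid layers (assume the linear layers in between are the identity)
x = 0.5
grad = 1.0
for depth in range(1, 21):
    x = sigmoid(x)
    grad *= x * (1.0 - x)      # chain rule: multiply by sigmoid'(previous value)
    if depth in (1, 5, 10, 20):
        print(depth, grad)      # the gradient factor collapses towards 0 with depth
```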
A new activation function was introduced: ReLU. What does it do and what do you know about it?
ReLU = Rectified Linear Unit, takes the max of the input and 0
- point-wise operation
- not linear, but piece-wise linear
- cuts the input space into polyhedra
- dead neurons can occur
- not differentiable at 0
- derivatives do not vanish
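A minimal sketch of ReLU and the (sub)gradient that is typically used; taking 0 at the non-differentiable point x = 0 is a common convention, not the only choice:

```python
import numpy as np

def relu(x):
    # point-wise: max(x, 0)
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere; at x == 0 we pick 0 by convention
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # derivative is 0 or 1, so it does not vanish for active units
```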
What can be the problem of having (too many) dead neurons?
You lose performance; the network can no longer make an informative decision
Dead neurons: a flow of activation enters, but what comes out is always 0 (so no gradient flows back through them)
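A tiny illustration (a constructed example, not from the lesson) of a dead ReLU neuron: a strongly negative bias keeps the pre-activation below 0 for every input, so the output, and the gradient through the neuron, is always 0:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))    # some input data
w = np.array([0.1, -0.2, 0.3])
b = -50.0                         # assumed: a bias pushed far into the negative

pre = x @ w + b
out = relu(pre)
print(out.max())            # 0.0 for every sample: the neuron is dead
print((pre > 0).mean())     # fraction of inputs where any gradient flows: 0.0
```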
What is the Universal Approximation Theorem?
Given a continuous function from the hypercube to a single real value,
a large enough network can approximate it (up to some error epsilon), but not necessarily represent it exactly.
–> does not provide guarantees about the "learnability" of such a network
–> the size of the network can grow exponentially w.r.t. the input dimension
What is the difference between deeper and wider architectures? Why would you choose one over the other?
Growth of the partitioning of the input space (number of linear regions):
- exponential in depth
- polynomial in width
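A hedged illustration (the classic "sawtooth" depth construction, not from the slides) of that difference: stacking a 2-neuron ReLU "hat" block d times produces 2^d linear pieces with only about 2d neurons, while a single hidden layer is known to need on the order of 2^d neurons for the same function:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # a "tent" built from 2 ReLU units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# composing the hat with itself (adding depth) doubles the number of linear
# pieces each time, while only adding 2 more ReLU neurons per extra layer
for depth in range(1, 5):
    def f(x, d=depth):
        for _ in range(d):
            x = hat(x)
        return x
    grid = np.linspace(0.0, 1.0, 2 ** depth + 1)
    print(depth, f(grid))   # values alternate 0, 1, 0, 1, ... : 2**depth pieces
```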