Lesson 2 - From shallow to deep neural networks Flashcards

1
Q

What does an artificial neuron do?

A

Defines a linear (affine) projection of the data.
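
A minimal sketch of that definition (plain NumPy with made-up numbers; the activation function that usually follows is covered in later cards):

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # one input sample with 3 features
w = np.array([0.1, 0.4, -0.2])   # the neuron's weights
b = 0.05                         # the neuron's bias

z = np.dot(w, x) + b             # the affine (linear + bias) projection of the data
print(z)                         # a single number, later fed to an activation function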

2
Q

Why do we choose to add neurons in parallel (and by doing so create layers)?

A
  • Highly optimizable
    –> algorithmically, via smart matrix multiplication (see the sketch below)
    –> hardware-wise, via GPUs and TPUs
  • Enables powerful compositions
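
A hedged sketch of why this is so optimizable: an entire layer is one matrix multiplication, which linear-algebra libraries and GPUs/TPUs run very efficiently (the shapes below are made-up example values):

import numpy as np

X = np.random.randn(32, 100)   # a batch of 32 samples with 100 features each
W = np.random.randn(100, 64)   # weights of 64 parallel neurons, stacked as columns
b = np.zeros(64)               # one bias per neuron

Z = X @ W + b                  # all 64 neurons applied to all 32 samples in one matrix multiplication
print(Z.shape)                 # (32, 64)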
3
Q

What does it mean when you have two neurons in the same layer that have exactly the same weights? In other words, what does it mean when you have repetition?

A

Something is not right: the two neurons are redundant, since they compute exactly the same output and add no new information.

4
Q

How do you define an epoch?

A

An epoch is one full pass over the training data: the model has seen the entire data set once.
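
A sketch of how an epoch typically appears in a training loop (the dataset, batch size, and the train_step placeholder are illustrative assumptions, not from the lesson):

# One epoch = the model has seen every training sample exactly once.
dataset = list(range(1000))              # placeholder training set of 1000 samples
batch_size = 32

for epoch in range(10):                  # 10 epochs: the data set is seen 10 times
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # train_step(batch)              # hypothetical: forward pass, loss, backward pass, update
    print(f"epoch {epoch + 1} done")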

5
Q

Why don’t we still use the perceptron learning algorithm?

A

It had some requirements:

  • examples (with labels)
  • a way to evaluate the goodness of the model (measure performance)
  • stopping criteria
6
Q

What is the Sigmoid function?

A

It is an activation function that maps your input to a value in the range [0, 1]

  • Introduces non-linear behavior
  • Scaled output [0, 1]
  • Simple derivatives
  • Saturates: vanishing derivatives
  • Applied point-wise

It changes smoothly between about −3 and 3, then it is basically flat (saturated)
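
A small sketch of the sigmoid and its derivative, illustrating the [0, 1] output, the simple derivative sigma(x) * (1 - sigma(x)), and the saturation that makes derivatives vanish for large |x| (plain NumPy, nothing lesson-specific assumed):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # output always in (0, 1)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)                      # simple closed-form derivative

xs = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
print(sigmoid(xs))             # ~[0.0000, 0.047, 0.5, 0.953, 1.0000]
print(sigmoid_derivative(xs))  # derivative ~0 at the flat ends: saturation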

7
Q

What are some characteristics of the Cross Entropy function?

A

It is a loss function

  • negation of the logarithm of the probability of the correct prediction (summed over the entire dataset)
  • composable with Sigmoid
  • numerically unstable (but there is a way to address this, see the sketch below)
  • additive with respect to samples
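
A hedged sketch of binary cross entropy for a sigmoid output, showing the negative log-probability of the correct prediction, the additivity over samples, and the instability when a predicted probability hits 0 or 1 (the epsilon clipping is one common fix, assumed here rather than taken from the lesson):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip to avoid log(0), which is the source of the numerical instability.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_sample.sum()               # additive with respect to samples

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.99])       # e.g. outputs of a sigmoid
print(binary_cross_entropy(y_true, y_pred))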
8
Q

What is the basic order of layers in a neural network?

A

First, a number of repetitions of the combination
(linear layer, activation function),
then at the end you have the loss function.
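
A sketch of that ordering as a forward pass: repeated (linear layer, activation function) blocks, with the loss at the end (the layer sizes, sigmoid activations, and log loss are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                       # one input sample

W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)     # linear layer 1
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)     # linear layer 2

h = sigmoid(x @ W1 + b1)                          # (linear layer, activation function)
p = sigmoid(h @ W2 + b2)                          # (linear layer, activation function)

y = 1.0                                           # true label for this sample
loss = -np.log(p) if y == 1.0 else -np.log(1.0 - p)   # the loss function at the end
print(loss)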

9
Q

What are some characteristics of the Softmax function?

A
  • generalization of the Sigmoid
  • does not work properly with sparse outputs
  • does not scale properly with respect to the number of classes (k)
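
A minimal softmax sketch (the max-subtraction shown here is a standard trick to keep the exponentials from overflowing; it is an assumption that this matches how the lesson implements it):

import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical safety; does not change the result
    e = np.exp(z)
    return e / e.sum()               # generalizes the sigmoid to k classes

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())            # probabilities over k classes, summing to 1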
10
Q

When you look at the characteristics of the Softmax function, it is not (that much) better than the Sigmoid. Why do we use it then?

A

Because combined with cross entropy, you have the following:
- generalization of the Sigmoid
- becomes numerically stable (see the sketch below)
- simple, yet powerful (92% accuracy on handwritten digit recognition)
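
A hedged sketch of why the combination is numerically stable: computing log-softmax directly via the log-sum-exp trick never exponentiates into overflow or takes the log of a tiny probability (the logits and class index are made-up values):

import numpy as np

def softmax_cross_entropy(logits, target_class):
    # log softmax via log-sum-exp: log p_c = z_c - log(sum_j exp(z_j))
    shifted = logits - logits.max()                      # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_class]                      # cross entropy for this sample

logits = np.array([5.0, -2.0, 0.5])                      # raw scores for 3 classes
print(softmax_cross_entropy(logits, target_class=0))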

11
Q

In the early days of deep neural networks, it was found that adding more and more layers did not lead to better results. At some point, the loss function became flat. What was the reason?

A

By using Sigmoid, you can get the vanishing gradient effect (gradients close to 0): each saturated Sigmoid contributes a near-zero derivative, and the product of those derivatives shrinks towards 0 as the network gets deeper.
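
A small illustration of that effect: backpropagation multiplies one sigmoid derivative per layer, each factor is at most 0.25 (and near 0 when saturated), so the product shrinks rapidly with depth (the pre-activation values used here are arbitrary):

import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                       # at most 0.25, near 0 when |x| is large

pre_activations = np.full(20, 3.0)             # 20 layers, all mildly saturated
gradient_factor = np.prod(sigmoid_derivative(pre_activations))
print(gradient_factor)                         # ~1e-27: the gradient has vanished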

12
Q

A new activation function was introduced: ReLU. What does it do and what do you know about it?

A

ReLU = Rectified Linear Unit; it takes the max of the input and 0

  • point-wise operation
  • not linear, but piece-wise linear
  • cuts the space into polyhedra
  • dead neurons can occur
  • not differentiable at 0
  • derivatives do not vanish (see the sketch below)
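
A minimal ReLU sketch showing the point-wise max with 0, the piece-wise linearity, and the common convention of using derivative 0 at the non-differentiable point x = 0 (that convention is an assumption; frameworks differ):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)          # point-wise max(input, 0)

def relu_derivative(x):
    return (x > 0).astype(float)       # 1 where active, 0 where the neuron is "dead"

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs))              # [0. 0. 0. 0.5 2.]
print(relu_derivative(xs))   # [0. 0. 0. 1. 1.]  (does not vanish on the positive side)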
13
Q

What can be the problem of having (too many) dead neurons?

A

You lose performance; you cannot make an informative decision anymore.

Dead neurons: a flow of activation enters, but the output always comes out as 0.

14
Q

What is the Universal Approximation Theorem?

A

Given a continuous function from the hypercube to a single real value,
a large enough network can approximate it (up to some error epsilon), but not represent it exactly.

–> does not provide guarantees over the “learnability” of such a network
–> the size of the network grows exponentially w.r.t. the input dimension
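
A hedged LaTeX statement of the theorem in one common single-hidden-layer form (the exact formulation and symbols are standard textbook choices, not taken from this lesson):

\forall f \in C\big([0,1]^d, \mathbb{R}\big),\ \forall \varepsilon > 0,\ \exists N,\ w_i, b_i \in \mathbb{R},\ v_i \in \mathbb{R}^d \ \text{such that}
\sup_{x \in [0,1]^d} \Big| f(x) - \sum_{i=1}^{N} w_i\, \sigma(v_i^{\top} x + b_i) \Big| < \varepsilon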

15
Q

What is the difference between deeper and wider architectures? Why would you choose one over the other?

A

Growth of the partitioning of the input space (number of regions):
- exponential in depth
- polynomial in width

16
Q

What is gradient descent?

A

The search for the optimal weights.

Gradient descent tells us how to update our parameters as we train, so we can minimize the loss.

After every epoch, the weights are updated, and the idea is that eventually we reach the lowest point possible for the loss, which means we have gotten the best performance based on the training data.

The gradient tells us in which direction the loss increases or decreases the fastest; since we want to minimize the loss function, we step in the direction of steepest decrease.
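
A minimal sketch of the update rule this describes, w <- w - lr * gradient, on a toy one-dimensional loss (the loss, starting point, and learning rate are illustrative assumptions):

# Gradient descent on the toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0               # initial weight
lr = 0.1              # learning rate

for step in range(50):
    grad = 2.0 * (w - 3.0)    # gradient: direction of steepest increase of the loss
    w = w - lr * grad         # step against the gradient to decrease the loss

print(w)              # close to 3.0, the weight with the lowest possible loss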

17
Q

Name some characteristics of gradient descent

A
  • works for any smooth function
  • fewer guarantees for some non-smooth targets
  • converges to a local optimum
  • critical effect of the learning rate (see the sketch below)
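
A small sketch of why the learning rate is critical (same toy loss as the gradient descent card, values are made up): too small converges slowly, moderate converges, too large diverges.

def run_gradient_descent(lr, steps=30):
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - 3.0)   # gradient of the toy loss (w - 3)^2
    return w

for lr in (0.01, 0.1, 1.5):
    print(lr, run_gradient_descent(lr))
# 0.01 -> still far from 3 (too slow), 0.1 -> ~3 (converges), 1.5 -> enormous value (diverges)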