Lesson 2 - From shallow to deep neural networks Flashcards
What does an artificial neuron do?
It defines a linear (affine) projection of the data
Why do we choose to add neurons in parallel (and by such creating layers)?
- Highly optimizable
–> algorithmically via smart matrix multiplication
–> hardware-wise via GPUs, TPUs
- Enable powerful compositions
What does it mean when you have two neurons in the same layer that have exactly the same weights? In other words, what does it mean when you have repetition?
Something is not right: the two neurons are redundant, they compute exactly the same output (see the small sketch below)
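A tiny NumPy sketch (my own illustration, not from the lesson) of that redundancy: if two rows of a layer's weight matrix are identical, the corresponding neurons always produce identical outputs, so one of them adds no information:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # 5 samples, 3 input features

W = rng.normal(size=(4, 3))    # weight matrix of a layer with 4 neurons
W[1] = W[0]                    # neuron 1 gets exactly the same weights as neuron 0
b = np.zeros(4)

out = x @ W.T + b              # affine projection of the data
print(np.allclose(out[:, 0], out[:, 1]))   # True: the two neurons always agree
```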
How do you define an epoch?
An epoch is one full pass over the training data: the model has seen the entire dataset once
Why don’t we still use the perceptron learning algorithm?
It had some requirements:
- examples (with labels)
- a way to evaluate the goodness of the model (measure performance)
- stopping criteria
In particular, its stopping criterion is only guaranteed to be met when the data is linearly separable, and the update rule does not extend to networks with hidden layers, which is why training now minimizes a differentiable loss instead
What is the Sigmoid function?
It is an activation function that projects your input to a value in the range of [0, 1]
- Introduces non-linear behavior
- Scaled output [0, 1]
- Simple derivatives
- Saturates: vanishing derivatives
- Applied point-wise
Very smooth from about -3 to 3, then basically flat (saturated)
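A minimal NumPy sketch (my own, not from the slides) of the Sigmoid and its derivative, showing the [0, 1] output, the simple derivative, and the saturation (vanishing derivative) for large |x|:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + exp(-x)), applied point-wise
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # simple derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
print(sigmoid(x))             # squashed into (0, 1)
print(sigmoid_derivative(x))  # near 0 for large |x|: saturation / vanishing derivative
```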
What are some characteristics of the Cross Entropy function?
It is a loss function
- the negated logarithm of the probability of the correct prediction (summed over the entire dataset)
- composable with Sigmoid
- Numerically unstable (but there is a way to address this)
- additive with respect to samples
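A small sketch (mine, with made-up example probabilities) of the cross entropy on predicted probabilities, showing that it is additive over samples and numerically unstable when the correct class gets probability close to 0:

```python
import numpy as np

def cross_entropy(probs, labels):
    # probs: (n_samples, n_classes) predicted probabilities
    # labels: (n_samples,) integer class indices
    # loss = -log(probability assigned to the correct class), summed over samples
    correct = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log(correct))

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))        # additive: -log(0.9) - log(0.8)

# numerically unstable: if the correct class gets probability ~0, log blows up
bad = np.array([[1.0, 0.0]])
print(cross_entropy(bad, np.array([1])))   # inf
```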
What is the basic order of layers in a basic neural network?
first, some number of repetitions of the following combination:
(linear layer, activation function)
then, at the end, you have the loss function
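A minimal forward-pass sketch of that layer order; the layer sizes and random weights are placeholder assumptions, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# assumed sizes: 4 inputs -> 8 hidden units -> 3 classes
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

x = rng.normal(size=(2, 4))        # batch of 2 samples
h = sigmoid(x @ W1 + b1)           # (linear layer, activation function)
logits = h @ W2 + b2               # final linear layer

# at the end: the loss function (here a naive softmax + cross entropy)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = np.array([0, 2])
loss = -np.log(probs[np.arange(2), labels]).sum()
print(loss)
```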
What are some characteristics of the Softmax function?
- generalization of the Sigmoid
- does not work properly with sparse outputs
- does not scale properly with respect to the number of classes (k)
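A small sketch (mine) of the Softmax, including a check that on the logits [x, 0] it reduces exactly to the Sigmoid of x, i.e. it generalizes the Sigmoid:

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - z.max())   # shift by the max for numerical safety
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.7
print(softmax(np.array([x, 0.0]))[0])  # equals sigmoid(x): softmax generalizes it
print(sigmoid(x))
```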
When you look at the characteristics of the Softmax function, it does not seem (that much) better than the Sigmoid. Why do we use it then?
Because combined with cross entropy, you have the following:
- generalization of the sigmoid
- becomes numerically stable
- simple, yet powerful (92% handwritten digit recognition)
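A hedged sketch of why the combination becomes numerically stable: the softmax and the logarithm of the cross entropy are fused via the log-sum-exp shift, so tiny probabilities and huge exponentials are never formed explicitly:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # fused softmax + cross entropy:
    # loss_i = log(sum_j exp(z_ij)) - z_i,label, computed on shifted logits
    shifted = logits - logits.max(axis=1, keepdims=True)      # log-sum-exp shift
    log_sum = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_sum
    return -log_probs[np.arange(len(labels)), labels].sum()

# large logits overflow a naive softmax, but the fused version is fine
logits = np.array([[1000.0, 0.0, -1000.0]])
labels = np.array([0])
print(softmax_cross_entropy(logits, labels))   # ~0.0, no overflow
```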
In the early days of deep neural networks, it was found that adding more and more layers did not lead to better results. At some point, the loss function became flat. What was the reason?
By using the Sigmoid, you can get the vanishing gradient effect (gradient close to 0):
the Sigmoid saturates, so its derivative is near 0, and backpropagation multiplies many of these small derivatives together, driving the gradients of the early layers towards 0
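A small numeric illustration (my own, assuming identity weights between layers) of that effect: the Sigmoid's derivative is at most 0.25, so chaining it over many layers multiplies many small factors and the gradient shrinks towards 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# chain many sigmoid layers (assume the linear layers in between are the identity)
x = 0.5
grad = 1.0
for depth in range(1, 21):
    x = sigmoid(x)
    grad *= x * (1.0 - x)      # chain rule: multiply by sigmoid'(previous value)
    if depth in (1, 5, 10, 20):
        print(depth, grad)      # the gradient factor collapses towards 0 with depth
```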
A new activation function was introduced: ReLU. What does it do and what do you know about it?
ReLU = Rectified Linear Unit, takes the max of the input and 0
- point-wise operation
- not linear, but piece-wise linear
- cuts the input space into polyhedra
- dead neurons can occur
- not differentiable at 0
- derivatives do not vanish
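A minimal sketch of ReLU and the (sub)gradient that is typically used; taking 0 at the non-differentiable point x = 0 is a common convention, not the only choice:

```python
import numpy as np

def relu(x):
    # point-wise: max(x, 0)
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 1 where x > 0, 0 elsewhere; at x == 0 we pick 0 by convention
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs are clipped to 0
print(relu_grad(x))  # derivative is 0 or 1, so it does not vanish for active units
```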
What can be the problem of having (too many) dead neurons?
You lose performance; the network can no longer make an informative decision
Dead neurons: a flow of activation enters, but what comes out is always 0 (so no gradient flows back through them)
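A tiny illustration (a constructed example, not from the lesson) of a dead ReLU neuron: a strongly negative bias keeps the pre-activation below 0 for every input, so the output, and the gradient through the neuron, is always 0:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))    # some input data
w = np.array([0.1, -0.2, 0.3])
b = -50.0                         # assumed: a bias pushed far into the negative

pre = x @ w + b
out = relu(pre)
print(out.max())            # 0.0 for every sample: the neuron is dead
print((pre > 0).mean())     # fraction of inputs where any gradient flows: 0.0
```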
What is the Universal Approximation Theorem?
Given a continuous function from the hypercube to a single real value,
a large enough network can approximate it (up to some error epsilon), but not necessarily represent it exactly.
–> does not provide guarantees about the "learnability" of such a network
–> the size of the network can grow exponentially w.r.t. the input dimension
What is the difference between deeper and wider architectures? Why would you choose one over the other?
Growth of the partitioning of the input space (number of linear regions):
- exponential in depth
- polynomial in width
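A hedged illustration (the classic "sawtooth" depth construction, not from the slides) of that difference: stacking a 2-neuron ReLU "hat" block d times produces 2^d linear pieces with only about 2d neurons, while a single hidden layer is known to need on the order of 2^d neurons for the same function:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # a "tent" built from 2 ReLU units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# composing the hat with itself (adding depth) doubles the number of linear
# pieces each time, while only adding 2 more ReLU neurons per extra layer
for depth in range(1, 5):
    def f(x, d=depth):
        for _ in range(d):
            x = hat(x)
        return x
    grid = np.linspace(0.0, 1.0, 2 ** depth + 1)
    print(depth, f(grid))   # values alternate 0, 1, 0, 1, ... : 2**depth pieces
```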