Lecture 6 - DNNs Flashcards
How do shallow networks differ from deep networks?
Shallow networks have fewer hidden layers (1 or 2), while deep networks have many hidden layers (e.g., 5 or more).
Why are deep neural networks considered a separate research area?
Deep networks introduce additional complexities in learning: plain gradient descent often trains them poorly (e.g., because of vanishing or exploding gradients), so specialized techniques are required.
Why are deep networks necessary for tasks like image recognition and language processing?
The additional layers and nodes in deep networks enable the creation of more complex features, which are essential for these tasks.
Why use deep neural networks instead of shallow networks that are already universal approximators?
Deep neural networks can represent many functions with far fewer hidden units, whereas shallow networks may require exponentially many hidden units to achieve the same level of approximation.
How do deep and shallow networks compare in terms of required resources for complex problems?
- Deep networks require only a logarithmic number of neurons, O(log n), for certain problems.
- Shallow networks require exponentially many neurons, O(2^n), for the same problems.
- In other words, the required number of neurons grows logarithmically for deep networks but exponentially for shallow networks.
What is an example of a problem where deep networks outperform shallow networks in terms of efficiency?
For problems like x_1 XOR x_2 XOR x_3 XOR x_4 (the parity function), deep networks can solve the problem with far fewer neurons than shallow networks (see the sketch below).
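As an illustration (not from the lecture), the sketch below computes parity the "deep" way, combining inputs pairwise over roughly log2(n) levels with only n-1 XOR operations, and counts how many pattern-detector units a "shallow" one-hidden-layer construction would need, which is exponential in n:

```python
import itertools

def deep_parity(bits):
    """'Deep' computation: XOR inputs pairwise over ~log2(n) levels,
    using only n-1 XOR operations in total."""
    level = list(bits)
    while len(level) > 1:
        pairs = [a ^ b for a, b in zip(level[::2], level[1::2])]
        if len(level) % 2:           # carry an unpaired element to the next level
            pairs.append(level[-1])
        level = pairs
    return level[0]

def shallow_parity_units(n):
    """'Shallow' construction: one hidden unit per odd-parity input pattern,
    i.e. 2^(n-1) units -- exponential in n."""
    return sum(1 for p in itertools.product([0, 1], repeat=n) if sum(p) % 2)

print(deep_parity([1, 0, 1, 1]))   # 1  (x_1 XOR x_2 XOR x_3 XOR x_4)
print(shallow_parity_units(4))     # 8  (= 2^(4-1) hidden units)
```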
What is forward propagation in DNNs?
Forward propagation is the process of passing input data through the layers of a neural network to compute the final output.
What are the two main operations performed by each layer in a neural network during forward propagation?
- linear transformation: Inputs from the previous layer are multiplied by a weight matrix and added to a bias vector, resulting in the pre-activation values.
- non-linear activation: The pre-activation values are passed through an activation function to compute the layer's activations, which become the output passed to the next layer (see the sketch below).
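A minimal sketch of one layer's forward step, assuming ReLU as the activation; the toy dimensions are arbitrary and not from the lecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward_layer(a_prev, W, b, activation=relu):
    """One forward-propagation step: linear transformation followed by
    a non-linear activation."""
    z = W @ a_prev + b      # pre-activation values (weights * inputs + bias)
    a = activation(z)       # activations passed on to the next layer
    return a, z             # z is cached because backward propagation needs it

# Toy dimensions: 3 inputs -> 4 hidden units, batch of 5 examples
rng = np.random.default_rng(0)
a0 = rng.standard_normal((3, 5))
W1 = rng.standard_normal((4, 3)) * 0.01
b1 = np.zeros((4, 1))
a1, z1 = forward_layer(a0, W1, b1)
```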
Why is a non-linear activation function important in forward propagation?
The non-linear activation function introduces non-linearity, enabling the neural network to learn and model complex patterns.
What is backward propagation in DNNs?
Backward propagation is the process used to calculate the gradient of the loss function with respect to the weights and biases in the network. The gradients are used to update the parameters via optimization (e.g., gradient descent).
What is the starting point for backward propagation?
The starting point is the derivative of the loss with respect to the output activations of the final layer L, da^[L].
What are the key steps in computing the gradient flow for each layer during backward propagation?
- compute the gradient of the pre-activation values: dz^[l] = da^[l] * g'(z^[l])
- compute the gradient of the weights: dW^[l] = (1/m) dz^[l] (a^[l-1])^T
- compute the gradient of the biases: db^[l] = (1/m) Σ dz^[l]
- backpropagate the error to the previous layer: da^[l-1] = (W^[l])^T dz^[l]
(these four steps are sketched in code below)
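A hedged sketch of these four steps for a single hidden layer, assuming a ReLU activation (so g'(z) is 1 where z > 0 and 0 elsewhere) and m training examples per batch:

```python
import numpy as np

def backward_layer(da_l, z_l, a_prev, W_l):
    """Gradient flow through one ReLU layer, following the four steps:
    dz^[l], dW^[l], db^[l], and da^[l-1]."""
    m = a_prev.shape[1]                               # examples in the batch
    dz_l = da_l * (z_l > 0)                           # 1) dz^[l] = da^[l] * g'(z^[l])
    dW_l = (dz_l @ a_prev.T) / m                      # 2) gradient of the weights
    db_l = np.sum(dz_l, axis=1, keepdims=True) / m    # 3) gradient of the biases
    da_prev = W_l.T @ dz_l                            # 4) error sent to layer l-1
    return dz_l, dW_l, db_l, da_prev
```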
What are hyperparameters in a neural network?
Hyperparameters are parameters set before training a neural network, influencing the learning process and performance of the model.
How does the number of layers affect a neural network?
- The number of layers refers to the depth of the network.
- More layers allow the network to learn complex hierarchical features but increase computational complexity and the risk of overfitting.
What does the number of units per layer determine in a neural network?
The number of units per layer determines the width of the network. More units increase the model’s capacity but may also lead to overfitting.
Why is learning rate an important hyperparameter in gradient descent?
- The learning rate controls the step size during gradient descent.
- A high learning rate might overshoot the minimum, while a low learning rate can make training slow (see the toy example below).
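A toy illustration (not from the lecture) minimizing f(w) = w^2, whose gradient is 2w, to show how the step size changes the outcome:

```python
def gradient_descent(lr, steps=10, w=1.0):
    """Minimize f(w) = w^2 with plain gradient descent to illustrate
    the effect of the learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w     # gradient of w^2 is 2w
    return w

print(gradient_descent(lr=0.01))   # small steps: still far from the minimum at 0
print(gradient_descent(lr=0.4))    # reasonable step size: converges close to 0
print(gradient_descent(lr=1.1))    # too large: overshoots and diverges
```

With lr = 0.01 the iterate barely moves, with lr = 0.4 it quickly approaches 0, and with lr = 1.1 each step overshoots so the iterate grows without bound.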
What role do activation functions play in neural networks?
- Activation functions define the transformation applied to the input at each layer.
- Common choices include ReLU, Sigmoid, and Tanh (defined in the sketch below).
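A minimal sketch of these three standard activations:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)          # max(0, z), zero for negative inputs

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # squashes inputs into (0, 1)

def tanh(z):
    return np.tanh(z)                # squashes inputs into (-1, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))      # [0. 0. 2.]
print(sigmoid(z))   # approximately [0.119 0.5 0.881]
print(tanh(z))      # approximately [-0.964 0. 0.964]
```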
What are important hyperparameters to consider when training a neural network?
- layers
- units
- learning rate
- activation functions
- batch size
- weight initialization strategies
- dropout rate
- optimizers
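The dictionary below collects these hyperparameters into one hypothetical configuration; all of the specific values are illustrative assumptions, not values from the lecture:

```python
# Hypothetical hyperparameter configuration; the values are illustrative only.
hyperparameters = {
    "num_layers": 4,                       # depth of the network
    "units_per_layer": [128, 64, 32, 10],  # width of each layer
    "learning_rate": 0.001,                # step size for gradient descent
    "activation": "relu",                  # hidden-layer activation function
    "batch_size": 64,                      # examples per gradient update
    "weight_init": "he_normal",            # weight initialization strategy
    "dropout_rate": 0.2,                   # fraction of units dropped during training
    "optimizer": "adam",                   # optimization algorithm
}
```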
How should datasets be divided to evaluate and optimize hyperparameters?
- Training set: Used to train the model.
- Development set: Used to tune the hyperparameters.
- Test set: Used to evaluate the final performance of the model (a split helper is sketched below).
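A minimal sketch of such a split; the 80/10/10 fractions are an illustrative assumption, since typical splits depend on dataset size:

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the data and split it into training, development, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev, n_test = int(len(X) * dev_frac), int(len(X) * test_frac)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])
```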
What are bias and variance in the context of neural networks?
- Bias: Error due to overly simplistic assumptions in the model. High bias indicates underfitting.
- Variance: Error due to high sensitivity to small fluctuations in the training data. High variance indicates overfitting.
Why should you address bias before variance in a model?
- First address bias: high bias means the model approximates the data poorly regardless of variance.
- Once bias is acceptable, decrease the variance.
- Since decreasing variance can increase bias, it is important to check the bias again afterwards.
What are the characteristics of a model with high variance?
- A high-variance model performs well on the training set but poorly on the validation/dev set.
- Example: Train set error = 1%, Dev set error = 11%.
How can you reduce high variance in a neural network?
- Increase the size of the training data.
- Apply regularization (e.g., an L2 weight penalty; see the sketch below).
- Adjust the network architecture (e.g., a smaller network or dropout can reduce overfitting).
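As one concrete form of regularization, the sketch below adds the standard L2 penalty, (lambda / 2m) * sum ||W||^2, to an unregularized loss; the penalty strength lambda is itself a hyperparameter:

```python
import numpy as np

def l2_regularized_loss(loss, weights, lam, m):
    """Add an L2 weight penalty to the unregularized loss; penalizing large
    weights discourages overfitting and so helps reduce variance."""
    penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return loss + penalty
```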
What are the characteristics of a model with high bias?
- A high-bias model performs poorly on both the training and validation/dev sets.
- Example: Train set error = 15%, Dev set error = 16%.
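A toy helper (the thresholds and the assumed target error of 0 are illustrative, not from the lecture) that turns the two example error pairs above into a rough diagnosis:

```python
def diagnose(train_err, dev_err, target_err=0.0):
    """Rough bias/variance diagnosis from train and dev errors."""
    high_bias = (train_err - target_err) > 0.05    # poor fit even on training data
    high_variance = (dev_err - train_err) > 0.05   # fails to generalize to dev data
    return high_bias, high_variance

print(diagnose(0.01, 0.11))   # (False, True)  -> high variance (overfitting)
print(diagnose(0.15, 0.16))   # (True, False)  -> high bias (underfitting)
```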