lecture 6 - DNNs Flashcards
How do shallow networks differ from deep networks?
Shallow networks have fewer hidden layers (1 or 2), while deep networks have many hidden layers (e.g., 5 or more).
Why are deep neural networks considered a separate research area?
Deep networks come with additional complexities in learning: plain gradient descent often struggles in very deep models (for example, because of vanishing or exploding gradients), so specialized techniques are required.
Why are deep networks necessary for tasks like image recognition and language processing?
The additional layers and nodes in deep networks enable the creation of more complex features, which are essential for these tasks.
Why use deep neural networks instead of shallow networks that are already universal approximators?
Deep neural networks can compute many functions with far fewer neurons overall, whereas shallow networks require exponentially more hidden units to achieve the same level of approximation.
How do deep and shallow networks compare in terms of required resources for complex problems?
- Deep networks require only a logarithmic number of neurons O(log n) for certain problems.
- Shallow networks require exponentially many neurons, O(2^n), for the same problems.
- Deep networks exhibit logarithmic growth, whereas shallow networks exhibit exponential growth in the number of required neurons.
What is an example of a problem where deep networks outperform shallow networks in terms of efficiency?
For problems like x_1 XOR x_2 XOR x_3 XOR x_4 (the parity function), deep networks can solve the problem with far fewer neurons than shallow networks.
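As a rough illustration (not from the lecture, and assuming each pairwise XOR can be realized by a small fixed-size sub-network), a minimal Python sketch of the deep approach: inputs are combined by a balanced tree of pairwise XOR units, so the depth grows only logarithmically with the number of inputs, whereas a single hidden layer would effectively need a unit per input pattern.

```python
import numpy as np

def xor_unit(a, b):
    # One "neuron" computing XOR of two binary inputs
    # (itself realizable by a tiny fixed-size sub-network).
    return (a + b) % 2

def deep_parity(x):
    """Parity (chained XOR) of n bits via a balanced tree of
    pairwise XOR units: depth ~ log2(n), about n - 1 units total."""
    layer = list(x)
    while len(layer) > 1:
        nxt = [xor_unit(layer[i], layer[i + 1])
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:        # carry an unpaired bit up to the next layer
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

print(deep_parity(np.array([1, 0, 1, 1])))  # 1 == x_1 XOR x_2 XOR x_3 XOR x_4
```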
What is forward propagation in DNNs?
Forward propagation is the process of passing input data through the layers of a neural network to compute the final output.
What are the two main operations performed by each layer in a neural network during forward propagation?
- Linear transformation: inputs from the previous layer are multiplied by a weight matrix and a bias vector is added, producing the pre-activation values.
- Non-linear activation: the pre-activation values are passed through an activation function to produce the layer's output (see the sketch after this list).
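A minimal NumPy sketch of the two operations for a single layer, assuming a ReLU activation and the column-per-example convention (these choices are illustrative, not fixed by the lecture):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward_step(a_prev, W, b, activation=relu):
    """One layer of forward propagation."""
    z = W @ a_prev + b       # linear transformation: pre-activation values
    a = activation(z)        # non-linear activation: the layer's output
    return a, z              # z is typically cached for backward propagation

# Example: a layer with 3 inputs and 4 units, on a batch of 5 examples
a_prev = np.random.randn(3, 5)
W, b = np.random.randn(4, 3), np.zeros((4, 1))
a, _ = forward_step(a_prev, W, b)
print(a.shape)  # (4, 5)
```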
Why is a non-linear activation function important in forward propagation?
The non-linear activation function introduces non-linearity, enabling the neural network to learn and model complex patterns.
What is backward propagation in DNNs?
Backward propagation is the process used to calculate the gradient of the loss function with respect to the weights and biases in the network. The gradients are used to update the parameters via optimization (e.g., gradient descent).
What is the starting point for backward propagation?
The starting point is the derivative of the loss with respect to the output of the final layer L, da^[L].
What are the key steps in computing the gradient flow for each layer during backward propagation?
- compute the gradient of the pre-activation values (dz^[l])
- compute the gradient of the weights (dW^[l])
- compute the gradient of the biases (db^[l])
- backpropagate the error to the previous layer (da^[l-1]); a sketch of these steps follows below
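A minimal NumPy sketch of these four steps for one layer, assuming a ReLU activation and that z^[l], a^[l-1], and W^[l] were cached during the forward pass (m is the batch size):

```python
import numpy as np

def relu_grad(z):
    # derivative of ReLU: 1 where z > 0, else 0
    return (z > 0).astype(float)

def backward_step(da, z, a_prev, W):
    """Gradient flow for one layer (ReLU assumed)."""
    m = a_prev.shape[1]
    dz = da * relu_grad(z)                        # gradient of the pre-activations
    dW = (dz @ a_prev.T) / m                      # gradient of the weights
    db = np.sum(dz, axis=1, keepdims=True) / m    # gradient of the biases
    da_prev = W.T @ dz                            # error passed to the previous layer
    return dW, db, da_prev
```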
What are hyperparameters in a neural network?
Hyperparameters are parameters set before training a neural network, influencing the learning process and performance of the model.
How does the number of layers affect a neural network?
- The number of layers refers to the depth of the network.
- More layers allow the network to learn complex hierarchical features but increase computational complexity and the risk of overfitting.
What does the number of units per layer determine in a neural network?
The number of units per layer determines the width of the network. More units increase the model’s capacity but may also lead to overfitting.
Why is learning rate an important hyperparameter in gradient descent?
- The learning rate controls the step size during gradient descent.
- A high learning rate might overshoot the minimum, while a low learning rate can make training slow (see the toy sketch below).
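A toy sketch (not from the lecture) of gradient descent on f(w) = w^2, whose gradient is 2w, showing how the choice of learning rate changes the behaviour:

```python
def gradient_descent(lr, w=1.0, steps=20):
    """Minimize f(w) = w^2 starting from w = 1.0."""
    for _ in range(steps):
        w = w - lr * 2 * w        # update: w := w - lr * df/dw
    return w

print(gradient_descent(lr=0.01))  # ~0.67: tiny steps, still far from the minimum
print(gradient_descent(lr=0.4))   # ~1e-14: converges to the minimum quickly
print(gradient_descent(lr=1.1))   # ~38: each step overshoots and diverges
```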
What role do activation functions play in neural networks?
- Activation functions define the non-linear transformation applied to each layer's pre-activation values.
- Common choices include ReLU, Sigmoid, and Tanh (defined in the sketch below).
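Minimal NumPy definitions of the three activation functions named above:

```python
import numpy as np

def relu(z):
    # max(0, z): cheap to compute; gradient is 1 for positive inputs
    return np.maximum(0, z)

def sigmoid(z):
    # squashes values into (0, 1); common in binary output layers
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # squashes values into (-1, 1); a zero-centred alternative to the sigmoid
    return np.tanh(z)
```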
Important hyperparameters
- layers
- units
- learning rate
- activation functions
- batch size
- weight initialization strategies
- dropout rate
- optimizers
How should datasets be divided to evaluate and optimize hyperparameters?
- Training set: Used to train the model.
- Development set: Used to tune the hyperparameters.
- Test set: Used to evaluate the performance of the algorithm (a splitting sketch follows this list).
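A minimal NumPy sketch of such a split; the 80/10/10 ratio and the shuffling seed are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the data and split it into training, development, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev, n_test = int(len(X) * dev_frac), int(len(X) * test_frac)
    dev, test, train = idx[:n_dev], idx[n_dev:n_dev + n_test], idx[n_dev + n_test:]
    return (X[train], y[train]), (X[dev], y[dev]), (X[test], y[test])
```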
What are bias and variance in the context of neural networks?
- Bias: Error due to overly simplistic assumptions in the model. High bias indicates underfitting.
- Variance: Error due to high sensitivity to small fluctuations in the training data. High variance indicates overfitting.
Why should you address bias before variance in a model?
- High bias means the model approximates the data poorly regardless of variance, so reduce the bias first.
- Then decrease the variance.
- Since decreasing variance can increase bias, it is important to check the bias again afterwards.
What are the characteristics of a model with high variance?
- A high-variance model performs well on the training set but poorly on the validation/dev set.
- Example: Train set error = 1%, Dev set error = 11%.
How can you reduce high variance in a neural network?
- Increase the size of the training data.
- Apply regularization.
- Adjust the network architecture (e.g., add more layers or units).
What are the characteristics of a model with high bias?
- A high-bias model performs poorly on both the training and validation/dev sets.
- Example: Train set error = 15%, Dev set error = 16%.
How can you reduce high bias in a neural network?
- Use a larger network.
- Train for a longer duration.
- Adjust the architecture (e.g., add more layers or units).
What are the characteristics of a model with both high bias and high variance?
- The model has a large error on the training set and an even larger error on the dev set, indicating that it neither fits the training data well nor generalizes.
- Example: Train set error = 15%, Dev set error = 30%.
How can you address high bias and high variance?
Address the bias first (by increasing model capacity or training longer) and then reduce variance (by applying regularization or using more data).
What are the characteristics of a model with low bias and low variance?
- This is the ideal case where both training and dev errors are low, with the dev set error slightly higher than the training set error.
- Example: Train set error = 0.5%, Dev set error = 1%.
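Pulling the four cases above together, a rough diagnostic sketch; the 1% target error used as a threshold is an assumption for illustration, not a value from the lecture:

```python
def diagnose(train_err, dev_err, target_err=0.01):
    """Rough bias/variance diagnosis from training and dev set errors."""
    high_bias = train_err > target_err                    # poor fit on the training data
    high_variance = (dev_err - train_err) > target_err    # large train/dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))   # high variance (overfitting)
print(diagnose(0.15, 0.16))   # high bias (underfitting)
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias and low variance
```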
high variance plot
A boundary that circles exactly around all of the target points, fitting even the outlier (overfitting).
high bias plot
A rigid straight or slanted line that only roughly captures most of the points (underfitting).
high bias high variance plot
A straight line that makes an exception for the outlying point.
low bias low variance plot
A boundary that circles around most of the target points but leaves the outlying point out.
What is dropout regularization in neural networks?
Dropout is a regularization technique used to prevent overfitting by randomly dropping out a proportion of neurons during the training process. The dropped-out neurons do not participate in forward and backward propagation.
How does dropout regularization help in preventing overfitting?
By randomly turning off neurons, dropout forces the network to distribute the learned weights across different neurons, preventing any single neuron from becoming too dominant and improving the model’s generalization ability.
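A minimal sketch of inverted dropout (the common variant, assumed here) applied to one layer's activations during training; keep_prob is the fraction of neurons kept:

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True):
    """Randomly drop a proportion (1 - keep_prob) of the activations."""
    if not training:
        return a, None                 # no dropout at test time
    mask = np.random.rand(*a.shape) < keep_prob
    a = a * mask                       # dropped neurons output 0
    a = a / keep_prob                  # rescale so the expected activation is unchanged
    return a, mask
```

The same mask is applied to the incoming gradients during backward propagation, so dropped neurons receive no updates for that iteration.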
Why is it impractical to use a separate regularization term for every layer in deeper neural networks?
As the number of layers increases, tuning a regularization term for each layer becomes impractical because it adds too many hyperparameters to tune. Dropout addresses this issue without adding extra hyperparameters.
What do L2 regularization and dropout target?
- L2 Regularization: Prevents overly large weights.
- Dropout: Prevents reliance on specific neurons.
What is early stopping in neural network training?
- An alternative to explicit regularization.
- Early stopping involves halting training at the point where the validation (dev) error is smallest, preventing overfitting by not allowing the model to train further once it starts overfitting (see the sketch below).
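A minimal sketch of the early-stopping logic. The train_one_epoch and dev_error helpers and the model's get_weights/set_weights methods are hypothetical placeholders, and the patience of 5 epochs is an illustrative choice:

```python
def train_with_early_stopping(model, max_epochs=200, patience=5):
    """Stop once the dev error has not improved for `patience` epochs."""
    best_err, best_weights, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)            # hypothetical: one pass over the training set
        err = dev_error(model)            # hypothetical: error on the dev set
        if err < best_err:
            best_err, best_weights = err, model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # dev error stopped improving
    model.set_weights(best_weights)       # restore the best checkpoint
    return model
```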
How does data augmentation help in reducing overfitting?
Data augmentation artificially increases the size and diversity of the training dataset by applying transformations (e.g., rotations, translations) to the data, making the model more robust to variations.
What is the key benefit of data augmentation?
It helps models learn to ignore irrelevant variations, such as minor shifts or distortions, and focus on the important features that define the object or class.
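A minimal NumPy sketch of two simple augmentations for a 2-D image array (horizontal flip and a small horizontal shift); the particular transformations are illustrative:

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """Return a randomly flipped and shifted copy of a 2-D image array."""
    if rng.random() < 0.5:
        image = np.fliplr(image)              # random horizontal flip
    shift = int(rng.integers(-2, 3))          # shift by -2..2 pixels
    image = np.roll(image, shift, axis=1)     # small horizontal translation
    return image
```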
Why is input normalization important for optimization using gradient descent?
Without normalization, input features with varying scales can lead to elongated cost function contours, making gradient descent inefficient as it must zig-zag through narrow valleys to converge.
How does normalizing the inputs affect the cost function contours?
Normalizing inputs (scaling to have a mean of 0 and variance of 1) results in circular contours, allowing gradient descent to converge faster by following a more direct optimization path.
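A minimal sketch of input normalization; reusing the training-set mean and variance for the dev/test data is standard practice and assumed here:

```python
import numpy as np

def normalize(X_train, X_test):
    """Scale features to zero mean and unit variance using training-set statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8    # small constant avoids division by zero
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```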