Loss and Learning Mechanisms Flashcards
Why are activation functions necessary in neural networks?
They introduce non-linearity, allowing neural networks to learn complex patterns and perform better on real-world tasks.
What is a linear activation function?
A function where the output is a linear transformation of the input, limiting the model’s ability to learn complex patterns.
What are the benefits of non-linear activation functions?
They make stacking layers meaningful: without non-linearity, any number of linear layers collapses into a single linear transformation, so deep networks could only learn linear functions.
What is the Sigmoid activation function’s main drawback?
It suffers from the vanishing gradient problem, where values far from 0 have very small gradients, hindering learning.
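A minimal sketch of this: the sigmoid's derivative is s(x)(1 - s(x)), which peaks at 0.25 at x = 0 and shrinks toward zero for inputs far from 0.

```python
import math

def sigmoid(x):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x)); at most 0.25, tiny far from 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

sigmoid_grad(0)   # 0.25, the maximum possible gradient
sigmoid_grad(10)  # well below 1e-4: learning nearly stalls here
```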
What is the advantage of the Tanh function over Sigmoid?
It outputs values between -1 and 1, making it centered around zero and allowing faster convergence.
Why is ReLU the most popular activation function?
It is computationally cheap and, for positive inputs, has a constant gradient of 1, which mitigates the vanishing gradient problem; it outputs zero for negative values and the input itself for positive values.
What is the dying ReLU problem?
When a neuron's pre-activation is negative for all inputs, ReLU outputs zero and its gradient is zero, so the neuron's weights stop updating and it never recovers.
How does Leaky ReLU solve the dying ReLU problem?
It allows a small, non-zero gradient for negative inputs.
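The two functions side by side, as a minimal sketch (the leak coefficient `alpha` is commonly 0.01, but it is a tunable hyperparameter):

```python
def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: passes a small fraction of negative inputs through,
    so the gradient for x < 0 is alpha instead of exactly zero."""
    return x if x > 0 else alpha * x
```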
What is the purpose of a loss function?
It quantifies how well a neural network’s predictions match the ground truth, guiding the training process.
How is loss computed for an entire training set?
By averaging individual loss values over all training examples.
What is Log Loss?
Also known as cross-entropy loss, it penalizes confident wrong predictions heavily: the loss grows without bound as the predicted probability of the true class approaches zero.
What is Binary Cross-Entropy loss?
It measures the difference between actual and predicted class probabilities for binary classification problems.
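A minimal single-example sketch (the `eps` clipping is an implementation detail added here to avoid log(0)):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """BCE for one example: y_true is 0 or 1, y_pred is the predicted
    probability of class 1. Clip to avoid log(0)."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

binary_cross_entropy(1, 0.9)  # small loss: confident and correct
binary_cross_entropy(0, 0.9)  # large loss: confident and wrong
```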
What is Multi-class Cross-Entropy loss?
It extends Binary Cross-Entropy to multiple classes by summing losses over all possible classes.
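The multi-class sum can be sketched like so; with one-hot labels, only the true class's term is non-zero:

```python
import math

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Multi-class cross-entropy for one example: sum of
    -t_c * log(p_c) over all classes c."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(y_true_onehot, y_pred_probs))

cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])  # equals -log(0.8)
```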
What are the steps of gradient-based optimization?
- Run forward pass
- Compute loss
- Compute gradients
- Update weights using gradients
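The four steps above can be sketched on a toy one-parameter problem, where the forward pass is just evaluating f(w) = (w - 3)^2 and the loss is f itself:

```python
def gradient_step(w, grad, lr):
    """Move the weight opposite the gradient, scaled by the learning rate."""
    return w - lr * grad

w = 0.0
for _ in range(100):
    loss = (w - 3) ** 2     # forward pass + loss
    grad = 2 * (w - 3)      # gradient of the loss w.r.t. w
    w = gradient_step(w, grad, lr=0.1)  # weight update
# w has converged very close to the minimizer, 3
```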
Why do we use gradients in optimization?
Gradients indicate the steepest ascent direction; moving in the opposite direction minimizes loss.
What is the difference between gradient descent and stochastic gradient descent (SGD)?
Gradient descent updates weights using the entire dataset, while SGD updates weights per sample, making it computationally efficient but noisier.
What is mini-batch stochastic gradient descent?
A compromise between gradient descent and SGD, where updates are made using small batches instead of single samples.
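A minimal sketch of the batching itself: shuffle once per epoch, then yield consecutive slices (the last batch may be smaller).

```python
import random

def minibatches(data, batch_size):
    """Shuffle a copy of the dataset, then yield slices of batch_size."""
    data = data[:]
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
```

Each epoch, the update step from gradient descent would run once per batch instead of once per dataset (full-batch) or once per sample (pure SGD).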
What is overfitting in deep learning?
When a model learns training data too well, including noise, leading to poor generalization.
What is dropout, and how does it help?
Dropout randomly deactivates neurons during training to prevent overfitting.
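A minimal sketch of inverted dropout (the common variant): each unit is zeroed with probability p during training, and survivors are scaled by 1/(1-p) so no scaling is needed at inference.

```python
import random

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each activation with probability p during
    training; scale survivors by 1/(1-p). Identity at inference."""
    if not training or p == 0:
        return activations[:]
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0
            for a in activations]
```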
How does DropConnect differ from dropout?
Instead of deactivating neurons, DropConnect removes individual connections, providing stronger regularization.
What are the key hyperparameters in neural networks?
Learning rate, batch size, number of epochs, and number of hidden neurons.
What is the role of the learning rate?
It determines the step size for updating weights during training.
How does momentum improve gradient descent?
It accumulates a decaying average of past gradients, damping oscillations and accelerating progress along directions where gradients consistently agree.
What are popular optimizers other than SGD?
Adam and RMSprop, which use adaptive learning rates for efficient optimization.
What is the role of one-hot encoding in neural networks?
It converts categorical labels into binary vectors, allowing models to process categorical data.
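A minimal sketch, assuming the class vocabulary is known up front:

```python
def one_hot(label, classes):
    """Map a categorical label to a binary vector with a single 1
    at the label's index in the class list."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec

one_hot("dog", ["cat", "dog", "bird"])  # [0, 1, 0]
```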