Loss and Learning Mechanisms Flashcards

1
Q

Why are activation functions necessary in neural networks?

A

They introduce non-linearity, allowing neural networks to learn complex patterns and perform better on real-world tasks.

2
Q

What is a linear activation function?

A

A function where the output is a linear transformation of the input, limiting the model’s ability to learn complex patterns.

3
Q

What are the benefits of non-linear activation functions?

A

They allow stacked layers to learn complex representations (without a non-linearity, any stack of layers collapses into a single linear transformation) while remaining differentiable, so the network can still be trained with backpropagation.
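
A minimal NumPy sketch (my own illustration, not part of the original card) of the collapse argument: two purely linear layers reduce to one linear map, while inserting a ReLU between them does not.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

    # Two linear layers are equivalent to the single linear layer W2 @ W1.
    stacked = W2 @ (W1 @ x)
    collapsed = (W2 @ W1) @ x
    print(np.allclose(stacked, collapsed))       # True

    # With a ReLU in between, the composition is no longer a single linear map.
    with_relu = W2 @ np.maximum(0, W1 @ x)
    print(np.allclose(with_relu, collapsed))     # generally False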

4
Q

What is the Sigmoid activation function’s main drawback?

A

It suffers from the vanishing gradient problem, where values far from 0 have very small gradients, hindering learning.
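
A short sketch (illustrative only) of how the sigmoid's gradient shrinks away from 0:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)      # peaks at 0.25 when x = 0

    print(sigmoid_grad(np.array([0.0, 2.0, 5.0, 10.0])))
    # approx [0.25, 0.105, 0.0066, 0.000045]: the gradient vanishes for large |x|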

5
Q

What is the advantage of the Tanh function over Sigmoid?

A

It outputs values between -1 and 1, making it centered around zero and allowing faster convergence.
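
A quick numeric check (my own illustration): tanh outputs are symmetric around 0, while sigmoid outputs cluster around 0.5.

    import numpy as np

    x = np.array([-2.0, 0.0, 2.0])
    print(np.tanh(x))                  # approx [-0.96, 0.0, 0.96], zero-centered
    print(1.0 / (1.0 + np.exp(-x)))    # approx [0.12, 0.5, 0.88], centered near 0.5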

6
Q

Why is ReLU the most popular activation function?

A

It is computationally efficient and avoids the vanishing gradient problem for positive inputs, where its gradient is exactly 1; it outputs zero for negative values and the input itself for positive values.
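
A minimal definition sketch of ReLU and its gradient (illustrative only):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)   # gradient is 1 for positive inputs, 0 otherwise

    x = np.array([-3.0, -0.5, 0.5, 3.0])
    print(relu(x))        # [0.0, 0.0, 0.5, 3.0]
    print(relu_grad(x))   # [0.0, 0.0, 1.0, 1.0]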

7
Q

What is the dying ReLU problem?

A

When a neuron's input is negative for every training example, it always outputs zero and receives zero gradient, so it stops learning entirely.

8
Q

How does Leaky ReLU solve the dying ReLU problem?

A

It allows a small, non-zero gradient for negative inputs.
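
A sketch of Leaky ReLU, assuming a slope of 0.01 for negative inputs (the slope is a hyperparameter, not fixed by the card):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)    # small non-zero output (and gradient) for x < 0

    print(leaky_relu(np.array([-3.0, 0.5])))    # [-0.03, 0.5]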

9
Q

What is the purpose of a loss function?

A

It quantifies how well a neural network’s predictions match the ground truth, guiding the training process.

10
Q

How is loss computed for an entire training set?

A

By averaging individual loss values over all training examples.
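
In code this is simply a mean over per-example losses (hypothetical values, illustrative only):

    import numpy as np

    per_example_losses = np.array([0.2, 0.7, 0.1, 0.4])
    dataset_loss = np.mean(per_example_losses)    # 0.35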

11
Q

What is Log Loss?

A

Also known as cross-entropy loss, it penalizes a prediction by the negative logarithm of the probability assigned to the true class, so confident wrong predictions incur very large penalties.

12
Q

What is Binary Cross-Entropy loss?

A

It measures the difference between actual and predicted class probabilities for binary classification problems.
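
A minimal sketch of binary cross-entropy over a batch, with a small epsilon added to avoid log(0) (my own illustration):

    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        p = np.clip(y_pred, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

    y_true = np.array([1.0, 0.0, 1.0])
    y_pred = np.array([0.9, 0.2, 0.6])
    print(binary_cross_entropy(y_true, y_pred))   # about 0.28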

13
Q

What is Multi-class Cross-Entropy loss?

A

It extends Binary Cross-Entropy to multiple classes by summing the negative log-probability terms over all classes; with one-hot labels, only the true class contributes to each example's loss.
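
A sketch of the multi-class version with one-hot labels (illustrative only):

    import numpy as np

    def categorical_cross_entropy(y_onehot, y_pred, eps=1e-12):
        p = np.clip(y_pred, eps, 1.0)
        return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

    y_onehot = np.array([[0, 0, 1], [1, 0, 0]])              # true classes: 2 and 0
    y_pred = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
    print(categorical_cross_entropy(y_onehot, y_pred))       # about 0.29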

14
Q

What are the steps of gradient-based optimization?

A
  1. Run forward pass
  2. Compute loss
  3. Compute gradients
  4. Update weights using gradients
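
A minimal sketch of these four steps for a linear model with squared-error loss (the data and learning rate are my own illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

    w, lr = np.zeros(3), 0.1
    for step in range(200):
        y_pred = X @ w                              # 1. forward pass
        loss = np.mean((y_pred - y) ** 2)           # 2. compute loss
        grad = 2 * X.T @ (y_pred - y) / len(y)      # 3. compute gradients
        w -= lr * grad                              # 4. update weights using gradients
    print(w)    # close to [2.0, -1.0, 0.5]
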
15
Q

Why do we use gradients in optimization?

A

Gradients point in the direction of steepest increase of the loss; moving the weights in the opposite direction decreases the loss.
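
A tiny one-dimensional illustration (my own example): the gradient of f(w) = (w - 3)^2 points uphill, so stepping against it moves w toward the minimum at 3.

    w, lr = 0.0, 0.1
    for _ in range(50):
        grad = 2 * (w - 3)    # points in the direction of increasing f
        w -= lr * grad        # step the opposite way to decrease f
    print(w)                  # close to 3.0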

16
Q

What is the difference between gradient descent and stochastic gradient descent (SGD)?

A

Gradient descent updates weights using the entire dataset, while SGD updates weights per sample, making it computationally efficient but noisier.

17
Q

What is mini-batch stochastic gradient descent?

A

A compromise between gradient descent and SGD, where updates are made using small batches instead of single samples.
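
A sketch of the mini-batch loop, assuming a batch size of 32 and the same kind of toy linear-regression setup used above (illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.0, 2.0, -1.0])

    w, lr, batch_size = np.zeros(3), 0.05, 32
    for epoch in range(20):
        order = rng.permutation(len(X))              # reshuffle the data each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]    # one mini-batch of indices
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
            w -= lr * grad                           # one update per mini-batch
    print(w)    # close to [1.0, 2.0, -1.0]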

18
Q

What is overfitting in deep learning?

A

When a model learns training data too well, including noise, leading to poor generalization.

19
Q

What is dropout, and how does it help?

A

Dropout randomly deactivates neurons during training to prevent overfitting.
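
A sketch of (inverted) dropout applied to a layer's activations during training, assuming a drop probability of 0.5 (illustrative only):

    import numpy as np

    def dropout(activations, p_drop=0.5):
        keep = (np.random.rand(*activations.shape) >= p_drop).astype(float)
        return activations * keep / (1.0 - p_drop)   # rescale so the expected activation is unchanged

    h = np.ones(8)
    print(dropout(h))   # roughly half the entries zeroed, the rest scaled to 2.0

At test time dropout is disabled and the full network is used.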

20
Q

How does DropConnect differ from dropout?

A

Instead of deactivating entire neurons, DropConnect randomly zeroes individual weights (connections) during training, providing a stronger form of regularization.
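
For contrast, a DropConnect-style sketch masks individual weights rather than whole activations (my own illustration):

    import numpy as np

    W = np.random.randn(4, 3)                    # a layer's weight matrix
    mask = np.random.rand(*W.shape) >= 0.5       # drop each connection independently
    x = np.random.randn(3)
    out = (W * mask) @ x                         # forward pass with dropped connections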

21
Q

What are the key hyperparameters in neural networks?

A

Learning rate, batch size, number of epochs, and number of hidden neurons.

22
Q

What is the role of the learning rate?

A

It determines the step size for updating weights during training.

23
Q

How does momentum improve gradient descent?

A

It accumulates an exponentially decaying sum of past gradients, which damps oscillations and speeds progress along directions of consistent descent.
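
A sketch of the momentum update on the same toy objective f(w) = (w - 3)^2, assuming a momentum coefficient of 0.9 (illustrative only):

    w, v, lr, beta = 0.0, 0.0, 0.05, 0.9
    for _ in range(200):
        grad = 2 * (w - 3)
        v = beta * v + grad    # exponentially decaying accumulation of past gradients
        w -= lr * v            # update with the accumulated velocity
    print(w)                   # close to 3.0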

24
Q

What are popular optimizers other than SGD?

A

Adam and RMSprop, which use adaptive learning rates for efficient optimization.
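
A compact sketch of the Adam update on the same toy objective, using the commonly cited defaults beta1=0.9, beta2=0.999, eps=1e-8 (in practice a library implementation such as the optimizers shipped with PyTorch or Keras would be used):

    w, m, v = 0.0, 0.0, 0.0
    lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
    for t in range(1, 1001):
        grad = 2 * (w - 3)
        m = beta1 * m + (1 - beta1) * grad          # running mean of gradients (first moment)
        v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients (second moment)
        m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps)      # per-parameter adaptive step
    print(w)                                        # driven toward the minimum at 3.0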

25
Q

What is the role of one-hot encoding in neural networks?

A

It converts categorical labels into binary vectors, allowing models to process categorical data.
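
A minimal one-hot encoding sketch with NumPy (hypothetical labels, illustrative only):

    import numpy as np

    labels = np.array([0, 2, 1, 2])     # class indices for four examples
    one_hot = np.eye(3)[labels]         # 3 classes -> length-3 binary vectors
    print(one_hot)
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]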