Neural networks Flashcards

1
Q

What is the Universal Approximation Theorem and why is it significant in neural networks?

A

The Universal Approximation Theorem states that a feedforward neural network with at least one hidden layer and a non-linear activation function can approximate any continuous function on a compact domain to any desired level of accuracy, given enough hidden neurons. It is significant because it guarantees that neural networks are flexible enough to represent highly complex functions.

2
Q

Explain the role of activation functions in neural networks and compare the performance of different activation functions (ReLU, sigmoid, etc.).

A

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns. ReLU is computationally efficient and mitigates the vanishing gradient problem because its derivative stays at 1 for positive inputs, whereas sigmoid and tanh saturate for large-magnitude inputs, producing near-zero gradients and slow learning. A sketch comparing the two follows.

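A minimal NumPy sketch (function names are illustrative, not from the cards) comparing ReLU and sigmoid together with their derivatives; note how the sigmoid derivative collapses toward zero in the tails while the ReLU derivative stays at 1 wherever the unit is active.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(relu_grad(xs))     # [0. 0. 0. 1. 1.]  -- constant gradient where active
print(sigmoid_grad(xs))  # ~[5e-5, 0.105, 0.25, 0.105, 5e-5] -- vanishes at the tails
```
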
3
Q

How does backpropagation work, and what role does the chain rule play in its implementation?

A

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule. It allows for efficient updating of weights by propagating the error from the output layer back to the earlier layers, thus adjusting the weights iteratively.

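A hedged sketch of backpropagation for a tiny one-hidden-layer network with squared-error loss (shapes and names are illustrative): each gradient is a product of local derivatives, i.e. the chain rule applied layer by layer, starting from the output error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # one input with 3 features
y = np.array([[1.0]])                # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# forward pass
z1 = W1 @ x + b1
h = np.tanh(z1)                      # hidden activations
y_hat = W2 @ h + b2                  # linear output
loss = 0.5 * ((y_hat - y) ** 2).item()

# backward pass: chain rule, from the output error back to the first layer
d_yhat = y_hat - y                   # dL/dy_hat
dW2 = d_yhat @ h.T                   # dL/dW2 = dL/dy_hat * dy_hat/dW2
db2 = d_yhat
d_h = W2.T @ d_yhat                  # propagate the error into the hidden layer
d_z1 = d_h * (1 - np.tanh(z1) ** 2)  # through the tanh non-linearity
dW1 = d_z1 @ x.T
db1 = d_z1
```
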
4
Q

What is gradient descent, and how is it used to minimize the loss function in neural network training?

A

Gradient descent is an optimization algorithm that minimizes the loss function by iteratively updating the weights in the direction of the negative gradient of the loss. It adjusts the parameters to reduce the error between predictions and true labels.

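A minimal sketch of the gradient-descent update rule w ← w − η∇L(w), here minimizing mean squared error for a simple linear model (data and names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(2)
lr = 0.1                                   # learning rate (step size)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE w.r.t. w
    w -= lr * grad                         # step in the negative gradient direction
print(w)                                   # close to [2, -3]
```
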
5
Q

Describe the concept of learning rate in neural networks. Why is it important to select an appropriate learning rate?

A

The learning rate controls the step size in gradient descent. A high learning rate may lead to overshooting the minimum, while a low rate can result in slow convergence. It is critical to select a learning rate that balances speed and stability.

6
Q

What is stochastic gradient descent (SGD) and how does it differ from traditional gradient descent?

A

SGD updates weights based on a single training example at each iteration, while traditional (batch) gradient descent updates weights after computing the gradient over the entire dataset. SGD's updates are cheaper but noisier, while batch gradient descent is more stable but slower per update.

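An illustrative comparison on the same linear-regression setup as above (names are hypothetical): batch gradient descent computes one gradient over the whole dataset per update, while SGD makes one noisy update per example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=500)

def batch_gd(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # one gradient over the full dataset
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, epochs=5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):       # one noisy update per example
            grad = 2 * X[i] * (X[i] @ w - y[i])
            w -= lr * grad
    return w

print(batch_gd(X, y), sgd(X, y))  # both land near [1, 2, -1]; SGD with extra noise
```
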
7
Q

What is the vanishing gradient problem in deep neural networks, and why does it occur?

A

The vanishing gradient problem occurs when gradients become too small to effectively update the weights, especially in deep networks. It arises with saturating activation functions like sigmoid, whose derivatives approach zero for inputs far from zero in either direction; multiplying many such small derivatives across layers drives the gradient for the earliest layers toward zero.

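A small numerical illustration of the effect: with sigmoid activations the backpropagated gradient picks up a factor σ′(z) ≤ 0.25 per layer, so the product shrinks geometrically with depth (the pre-activations here are just random placeholders).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
grad = 1.0
for layer in range(30):
    z = rng.normal()        # pre-activation at this layer (illustrative)
    s = sigmoid(z)
    grad *= s * (1 - s)     # sigma'(z) <= 0.25, multiplied layer after layer
print(grad)                 # on the order of 1e-21: effectively zero for early layers
```
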
8
Q

Explain how kernel smoothing and local linear regression can be viewed as nonparametric methods. How are these methods similar to neural networks?

A

Both kernel smoothing and local linear regression are nonparametric methods: they estimate a function from weighted averages or weighted local fits of nearby data points, without assuming a fixed functional form. Neural networks are similarly flexible function approximators; rather than assuming a fixed form, they build the fit from many adaptive basis functions (hidden units with learned weights and activation functions).

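A hedged sketch of both estimators on synthetic data: a Nadaraya-Watson kernel smoother (weighted average of responses) and a local linear fit (weighted least squares at each query point). The bandwidth and function names are illustrative choices.

```python
import numpy as np

def gaussian_kernel(x0, x, h):
    return np.exp(-0.5 * ((x - x0) / h) ** 2)

def kernel_smooth(x0, x, y, h=0.3):
    w = gaussian_kernel(x0, x, h)
    return np.sum(w * y) / np.sum(w)            # weighted average of nearby responses

def local_linear(x0, x, y, h=0.3):
    w = gaussian_kernel(x0, x, h)
    A = np.column_stack([np.ones_like(x), x - x0])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # weighted least squares
    return beta[0]                              # fitted value at x0 is the intercept

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + 0.2 * rng.normal(size=200)
print(kernel_smooth(np.pi / 2, x, y), local_linear(np.pi / 2, x, y))  # both near sin(pi/2) = 1
```
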
9
Q

What is overfitting in neural networks, and what techniques can be employed to prevent it?

A

Overfitting occurs when the model learns noise in the training data, resulting in poor generalization to new data. Techniques like dropout, early stopping, and regularization (L1/L2) can prevent overfitting by simplifying the model.

10
Q

Describe the structure and function of a single hidden layer feedforward neural network.

A

A single hidden layer feedforward neural network consists of an input layer, a hidden layer, and an output layer. Each neuron in the hidden layer computes a weighted sum of inputs, applies an activation function, and passes the result to the output layer.

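A minimal forward pass for a single-hidden-layer network (layer sizes and names are illustrative): weighted sum, non-linearity, then the output layer combining the hidden activations.

```python
import numpy as np

rng = np.random.default_rng(5)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1          # weighted sum of inputs at each hidden neuron
    h = np.tanh(z1)           # activation function introduces non-linearity
    return W2 @ h + b2        # output layer combines the hidden activations

x = rng.normal(size=4)                          # 4 input features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # 8 hidden neurons
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)   # 1 output
print(forward(x, W1, b1, W2, b2))
```
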
11
Q

How does backpropagation compute gradients efficiently and why is it referred to as ‘error backpropagation’?

A

Backpropagation calculates gradients efficiently using the chain rule to propagate errors backward through the network. It is called ‘error backpropagation’ because it starts by calculating the error at the output and then traces it back through the layers.

12
Q

What is the impact of using dropout in neural networks, and how does it help mitigate overfitting?

A

Dropout randomly zeroes out a fraction of neuron activations during each training pass, forcing the network to rely on different subsets of neurons. This prevents overfitting by making the network more robust and less dependent on any specific neuron.

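A sketch of inverted dropout applied to a layer's activations (the drop probability and names are illustrative): each unit is zeroed with probability p during training, and the survivors are rescaled so the expected activation is unchanged; at test time the layer is left alone.

```python
import numpy as np

rng = np.random.default_rng(6)

def dropout(h, p=0.5, training=True):
    if not training:
        return h                       # dropout is disabled at inference time
    mask = rng.random(h.shape) > p     # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)        # rescale so the expected activation matches

h = np.ones(10)
print(dropout(h))                  # roughly half the units zeroed, the rest scaled up
print(dropout(h, training=False))  # unchanged at inference
```
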
13
Q

How is the loss function in a neural network defined for regression tasks, and how does gradient descent minimize it?

A

In regression tasks, the loss function is typically defined as the mean squared error between the predicted and actual values. Gradient descent minimizes this error by adjusting the weights to reduce the difference between predictions and true values.

14
Q

Explain the significance of the sigmoid function in binary classification tasks and why it might cause the vanishing gradient problem.

A

The sigmoid function is commonly used in binary classification tasks to map raw scores to probabilities in (0, 1). However, it can cause the vanishing gradient problem because its derivative is at most 0.25 and approaches zero for inputs of large magnitude in either direction, so gradients shrink as they are propagated back through many sigmoid layers.

15
Q

How does the learning rate schedule (adaptive learning rates) improve the performance of gradient descent algorithms?

A

Adaptive learning-rate methods, as used in algorithms like Adam, adjust the effective step size for each parameter based on running estimates of the gradient's first and second moments. This improves convergence by taking larger steps where gradients are consistently small and smaller, more cautious steps where they are large or noisy.

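A hedged sketch of the Adam update (hyperparameter values follow the common defaults; the toy objective is illustrative): running estimates of the gradient's mean and uncentred variance scale the step per parameter.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# usage: minimize f(w) = w^2 starting from w = 5
w, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
print(w)   # ends near 0
```
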
16
Q

What is the main purpose of using mini-batch gradient descent, and how does it balance between SGD and batch gradient descent?

A

Mini-batch gradient descent balances the trade-offs between SGD and batch gradient descent by updating weights based on a small subset of the data. This reduces noise compared to SGD and is computationally more efficient than full batch updates.

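A sketch of a mini-batch loop (batch size and data are illustrative): shuffle each epoch, then update on small slices, which averages out much of SGD's noise while staying far cheaper per step than a full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, lr, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(y))                     # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]             # one mini-batch of examples
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
print(w)   # close to [1, -2, 0.5]
```
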
17
Q

Compare the differences between kernel smoothing and local linear regression, and explain how these concepts relate to nonparametric regression.

A

Kernel smoothing uses a weighted average of nearby points to estimate a function, while local linear regression fits a linear model locally. Both methods are nonparametric and offer flexible ways to model relationships without assuming a specific form.

18
Q

How does the Universal Approximation Theorem ensure that a neural network can approximate any continuous function, given sufficient neurons in the hidden layer?

A

The Universal Approximation Theorem guarantees that, with enough neurons in the hidden layer, a neural network can approximate any continuous function on a compact domain. Intuitively, each hidden neuron contributes a simple basis function (a shifted, scaled non-linearity), and a sufficiently large weighted sum of such pieces can match any continuous target arbitrarily closely. This highlights the power of neural networks to model complex relationships.

19
Q

Explain how the concept of overfitting affects the capacity of deep neural networks and how early stopping can be used as a preventive measure.

A

Overfitting in deep neural networks occurs when the model captures noise in the training data. Early stopping monitors performance on a validation set and halts training when the model starts to overfit, improving generalization.

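A self-contained sketch of an early-stopping loop (the data, model, and patience value are illustrative): validation loss is monitored after each epoch, the best weights are kept, and training halts once validation loss has failed to improve for a fixed number of epochs.

```python
import numpy as np

rng = np.random.default_rng(8)
# noisy data split into training and validation sets
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.3 * rng.normal(size=60)
X = np.vander(x, 12)                       # deliberately flexible polynomial features
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

w, lr = np.zeros(12), 0.05
best_val, best_w, patience, bad_epochs = np.inf, w.copy(), 20, 0
for epoch in range(5000):
    w -= lr * 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # one training step
    val = np.mean((X_val @ w - y_val) ** 2)                # monitor held-out loss
    if val < best_val:
        best_val, best_w, bad_epochs = val, w.copy(), 0    # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                          # stop before overfitting
            break
print(epoch, best_val)   # epoch where training stopped and the best validation loss
```
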
20
Q

What role do bias and variance play in determining the generalization error of neural networks, and how can regularization techniques help?

A

Bias is the error from overly simple models that underfit the data, while variance is the error from sensitivity to fluctuations in the training data, typical of overly complex models. Regularization techniques like L1 and L2 reduce variance by penalizing large weights, helping to strike a balance between bias and variance.
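
A sketch of how L2 (weight-decay) regularization enters the gradient: the penalty λ‖w‖² adds 2λw to the gradient, shrinking the weights and reducing variance (the data and λ are illustrative; L1 would add λ·sign(w) instead).

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.5 * rng.normal(size=50)    # only the first feature truly matters

def fit(X, y, lam=0.0, lr=0.05, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w   # L2 penalty term
        w -= lr * grad
    return w

print(np.round(fit(X, y), 2))            # unregularized: small noisy weights on irrelevant features
print(np.round(fit(X, y, lam=1.0), 2))   # regularized: all weights shrunk toward zero
```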