Neural networks Flashcards
What is the Universal Approximation Theorem and why is it significant in neural networks?
The Universal Approximation Theorem states that a feedforward neural network with a single hidden layer and a suitable non-linear activation function can approximate any continuous function on a compact domain to any desired level of accuracy, given enough hidden units. It is significant because it guarantees that even shallow neural networks are flexible enough to represent complex functions, although it says nothing about how easily the required weights can be learned.
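A minimal illustrative sketch (not part of the original card): a one-hidden-layer network has the form f(x) = sum_k beta_k * sigma(w_k x + b_k). Here the hidden weights are random and only the output weights are fit by least squares, simply to show that a wide enough hidden layer can approximate a smooth target such as sin(x); the width, target, and random scales are assumptions for illustration.

```python
import numpy as np

# One-hidden-layer form f(x) = sum_k beta_k * sigmoid(w_k x + b_k).
# Random hidden weights, least-squares fit of the output weights only.
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

K = 50                                   # number of hidden units (assumed width)
W = rng.normal(scale=2.0, size=(1, K))   # hidden weights
b = rng.normal(scale=2.0, size=K)        # hidden biases
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # hidden activations, shape (200, K)

beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights
print("max abs error:", np.max(np.abs(H @ beta - y)))
```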
Explain the role of activation functions in neural networks and compare the performance of different activation functions (ReLU, sigmoid, etc.).
Activation functions introduce non-linearity into the network; without them, stacked layers would collapse into a single linear map. ReLU is computationally cheap and keeps a gradient of 1 for positive inputs, which mitigates the vanishing gradient problem (though units can "die" if their inputs stay negative), whereas sigmoid and tanh saturate for inputs of large magnitude, so their gradients shrink toward zero and learning slows down.
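A small sketch of the saturation point above, with assumed sample inputs:

```python
import numpy as np

# Compare activation derivatives at small and large inputs.
# Sigmoid/tanh derivatives shrink toward zero for large |z| (saturation),
# while ReLU keeps a derivative of 1 for any positive input.
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # at most 0.25, at z = 0
d_tanh = 1.0 - np.tanh(z) ** 2          # at most 1.0, at z = 0
d_relu = (z > 0).astype(float)          # 1 for z > 0, 0 otherwise

for name, d in [("sigmoid'", d_sigmoid), ("tanh'", d_tanh), ("relu'", d_relu)]:
    print(name, np.round(d, 4))
```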
How does backpropagation work, and what role does the chain rule play in its implementation?
Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule. It allows for efficient updating of weights by propagating the error from the output layer back to the earlier layers, thus adjusting the weights iteratively.
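A minimal sketch of the chain rule written out by hand for a tiny network (one input, one sigmoid hidden unit, one linear output, squared-error loss); the data point and initial weights are assumed values for illustration:

```python
import numpy as np

# Tiny network: x -> h = sigmoid(w1*x) -> yhat = w2*h, squared-error loss.
x, y = 1.5, 0.7          # one training example (assumed)
w1, w2 = 0.4, -0.3       # initial weights (assumed)

h = 1.0 / (1.0 + np.exp(-w1 * x))   # forward pass
yhat = w2 * h
loss = 0.5 * (yhat - y) ** 2

dL_dyhat = yhat - y                 # error at the output
dL_dw2 = dL_dyhat * h               # chain rule: dL/dw2 = dL/dyhat * dyhat/dw2
dL_dh = dL_dyhat * w2               # error propagated back to the hidden unit
dL_dw1 = dL_dh * h * (1 - h) * x    # chain rule through the sigmoid

print(loss, dL_dw1, dL_dw2)
```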
What is gradient descent, and how is it used to minimize the loss function in neural network training?
Gradient descent is an optimization algorithm that minimizes the loss function by iteratively updating the weights in the direction of the negative gradient of the loss. It adjusts the parameters to reduce the error between predictions and true labels.
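A sketch of plain (full-batch) gradient descent minimizing mean squared error for a one-parameter linear model; the toy data, learning rate, and step count are assumptions:

```python
import numpy as np

# Gradient descent on MSE for yhat = w * x.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 3

w, lr = 0.0, 0.1
for step in range(200):
    grad = np.mean(2 * (w * x - y) * x)   # d/dw of mean((w*x - y)^2)
    w -= lr * grad                        # step in the negative gradient direction
print(w)   # ends up close to 3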
Describe the concept of learning rate in neural networks. Why is it important to select an appropriate learning rate?
The learning rate controls the step size in gradient descent. A learning rate that is too high can cause updates to overshoot the minimum or even diverge, while one that is too low results in very slow convergence. It is critical to select a learning rate that balances speed and stability.
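A tiny sketch on the quadratic f(w) = w^2 (gradient 2w); the three learning rates are illustrative choices showing slow convergence, fast convergence, and divergence:

```python
# Effect of the learning rate when minimizing f(w) = w^2.
for lr in (0.01, 0.4, 1.1):
    w = 5.0
    for _ in range(50):
        w -= lr * 2 * w          # gradient step
    print(f"lr={lr}: w after 50 steps = {w:.4g}")
```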
What is stochastic gradient descent (SGD) and how does it differ from traditional gradient descent?
SGD updates the weights using the gradient of a single training example (or a small mini-batch) at each iteration, while traditional (full-batch) gradient descent computes the gradient over the entire dataset before each update. SGD updates are cheaper per step but noisier, while batch gradient descent is more stable per update but slower on large datasets.
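A sketch contrasting the two update rules on the same one-parameter least-squares problem; the data and hyperparameters are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

# Full-batch gradient descent: one update uses the gradient over all examples.
w_batch = 0.0
for _ in range(100):
    w_batch -= 0.1 * np.mean(2 * (w_batch * x - y) * x)

# Stochastic gradient descent: each update uses a single randomly chosen example.
w_sgd = 0.0
for _ in range(100):
    i = rng.integers(len(x))
    w_sgd -= 0.1 * 2 * (w_sgd * x[i] - y[i]) * x[i]

print(w_batch, w_sgd)   # both approach 3; the SGD estimate is noisier
```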
What is the vanishing gradient problem in deep neural networks, and why does it occur?
The vanishing gradient problem occurs when gradients become too small to meaningfully update the weights of the early layers in a deep network. During backpropagation each layer contributes a derivative factor, and with saturating activations such as sigmoid (whose derivative approaches zero for inputs of large magnitude) the product of many such factors shrinks toward zero as depth grows.
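A sketch of the geometric shrinkage across depth; the depths and pre-activation value are assumed:

```python
import numpy as np

# In a deep chain of sigmoid units, the backpropagated gradient picks up one
# sigmoid-derivative factor (at most 0.25) per layer, so the product shrinks geometrically.
def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

for depth in (5, 10, 20, 50):
    grad = np.prod([sigmoid_deriv(0.0) for _ in range(depth)])  # 0.25 ** depth
    print(f"depth {depth}: gradient factor ~ {grad:.2e}")
```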
Explain how kernel smoothing and local linear regression can be viewed as nonparametric methods. How are these methods similar to neural networks?
Both kernel smoothing and local linear regression are nonparametric methods: they estimate the regression function at a point from weighted combinations of nearby observations rather than from a fixed parametric form. Neural networks are similarly flexible function approximators, but they learn adaptive basis functions (the hidden units) instead of weighting nearby points directly; in both families, model complexity is controlled by a tuning parameter (the kernel bandwidth versus the number of hidden units and the amount of regularization).
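A sketch of a Nadaraya-Watson kernel smoother with a Gaussian kernel; the bandwidth and toy data are assumptions:

```python
import numpy as np

# The estimate at x0 is a weighted average of the observed y values,
# with weights that decay with distance from x0.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 100))
y = np.sin(x) + rng.normal(scale=0.2, size=100)

def kernel_smooth(x0, x, y, bandwidth=0.5):
    weights = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)   # Gaussian kernel weights
    return np.sum(weights * y) / np.sum(weights)           # locally weighted average

print(kernel_smooth(0.0, x, y), np.sin(0.0))   # estimate near the true value at x0 = 0
```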
What is overfitting in neural networks, and what techniques can be employed to prevent it?
Overfitting occurs when the model learns noise in the training data, resulting in poor generalization to new data. Techniques like dropout, early stopping, and regularization (L1/L2) can prevent overfitting by simplifying the model.
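A sketch of one of the regularization options above, L2 (weight decay): the penalty lam * ||w||^2 is added to the loss, which adds 2 * lam * w to the gradient and shrinks the weights; the data and hyperparameters are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=50)

lam, lr = 0.1, 0.01
w = np.zeros(10)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w   # MSE gradient + L2 penalty gradient
    w -= lr * grad
print(np.linalg.norm(w))   # shrunk relative to the unregularized fit
```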
Describe the structure and function of a single hidden layer feedforward neural network.
A single hidden layer feedforward neural network consists of an input layer, a hidden layer, and an output layer. Each neuron in the hidden layer computes a weighted sum of inputs, applies an activation function, and passes the result to the output layer.
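A sketch of one forward pass through such a network; the layer sizes, ReLU hidden activation, and identity output are illustrative choices:

```python
import numpy as np

# hidden = activation(x W1 + b1), output = hidden W2 + b2
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))        # one input with 4 features

W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden (8 units)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden -> output (1 unit)

hidden = np.maximum(0.0, x @ W1 + b1)   # weighted sum + ReLU activation
output = hidden @ W2 + b2               # output layer (identity activation for regression)
print(output)
```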
How does backpropagation compute gradients efficiently and why is it referred to as ‘error backpropagation’?
Backpropagation calculates gradients efficiently by applying the chain rule layer by layer and reusing each layer's error term when computing the one before it, so all gradients are obtained in roughly the cost of one extra pass through the network. It is called 'error backpropagation' because it starts from the error at the output layer and propagates it backward through the hidden layers.
What is the impact of using dropout in neural networks, and how does it help mitigate overfitting?
Dropout temporarily deactivates (zeroes out) a random subset of neurons on each training pass, forcing the network to rely on different subsets of neurons. This reduces overfitting by making the network more robust and less dependent on any specific neuron; at test time all neurons are used, with activations scaled so their expected value matches training.
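A sketch of inverted dropout applied to one layer's activations during training; the keep probability and activation values are assumed:

```python
import numpy as np

# Each unit is kept with probability keep_prob; kept activations are rescaled
# by 1/keep_prob so the expected value is unchanged.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1, 10))

keep_prob = 0.8
mask = (rng.uniform(size=activations.shape) < keep_prob)
dropped = activations * mask / keep_prob   # zero out ~20% of units, rescale the rest

print(mask.astype(int))
print(dropped)
```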
How is the loss function in a neural network defined for regression tasks, and how does gradient descent minimize it?
In regression tasks, the loss function is typically defined as the mean squared error between the predicted and actual values. Gradient descent minimizes this error by adjusting the weights to reduce the difference between predictions and true values.
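A short sketch of the mean squared error and its gradient with respect to the predictions, which is the quantity backpropagation then carries into the rest of the network; the example values are assumed:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.3])

mse = np.mean((y_pred - y_true) ** 2)
grad_wrt_pred = 2 * (y_pred - y_true) / len(y_true)   # d(MSE)/d(y_pred)
print(mse, grad_wrt_pred)
```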
Explain the significance of the sigmoid function in binary classification tasks and why it might cause the vanishing gradient problem.
The sigmoid function is commonly used as the output activation in binary classification because it maps any real-valued score to a probability between 0 and 1. However, it can contribute to the vanishing gradient problem because it saturates for inputs of large magnitude (strongly positive or strongly negative), where its derivative becomes very small; even at its maximum the derivative is only 0.25.
How does the learning rate schedule (adaptive learning rates) improve the performance of gradient descent algorithms?
Adaptive learning-rate methods, as used in algorithms such as Adam, RMSprop, and AdaGrad, adjust the effective step size per parameter based on running statistics of the gradients. This improves convergence by taking larger steps along directions with small or infrequent gradients and smaller steps along directions with large or noisy gradients.
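A sketch of Adam-style updates on a simple quadratic objective; the hyperparameters follow commonly cited defaults and the objective is an illustrative assumption:

```python
import numpy as np

# Running bias-corrected estimates of the gradient's first and second moments
# give each parameter its own effective step size.
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

w = np.array([1.0, -2.0])
m = np.zeros_like(w)   # first-moment (mean) estimate
v = np.zeros_like(w)   # second-moment (uncentered variance) estimate

for t in range(1, 101):
    grad = 2 * w                      # e.g. gradient of ||w||^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)      # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)
```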