Neural Networks Flashcards

1
Q

What is deep learning, and how does it contrast with other machine learning algorithms?

A
  • Deep learning is a subset of ML concerned with neural networks: it uses backpropagation and certain principles loosely inspired by neuroscience to model large sets of unlabelled or semi-structured data more accurately. In that sense, deep learning can act as an unsupervised representation-learning approach that learns features of the data through neural nets, whereas most traditional ML algorithms rely on hand-engineered features and labelled data.
2
Q

Vanishing Gradient

A
  • Happens when training deep NNs, particularly those with many layers, such as recurrent neural networks and deep feedforward networks. It is characterized by the gradients of the loss function with respect to the weights becoming very small, approaching zero, as they are propagated back through the layers during backpropagation. This results in slow or stalled learning because the weights are updated only minimally.
  • What causes them?
    • Activation Functions
      • Sigmoid and hyperbolic tangent (tanh) squash their inputs into a small range (between 0 and 1 for sigmoid, and -1 and 1 for tanh), so their derivatives are small. When these small gradients are multiplied across many layers during backpropagation, they become exponentially smaller.
    • Deep Networks
      • Many Layers: in networks with many layers, gradients must be backpropagated through each layer. If the gradients are small at any layer, they continue to diminish as they are propagated backward.
    • Weight Initialization
      • Poor initialization can exacerbate the vanishing gradient problem. If weights are initialized too small, the output of each layer and the corresponding gradients will also be small.
  • Effects:
    • Slow Convergence: training takes a long time to converge, and may never converge.
    • Poor Performance
    • Difficulty in Training Deep Networks.
  • Solutions (see the code sketch after this list):
    • Activation Functions
      • ReLU and Leaky ReLU do not suffer from this problem as much because the gradient is not squashed to small values.
      • Swish and GELU: newer activation functions that also help mitigate vanishing gradients.
    • Weight Initialization
      • Initialize weights with a variance that keeps the output of each layer within a reasonable range.
    • Batch Normalization
      • Normalize inputs of each layer to maintain gradient flow, making network less sensitive to weight initialization.
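A minimal sketch of these mitigations, assuming PyTorch is available (the layer width, depth, and batch size are illustrative assumptions):

```python
import torch
import torch.nn as nn

# A deep feedforward block combining the mitigations above:
# ReLU activations, He (Kaiming) weight initialization, and batch normalization.
class DeepBlock(nn.Module):
    def __init__(self, dim: int = 128, depth: int = 10):
        super().__init__()
        layers = []
        for _ in range(depth):
            linear = nn.Linear(dim, dim)
            # He initialization keeps activation/gradient variance roughly
            # constant across layers when ReLU is used.
            nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
            nn.init.zeros_(linear.bias)
            layers += [linear, nn.BatchNorm1d(dim), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

x = torch.randn(32, 128)   # batch of 32 illustrative inputs
out = DeepBlock()(x)       # gradients can flow through all 10 layers without collapsing
```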
3
Q

Exploding Gradient

A
  • The opposite of the vanishing gradient problem: the gradients of the loss with respect to the weights grow very large as they are propagated backward through the network, causing divergence, an oscillating loss, and numerical instability (e.g. NaN values).
  • How to fix (see the sketch after this list):
    • Gradient Clipping
    • Regularization
      • L2: adding a penalty to large weights helps constrain the growth of weights during training
      • Dropout: Randomly dropping units during training can prevent overfitting and control gradient growth.
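A hedged sketch of gradient clipping (plus L2 via weight decay) in PyTorch; the stand-in model, data, and the clipping threshold of 1.0 are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                 # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            weight_decay=1e-4)           # weight_decay = L2 penalty
x, y = torch.randn(16, 10), torch.randn(16, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients so their global norm does not exceed 1.0,
# preventing a single huge (exploding) update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```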
4
Q

Explain backpropagation in detail

A
  • Process of adjusting the weights of the network to minimize the error between the actual output and desired output.
  • Calculates gradient of the loss function with respect to each weight by using the chain rule of calculus.
  • Steps:
  • Forward Pass: input data is passed through the network; each neuron performs a linear combination of its inputs, applies an activation function, and passes the result to the next layer.
  • Compute Loss: the output of the forward pass is compared to the true labels to compute the loss using a loss function (e.g. MSE, cross-entropy loss).
  • Backward Pass: the gradient of the loss function with respect to each weight is computed, and the chain rule is applied to propagate the error backward through the network, layer by layer, from the output layer to the input layer.
  • Weight Update: weights are updated using an optimization algorithm (e.g. gradient descent) to minimize the loss. (See the sketch below.)
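A concrete sketch of the four steps using plain NumPy on a tiny two-layer network; the sizes, data, and learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer network: linear -> ReLU -> linear, trained with MSE.
x = rng.normal(size=(16, 3))                 # illustrative inputs
y = rng.normal(size=(16, 1))                 # illustrative targets
W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.05

for step in range(100):
    # ---- forward pass: linear combination + activation at each layer ----
    z1 = x @ W1 + b1
    a1 = np.maximum(z1, 0)                   # ReLU
    y_hat = a1 @ W2 + b2

    # ---- compute loss (mean squared error) ----
    loss = np.mean((y_hat - y) ** 2)

    # ---- backward pass: chain rule, layer by layer ----
    d_yhat = 2 * (y_hat - y) / len(x)        # dL/dy_hat
    dW2, db2 = a1.T @ d_yhat, d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T                     # propagate the error to the hidden layer
    d_z1 = d_a1 * (z1 > 0)                   # ReLU derivative
    dW1, db1 = x.T @ d_z1, d_z1.sum(axis=0)

    # ---- weight update (plain gradient descent) ----
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```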
5
Q

What is a perceptron?

A

A perceptron is the simplest type of artificial neural network: a single layer of weights followed by a step activation function, acting as a binary (linear) classifier.
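A quick illustrative sketch (the weights are hand-picked to implement a logical AND):

```python
import numpy as np

def perceptron_predict(x, w, b):
    # Single layer of weights followed by a step activation:
    # output 1 if w.x + b > 0, otherwise 0.
    return int(np.dot(w, x) + b > 0)

w, b = np.array([1.0, 1.0]), -1.5    # hand-picked weights for AND
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_predict(np.array(x), w, b))   # -> 0, 0, 0, 1
```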

6
Q

What is an activation function?

A

An activation function introduces non-linearity to a neural network, enabling it to solve complex problems. Examples: ReLU, sigmoid, and tanh.
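A small sketch of the three example functions (the input values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))   # squashes inputs into (0, 1)

def tanh(x):
    return np.tanh(x)             # squashes inputs into (-1, 1)

def relu(x):
    return np.maximum(x, 0)       # 0 for negative inputs, identity for positive

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```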

7
Q

What is gradient descent in neural networks?

A

Gradient descent is an optimization algorithm that iteratively updates the network’s weights in the direction of the negative gradient of the loss function, in order to minimize it.
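A minimal one-parameter sketch (the loss function, starting point, and learning rate are illustrative):

```python
# Gradient descent on a single weight minimizing f(w) = (w - 3)^2.
w, lr = 0.0, 0.1
for step in range(50):
    grad = 2 * (w - 3)   # derivative of the loss with respect to w
    w -= lr * grad       # step in the direction of the negative gradient
print(round(w, 3))       # converges toward the minimum at w = 3
```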

8
Q

What is a feedforward neural network?

A

A neural network where information moves in only one direction – from input nodes, through hidden nodes, to output nodes – without cycles.

9
Q

What is a loss function in neural networks?

A

A loss function measures how well the neural network’s predictions match the actual data. Common examples include MSE for regression and cross-entropy for classification
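A small illustrative sketch of both losses in NumPy (the values are made up):

```python
import numpy as np

# Mean squared error for regression.
y_true, y_pred = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.3])
mse = np.mean((y_true - y_pred) ** 2)

# Cross-entropy for classification: one-hot target vs. predicted probabilities.
t = np.array([0, 0, 1])                 # true class is the third one
p = np.array([0.1, 0.2, 0.7])           # predicted probabilities
cross_entropy = -np.sum(t * np.log(p))  # = -log(0.7)

print(mse, cross_entropy)
```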

10
Q

What is backpropagation?

A

It is an algorithm used for training neural networks, where the gradient of the loss function is calculated and propagated backward through the network to update weights.

11
Q

What is ReLU?

A

Rectified Linear Unit (ReLU) is an activation function defined as f(x) = max(0, x). It introduces non-linearity without saturating, so it does not cause vanishing gradients the way sigmoid and tanh can.

12
Q

What is dropout in neural networks?

A

Dropout is a regularization technique where random neurons are ignored during training, preventing overfitting by ensuring the network doesn’t rely too heavily on any one neuron (similar in spirit to pruning a tree).
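A quick PyTorch sketch (the dropout rate of 0.5 is an illustrative assumption):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each unit is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()               # training mode: random units dropped, survivors scaled by 1/(1-p)
print(drop(x))

drop.eval()                # evaluation mode: dropout disabled, input passes through unchanged
print(drop(x))
```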

13
Q

What is a convolutional neural network (CNN)?

A

A CNN is a type of deep neural network commonly used in image-processing tasks. It uses convolutional layers to extract features from input images.
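A minimal PyTorch sketch of a CNN for, say, 28x28 single-channel images (all sizes and the class count are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Convolutional layers extract local features; a fully connected head classifies.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # 10 output classes
)

logits = cnn(torch.randn(4, 1, 28, 28))          # batch of 4 fake 28x28 images
print(logits.shape)                              # torch.Size([4, 10])
```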

14
Q

What is a recurrent neural network (RNN)?

A

RNNs are a type of neural network in which connections form a directed cycle, making them effective for sequential data like time series or language.
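A minimal PyTorch sketch (the batch size, sequence length, and feature sizes are illustrative):

```python
import torch
import torch.nn as nn

# An RNN processes a sequence step by step, carrying a hidden state forward.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 20, 8)        # batch of 4 sequences, 20 time steps, 8 features each
outputs, h_n = rnn(x)            # outputs: hidden state at every step; h_n: final hidden state
print(outputs.shape, h_n.shape)  # torch.Size([4, 20, 16]) torch.Size([1, 4, 16])
```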

15
Q

What is a long short-term memory (LSTM)?

A

Type of RNN that can learn long-term dependencies using memory cells to store information over time.

16
Q

What is a fully connected layer?

A

A layer in which each neuron is connected to every neuron in the previous and the next layer, often used at the end of a network for final classification.

17
Q

What is batch normalization

A

A technique that normalizes the inputs to each layer in a network, speeding up training and improving performance by reducing internal covariate shift.

  • Internal covariate shift: a phenomenon that occurs in neural networks when the distribution of inputs to a layer changes during training. This happens because the network’s parameters are updated, which changes the distribution of inputs to subsequent layers: in deep networks, the output of each layer feeds into the next, so when the parameters of one layer change, the distribution of inputs to the next layer changes as well.
18
Q

What is weight initialization?

A

Setting initial weights for a neural network before training. Poor initialization can lead to slow convergence or model failure. Techniques include Xavier and He initialization.
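A short PyTorch sketch of both techniques (the layer size is an illustrative assumption):

```python
import torch.nn as nn

layer = nn.Linear(256, 256)

# Xavier (Glorot) initialization: suited to sigmoid/tanh activations.
nn.init.xavier_uniform_(layer.weight)

# He (Kaiming) initialization: suited to ReLU activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

nn.init.zeros_(layer.bias)
```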

19
Q

What is learning rate?

A

A hyperparameter that controls the step size during gradient descent. Too large a rate can cause overshooting, while too small a rate can cause slow convergence

20
Q

What is a pooling layer in CNNs?

A

A layer used to down-sample the spatial dimensions (width, height) of the input, reducing the number of parameters and computation in the network.

Say a picture is very high-resolution and has 6 shades of pink in one corner; the pooling layer can merge those 6 values into 1 representative value, simplifying the model.

Examples: Max Pooling, Average Pooling.
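A tiny PyTorch sketch on a hand-made 4x4 input (the values are illustrative):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 1., 2., 3.],
                    [1., 2., 4., 5.]]]])   # one 4x4 single-channel "image"

print(nn.MaxPool2d(2)(x))   # keeps the maximum of each 2x2 patch -> 2x2 output
print(nn.AvgPool2d(2)(x))   # keeps the average of each 2x2 patch -> 2x2 output
```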

21
Q

What is transfer learning?

A

A technique where a pre-trained model (usually on a large dataset) is fine-tuned for a specific task on a smaller dataset, improving performance and reducing training time.
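A hedged sketch using a torchvision model as the pre-trained network (the choice of ResNet-18, the 5-class head, and a recent torchvision weights API are assumptions):

```python
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification head for the new task (assumed 5 classes);
# only this layer is then trained (fine-tuned) on the smaller dataset.
model.fc = nn.Linear(model.fc.in_features, 5)
```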

22
Q

What is a softmax function?

A

A function that converts logits (raw model predictions) into probabilities for multi-class classification, ensuring they sum to 1.

Sigmoid is used for two class classification.

A, B, C, D -> model -> logits -> softmax -> 10% A, 20% B, 60% C, 10% D (the cross-entropy loss is then computed from these probabilities).
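A small NumPy sketch (the logit values are illustrative):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize to sum to 1.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10], sums to 1
```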

23
Q

Adam Optimizer

A

Adam (Adaptive Moment Estimation) is an optimization algorithm used to train deep neural networks. It improves on standard Stochastic Gradient Descent (SGD), is often treated as a default choice for deep learning, and is known for fast convergence and robustness across problems.
Adam works by adjusting the effective step size for each parameter based on that parameter’s gradient history, which helps the network learn more efficiently. It keeps track of gradients from previous steps, but it doesn’t simply average them: it maintains exponentially decaying averages of both the gradients (first moment) and the squared gradients (second moment), giving more weight to recent information.
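A minimal PyTorch usage sketch (the stand-in model, data, and the learning rate of 1e-3 are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # per-parameter adaptive steps

x, y = torch.randn(32, 10), torch.randn(32, 1)
for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()       # Adam tracks decaying averages of gradients (momentum)
    optimizer.step()      # and squared gradients (per-parameter scaling)
```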