Part 1 Flashcards
Name a function that models the all-or-nothing response of biological neurons.
The threshold (Heaviside step) function
Why is the signum function not used in deep learning?
It is not differentiable at x = 0, and its derivative is 0 everywhere else.
=> the gradient is either 0 or undefined → weight updates would be 0 → no learning would occur, i.e. the network would never improve its performance on the training data.
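A minimal NumPy sketch (illustrative, not from the source) of why gradient-based learning fails here: the derivative of the sign function is zero everywhere it exists, so no gradient signal reaches the weights.

```python
import numpy as np

def sign(x):
    # All-or-nothing response: -1, 0, or +1
    return np.sign(x)

def sign_derivative(x):
    # Zero everywhere except x = 0, where it is undefined;
    # gradient descent therefore receives no learning signal.
    return np.zeros_like(x)

x = np.linspace(-2, 2, 5)
print(sign(x))             # [-1. -1.  0.  1.  1.]
print(sign_derivative(x))  # [0. 0. 0. 0. 0.]
```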
What is Dropout?
- regularization technique
- prevent overfitting
- works by randomly dropping out units (hidden + visible) in a NN during training
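A minimal sketch of inverted dropout (the common modern variant; the function name and the 1/(1-p) rescaling are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Zero each unit with probability p during training and rescale
    by 1/(1-p) so the expected activation stays unchanged."""
    if not training:
        return activations  # dropout is disabled at inference time
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))  # roughly half the units are zeroed
```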
What is the objective function of Rosenblatt’s perceptron?
Find weights that minimize the total distance of misclassified samples to the decision boundary.
Classification is based on the sign of the (signed) distance.
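A sketch of the classic perceptron learning rule under these definitions (labels in {-1, +1}, bias folded into the inputs; the data and hyperparameters are illustrative):

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=10):
    """Update only on misclassified samples (wrong sign of the signed
    distance), nudging the boundary toward them."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # misclassified
                w += lr * yi * xi
    return w

# Tiny linearly separable example; last column is a constant bias input
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))  # classify by the sign of the distance: [1 1 -1 -1]
```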
Why is it useful to learn a bias term in training?
helps offset the result,
reduces errors in the computations of the activation functions,
and ensures a non-zero output even when the input is zero
–> provides flexibility in shifting the activation function, thus improving the model’s ability to fit the data and make more precise predictions
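A tiny sketch of the shifting effect (weights and bias values chosen arbitrarily for illustration): the bias translates the pre-activation, moving where the neuron "turns on".

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 1.0, 2.0])
w = 1.0
print(sigmoid(w * x))        # no bias: output is pinned to 0.5 at x = 0
print(sigmoid(w * x - 2.0))  # bias b = -2 shifts the activation curve
```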
What is the task of Softmax function as the last layer in a neural network for a classification task?
produces a probability distribution over the classes for each input
By:
- rescale so that output sums up to 1
- produce non-negative output
= the normalized exponential function
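A minimal sketch of the normalized exponential (the max-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Non-negative outputs that sum to 1: a probability distribution."""
    z = z - np.max(z, axis=-1, keepdims=True)  # stability trick
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())  # [0.659 0.242 0.099] 1.0
```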
What are the advantages of making a neural network deeper?
- Greater expressive power
- model can scale to large and complex datasets
Why?
+ more layers –> more diverse paths through which information can flow
+ exponential feature reuse + hierarchical representation learning –> facilitates the extraction of increasingly abstract features
What does backpropagation do?
computes all gradients required for the optimization of the network.
What is the exploding gradient problem?
The updates in earlier layers can become increasingly large.
Cause: repeated multiplication of large per-layer derivatives during backpropagation; a too-high learning rate amplifies this –> positive feedback –> the loss grows without bound.
What is the vanishing gradient problem?
The updates in earlier layers can be negligibly small.
Cause: repeated multiplication of small per-layer derivatives (e.g. saturated sigmoids) during backpropagation –> the gradient shrinks exponentially with depth –> earlier layers barely learn.
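A numeric sketch of both effects (the uniform per-layer factor is a simplifying assumption): the backpropagated gradient is a product of per-layer derivatives, so factors below 1 shrink it exponentially with depth and factors above 1 blow it up.

```python
def chained_gradient(factor, depth):
    """Gradient reaching the first layer of a depth-layer chain when
    every layer contributes the same local derivative `factor`."""
    return factor ** depth

for depth in (5, 20, 50):
    print(depth, chained_gradient(0.5, depth), chained_gradient(1.5, depth))
# factor < 1 -> vanishing gradients; factor > 1 -> exploding gradients
```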
What is the standard loss function for classification?
Cross-entropy loss: assumes that the outputs can be interpreted as probabilities that the input belongs to each class
Specifically, it assumes that the data follow a Bernoulli (for binary classification) or Multinoulli/Categorical (for multi-class classification) distribution.
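A minimal sketch of the multi-class case with one-hot targets (the function name and the clipping constant are illustrative assumptions):

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(targets * np.log(probs), axis=1))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0]])
print(cross_entropy(probs, targets))  # ~0.29
```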
What is the standard loss function for regression?
L2-loss: assumes that the residuals (i.e. differences between the true and predicted values) follow a Gaussian distribution.
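A one-line sketch in its mean-squared-error form:

```python
import numpy as np

def l2_loss(y_true, y_pred):
    """Mean squared error: maximum-likelihood under Gaussian residuals."""
    return np.mean((y_true - y_pred) ** 2)

print(l2_loss(np.array([1.0, 2.0, 3.0]),
              np.array([1.1, 1.9, 3.2])))  # 0.02
```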
What is Batch Gradient Descent (BGD)?
- Steepest GD
- computes the gradient of the cost function w.r.t the model parameters
- using the entire training dataset in each iteration
What is Stochastic (Online) Gradient Descent?
- computes the gradient of the cost function w.r.t the model parameters
- using 1 sample in each iteration
What is Mini-Batch SGD?
- computes the gradient of the cost function w.r.t the model parameters
- using B ≪ M random samples in each iteration (M = size of the training set)
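A sketch contrasting the three variants on a toy least-squares problem; the batch size B selects between them (B = M gives batch GD, B = 1 gives SGD, 1 < B < M gives mini-batch). All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100  # training-set size
X = rng.normal(size=(M, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=M)

def gd_step(w, X_b, y_b, lr=0.1):
    """One update on a batch: gradient of the mean squared error."""
    grad = 2.0 / len(X_b) * X_b.T @ (X_b @ w - y_b)
    return w - lr * grad

w = np.zeros(3)
B = 16  # B = M -> batch GD, B = 1 -> SGD, 1 < B < M -> mini-batch SGD
for _ in range(200):
    idx = rng.choice(M, size=B, replace=False)
    w = gd_step(w, X[idx], y[idx])
print(w)  # approaches [1.0, -2.0, 0.5]
```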
What are the steps to train a neural network using the backpropagation algorithm and an optimizer like Stochastic Gradient Descent (SGD)?
- randomly initialize the weights and biases
- forward the input through the network and get the output
- compute the loss between the prediction and the target
- tune the weights and biases of each neuron to minimize the loss (backpropagate the gradients, then apply the optimizer's update)
- iterate until the weights converge
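A minimal sketch of these steps for a one-hidden-layer network on XOR (the architecture, MSE loss, and learning rate are illustrative assumptions; full-batch updates are used for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)  # XOR targets
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1) Randomly initialize weights and biases
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

lr = 1.0
for step in range(5000):                 # 5) iterate until converged
    h = sigmoid(X @ W1 + b1)             # 2) forward pass
    out = sigmoid(h @ W2 + b2)
    # 3) loss between prediction and target (MSE here for simplicity)
    # 4) backpropagate gradients and update weights/biases
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```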
What is the idea of momentum-based learning?
Idea: Accelerate in directions with persistent gradients
- parameter updates are based on current and past gradients, i.e. previous gradient directions are used to accelerate training and to become more robust against local minima.
What is the purpose of the Momentum used in different optimizers?
It stabilizes the training by computing the moving average over the previous gradients.
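A sketch of classical momentum on a 1-D quadratic (hyperparameter values are illustrative):

```python
import numpy as np

def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    """The velocity is an exponential moving average of past gradients:
    it accelerates persistent directions and damps oscillating ones."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    grad = 2 * w                # gradient of f(w) = w^2
    w, v = momentum_update(w, grad, v)
print(w)  # approaches the minimum at 0
```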
What is the zero-centering problem?
The lack of zero-centered outputs when the sigmoid function is used as an activation function in training neural networks.
–> all inputs to the next layer are positive –> covariate shift of successive layers and inefficient, zig-zagging weight updates
how to solve the zero-centering problem?
Batch normalization which standardizes the inputs to each layer to have zero mean and unit variance, reducing the amount of internal covariate shift.
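A minimal sketch of batch normalization in training mode (inference would use running statistics instead; names are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch to zero mean and unit
    variance, then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```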
What is the dying ReLUs problem?
- a situation where a neuron gets stuck in a state where it only outputs 0 for any input
- a neuron’s weights get updated such that its pre-activation becomes negative for every input → it outputs 0 → its gradient during backpropagation is also 0 → no more updates
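A sketch contrasting the ReLU gradient with the common Leaky-ReLU mitigation (the alpha value is an illustrative choice): once a ReLU unit's inputs are all negative, its gradient is exactly zero and it can never recover.

```python
import numpy as np

def relu_grad(x):
    # Zero for all negative inputs: a unit stuck here gets no updates.
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Small negative slope keeps a nonzero gradient, avoiding dead units.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
```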
What are the disadvantages of the powerful neural network with fully connected layers that motivate CNN?
- size problem: too many trainable weights –> too expensive
- pixels are a bad representation:
+ highly correlated neighbouring pixels
+ scale dependent –> struggles with images of different sizes
+ sensitive to intensity variations
- doesn’t take the spatial relationships between pixels into account
What are the advantages of CNN in comparison with fully connected neural networks?
-Local connectivity
-Weight sharing
-translational invariance (recognize patterns irrespective of their position in the input)
-grid-like alignment of images
What decides the choice of function to apply to the output of CNN for the classification problems?
- Multi-class classification: each instance belongs to exactly 1 class → the probabilities of the different classes must sum up to 1, and one probability should become significantly larger than the others → effectively deciding the class of the output => use the softmax function
- Binary / multi-label classification: each instance can belong to more than one class → output independent probabilities for each class => use the sigmoid function
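A tiny sketch contrasting the two output functions on the same logits (values are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

logits = np.array([2.0, -1.0, 0.5])
print(softmax(logits))  # sums to 1 -> exactly one class (multi-class)
print(sigmoid(logits))  # independent probabilities (multi-label)
```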
What are the four essential building blocks of Convolutional Neural Networks?
- convolutional layer
- activation function
- pooling layer/ subsampling layer
- Fully connected layer
What is the function of the convolutional layer in CNN?
→ to detect local features from the previous layers
by sliding a set of filters (or kernels) across the input image.
Each filter is responsible for learning some local feature within the image.
FEATURE LEARNING –> produces a feature map
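A minimal sketch of a single-channel "valid" convolution producing a feature map (implemented as cross-correlation, as most deep learning frameworks do; the edge filter is an illustrative choice):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel across the image and take dot products of
    local patches -> one value per position = a feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # vertical-edge detector
print(conv2d(image, edge_filter))  # 3x3 feature map
```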