Part 1 Flashcards
Name a function that models the all-or-nothing response of biological neurons.
The threshold (step) function, e.g. the signum function.
Why is the signum function not used in deep learning?
It is not differentiable at x = 0, and its derivative is 0 everywhere else.
=> the gradient is either 0 or undefined → the weight updates would be 0 → no learning would occur, i.e. the network would never improve its performance on the training data.
What is Dropout?
- regularization technique
- prevents overfitting
- works by randomly dropping out units (hidden and visible) in a neural network during training
What is the objective function of Rosenblatt’s perceptron?
Find the weights that minimize the total distance of the misclassified samples to the decision boundary.
Classification is based on the sign of the (signed) distance to the decision boundary.
Why is it useful to learn a bias term in training?
It offsets the weighted sum that is fed into the activation function,
ensuring a non-zero output is possible even when all inputs are zero,
→ providing flexibility in shifting the activation function, thus improving the model's ability to fit the data and make more precise predictions.
What is the task of Softmax function as the last layer in a neural network for a classification task?
It produces a probability distribution over the classes for each input.
By:
- rescaling the outputs so that they sum up to 1
- producing non-negative outputs
= the normalized exponential function (see the sketch below)
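A minimal NumPy sketch of the normalized exponential (the function name and the max-subtraction for numerical stability are illustrative additions, not part of the card):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Normalized exponential: non-negative outputs that sum up to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> roughly [0.66, 0.24, 0.10], sums to 1
```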
What are the advantages of making a neural network deeper?
- Greater expressive power
- model can scale to large and complex datasets
Why?
+ more layers → more diverse paths through which information can flow
+ exponential feature reuse + hierarchical representations of the data → facilitates the extraction of increasingly abstract features
What does backpropagation do?
computes all gradients required for the optimization of the network.
What is the exploding gradient problem?
The gradients, and hence the weight updates in earlier layers, can grow increasingly large.
Large derivatives/weights multiplied across many layers (or a learning rate that is too high) → positive feedback → the loss grows without bound.
What is the vanishing gradient problem?
The gradients, and hence the weight updates in earlier layers, can become negligibly small.
Small derivatives (e.g. of saturating activations) multiplied across many layers → the gradient shrinks towards 0 → the earlier layers barely learn.
What is the standard loss function for classification?
Cross-entropy loss: assumes that the outputs can be interpreted as probabilities that the input belongs to each class
Specifically, it assumes that the data follow a Bernoulli (for binary classification) or Multinoulli/Categorical (for multi-class classification) distribution.
What is the standard loss function for regression?
L2-loss: assumes that the residuals (i.e. differences between the true and predicted values) follow a Gaussian distribution.
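A NumPy sketch of both standard losses for a single sample (the function names and toy numbers are made up for illustration):

```python
import numpy as np

def cross_entropy(probs: np.ndarray, target_class: int) -> float:
    """Negative log-probability of the true class (classification)."""
    return float(-np.log(probs[target_class]))

def l2_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Sum of squared residuals (regression)."""
    return float(np.sum((y_true - y_pred) ** 2))

print(cross_entropy(np.array([0.7, 0.2, 0.1]), target_class=0))  # ~0.36
print(l2_loss(np.array([1.0, 2.0]), np.array([1.5, 1.5])))       # 0.5
```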
What is Batch Gradient Descent (BGD)?
- Steepest GD
- computes the gradient of the cost function w.r.t the model parameters
- using the entire training dataset in each iteration
What is Stochastic (Online) Gradient Descent?
- computes the gradient of the cost function w.r.t the model parameters
- use 1 sample in each iteration
What is Mini-Batch SGD?
- computes the gradient of the cost function w.r.t the model parameters
- uses B ≪ M random samples in each iteration (B: mini-batch size, M: number of training samples), as in the sketch below
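A sketch contrasting the three variants on a toy linear least-squares problem; only the batch size differs (the data, model, and learning rate are made up):

```python
import numpy as np

def run_epoch(w, X, y, lr, batch_size):
    """One epoch of gradient descent on a linear model with squared loss."""
    M = len(X)
    order = np.random.permutation(M)
    for start in range(0, M, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of the mean squared error
        w = w - lr * grad
    return w

X, y = np.random.randn(100, 3), np.random.randn(100)
w = np.zeros(3)
w = run_epoch(w, X, y, lr=0.01, batch_size=100)  # batch GD: the entire training set
w = run_epoch(w, X, y, lr=0.01, batch_size=1)    # stochastic (online) GD: 1 sample
w = run_epoch(w, X, y, lr=0.01, batch_size=16)   # mini-batch SGD: B << M random samples
```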
What are the steps to train a neural network using the backpropagation algorithm and an optimizer like Stochastic Gradient Descent (SGD)?
- randomly initialize the weights and biases
- forward the input through the network and get the output
- compute the loss between the prediction and the ground truth
- tune the weights and biases of each neuron to minimize the loss
- iterate until the weights are optimized (see the sketch below)
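A minimal PyTorch sketch of this training loop (the toy data, architecture, and hyperparameters are made up for illustration):

```python
import torch
from torch import nn

torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randint(0, 3, (64,))   # toy data: 64 samples, 3 classes

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))  # random initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                  # iterate until the weights are (roughly) optimized
    logits = model(X)                    # forward the input through the network
    loss = loss_fn(logits, y)            # loss between prediction and ground truth
    optimizer.zero_grad()
    loss.backward()                      # backpropagation: compute all gradients
    optimizer.step()                     # tune weights and biases to minimize the loss
```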
What is the idea of momentum-based learning?
Idea: Accelerate in directions with persistent gradients
- parameter updates are based on the current and past gradients, i.e. previous gradient directions are used to accelerate the training and to become more robust against local minima (see the sketch below)
What is the purpose of the Momentum used in different optimizers?
It stabilizes the training by computing the moving average over the previous gradients.
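A sketch of the classical momentum update on a toy quadratic (the hyperparameters lr and beta are illustrative):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Update based on a moving average over the current and past gradients."""
    velocity = beta * velocity + grad     # accumulate persistent gradient directions
    w = w - lr * velocity                 # accelerate along them
    return w, velocity

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(5):
    grad = 2 * w                          # gradient of f(w) = ||w||^2
    w, v = momentum_step(w, grad, v)
```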
What is the zero-centering problem?
the lack of zero-centered output when the sigmoid function is used as an activation function in training neural networks.
–> covariate shift of successive layers
How can the zero-centering problem be solved?
Batch normalization which standardizes the inputs to each layer to have zero mean and unit variance, reducing the amount of internal covariate shift.
What is the dying ReLUs problem?
- a situation where a neuron gets stuck in a state in which it outputs 0 for any input
- the neuron's weights get updated such that its pre-activation is negative for every input → it outputs 0 → the gradient for that neuron during backpropagation is also 0 → no more updates
What are the disadvantages of fully connected neural networks (applied to images) that motivate CNNs?
- size problem: too many trainable weights → too expensive
- pixels are a bad representation:
+ highly correlated input neurons (neighbouring pixels)
+ scale dependent → struggles with images of different sizes
+ sensitive to intensity variations
- the spatial relationships between pixels are not taken into account
What are the advantages of CNN in comparison with fully connected neural networks?
-Local connectivity
-Weight sharing
-translational invariance (recognize patterns irrespective of their position in the input)
- exploits the grid-like structure of images
What decides the choice of the function applied to the output of a CNN for classification problems?
- Multi-class classification: each instance belongs to exactly 1 class → the probabilities of the different classes should sum up to 1, and one probability should become significantly larger than the others, effectively deciding the class of the output => use the softmax function
- Binary / multi-label classification: each instance can belong to more than one class → output independent probabilities for each class => use the sigmoid function
What are the four essential building blocks of Convolutional Neural Networks?
- convolutional layer
- activation function
- pooling layer/ subsampling layer
- Fully connected layer
What is the function of the convolutional layer in CNN?
→ detects local features from the previous layer (exploiting local connectivity)
Slides a set of filters (or kernels) across the input image.
Each filter is responsible for learning some local feature within the image.
FEATURE LEARNING → produces a feature map (one per filter)
What is the function of the activation function in CNN?
provide nonlinearity to learn complex patterns
What is the function of the pooling layer?
→ compresses and aggregates information across spatial locations
- saves parameters and computation, reduces overfitting
- produces downsampled feature maps
What is the function of the fully connected layer in CNN?
-traditionally used at the end of CNNs for classification tasks.
- connects every neuron to every neuron in the previous and subsequent layers, allowing the network to make predictions based on the high-level features learned in earlier layers (all four building blocks are combined in the sketch below)
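A minimal PyTorch sketch combining the four building blocks in order (the layer sizes and class count are made up):

```python
import torch
from torch import nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: local feature detection
    nn.ReLU(),                                   # activation function: nonlinearity
    nn.MaxPool2d(2),                             # pooling layer: downsample the feature maps
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # fully connected layer: classification
)
logits = cnn(torch.randn(1, 3, 32, 32))          # -> shape (1, 10)
```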
How can we do backpropagation through a convolutional layer?
By convolving the output gradients with the filter flipped horizontally and vertically (i.e. rotated by 180°).
What is the purpose of convolution with 1x1 filter?
- bottleneck layer
- merges (or reduces) channels so that the size of the network decreases → fewer parameters, less computation
- also reduces overfitting
What are the benefits of 1x1 convolution?
- 1x1 convolutions simply compute inner products across the channels at each spatial position
- a simple and efficient method to decrease the size of a network
- learns a dimensionality reduction, e.g., can reduce redundancy in your feature maps (see the sketch below)
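A minimal PyTorch sketch of a 1x1 bottleneck convolution (the channel counts are made up):

```python
import torch
from torch import nn

bottleneck = nn.Conv2d(256, 64, kernel_size=1)   # merge 256 channels into 64 at each position
y = bottleneck(torch.randn(1, 256, 28, 28))
print(y.shape)   # torch.Size([1, 64, 28, 28]): same spatial size, fewer channels
# Parameters: 256 * 64 weights + 64 biases, far fewer than a spatially larger filter would need.
```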
How do we make the CNN architecture better?
Replace the final flatten + fully connected layers with a 1x1 (or NxN) convolution + global average pooling, as in the sketch below.
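A hypothetical fully convolutional classification head in PyTorch (the feature channels and class count are made up):

```python
import torch
from torch import nn

head = nn.Sequential(
    nn.Conv2d(128, 10, kernel_size=1),   # 1x1 convolution: map 128 feature channels to 10 class maps
    nn.AdaptiveAvgPool2d(1),             # global average pooling over the spatial dimensions
    nn.Flatten(),                        # -> (batch, 10) class scores
)
print(head(torch.randn(2, 128, 7, 7)).shape)   # torch.Size([2, 10])
```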
What is the bias-variance trade-off?
The balance between bias and variance in the performance of an ML model:
- bias: systematic error caused by overly simple assumptions about the data
- variance: the model's sensitivity to fluctuations in the training data
- simultaneously optimizing bias and variance is impossible in general
What do high/low bias and variance mean?
Low Bias, High Variance:
-overly complex, capture noise
- overfitting: well on training, poor on new unseen data
High Bias, low variance:
-too simple, underfitting
-perform poorly on both training + test data
Balanced / sensible model:
-capturing the underlying patterns without being too influenced by noise.
-It generalizes well to new, unseen data.
What is model capacity?
The capacity of a model describes the variety (range) of functions it can approximate.
It is related to the number of parameters.
How does the number of independent training samples affect the loss on the training and test set?
- starting with a small training set → low training loss but high test loss, i.e. overfitting
- more training data → reduces the variance and the test loss
- the model capacity should be matched to the size of the training set; a capacity that is far too high for the available data → severe overfitting
How does the model capacity affect the loss on the training and test set?
Increasing the model capacity → training and test loss decrease,
up to the overfitting point → beyond it, severe overfitting sets in: the test loss increases again
→ a decrease in bias is traded for an increase in variance
Techniques to address overfitting in a NN
- augment data
- adapt architecture
- adapt the training process
- preprocessing
- regularizer (in loss function)
- dropout
- use a validation set and keep the parameters with the minimum validation loss (early stopping)
How do we use data augmentation to avoid overfitting?
Ensure that the label is invariant to every transformation applied, e.g. a rotation.
1. random spatial transforms
2. pixel transformations (change the resolution, add random noise, change the pixel distribution) (see the sketch below)
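A hypothetical torchvision augmentation pipeline; the chosen transforms and parameters are only illustrative:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=15),                # random spatial transform (label-invariant)
    T.RandomHorizontalFlip(),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.ColorJitter(brightness=0.2),               # pixel transformation
    T.ToTensor(),
])
# augmented = augment(pil_image)   # applied on the fly to every training image
```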
What is the main idea of regularization in the loss function?
add a penalty term to the loss function
What are ways for regularization in the loss function?
- enforce small norm: L2 norm
- enforce sparsity: L1 norm
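A PyTorch sketch of adding L1/L2 penalty terms to a task loss (the function name and coefficients are made up):

```python
import torch

def regularized_loss(task_loss, model, l2=1e-4, l1=0.0):
    """Task loss plus an L2 (small-norm) and/or L1 (sparsity) penalty on the weights."""
    penalty = 0.0
    for p in model.parameters():
        penalty = penalty + l2 * p.pow(2).sum() + l1 * p.abs().sum()
    return task_loss + penalty

model = torch.nn.Linear(4, 2)
loss = regularized_loss(torch.tensor(1.0), model, l2=1e-4, l1=1e-5)
```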
How do the weights behave when the network is trained with L1 and L2 regularization? in comparison with a network without regularization
- L1 norm:
+ shrinks the weights differently than L2: by a constant amount, independent of their magnitude
+ many weights become exactly 0, especially when lambda is large, i.e. sparse weights
- L2 norm:
+ weight decay (shrinkage proportional to the weight magnitude)
+ small weights, more spread-out or diffuse weight vectors
What is the purpose of data normalization?
standardizes the range of the independent variables or features of the data
→ prevents features with large value ranges from dominating the training
How should data normalization be used in training NN?
- compute the normalization statistics on the training data ONLY
- normalization of input data
- normalization within the network
What are some methods of data normalization?
-min/max
-z-score / variance normalization
- zero-centering / mean subtraction
- Batch normalization
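A minimal NumPy sketch of z-score normalization; the statistics come from the training data only (the toy data are made up):

```python
import numpy as np

X_train = np.random.randn(100, 5) * 3 + 7
X_test = np.random.randn(20, 5) * 3 + 7

mean, std = X_train.mean(axis=0), X_train.std(axis=0)   # computed on the training data ONLY
X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std                     # test data reuses the training statistics
```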
What is the benefit of using Batch normalization?
- reduces Internal Covariate Shift, i.e. it normalizes the distribution of the input to the layer that follows the batch normalization layer
- improves training stability
What is the internal covariate shift problem?
refers to the change in the distribution of network activations caused by adjustments in network parameters during training
What are the reasons that lead to Internal Covariate Shift?
- ReLU is not zero-centered
- initialization and input distribution might not be normalized
- deeper nets → the effect is amplified across layers
What is a self-normalizing neural network?
- a method to address the stability problems of SGD training
- SELU activations + a specific weight initialization
- a special form of dropout (alpha dropout)
→ stable activations, stable training
What is the aim of the dropout technique?
- reduce co-adaptation (different neurons in the network become highly dependent on each other during training –> less robust model) –> independent features
What is the idea of dropout?
Randomly drop/deactivate a fraction of the neurons in the network at each update cycle, i.e. they are ignored during the forward pass.
How:
- set activations to 0 with probability (1 - p) → the dropout effect must be compensated at test time by multiplying the activations with the keep probability p (see the sketch below)
- DropConnect: drop individual connections instead (a less efficient implementation)
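A minimal NumPy sketch of classic dropout with keep probability p, following the card's description (frameworks usually use the "inverted" variant that rescales during training instead):

```python
import numpy as np

def dropout(activations: np.ndarray, p: float = 0.8, training: bool = True) -> np.ndarray:
    if training:
        mask = np.random.rand(*activations.shape) < p   # keep each unit with probability p
        return activations * mask                       # dropped units are set to 0
    return activations * p                              # test time: compensate by scaling with p
```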
How would initialization affect convex optimization problems?
- it does not matter for the solution, i.e. the global minimum is always reached
- but a bad initialization → slower convergence, more computational resources
How would initialization affect non-convex optimization problems?
- it does matter
- a NN with non-linearities is in general a non-convex problem → different initializations can end up in different (local) minima
How should biases be initialized?
- simply initialize them to 0
- when using ReLU, a small positive constant, e.g. 0.1, is better to avoid the dying ReLU issue
How should weights be initialized?
- randomly
- initializing all weights with 0 is bad (every neuron would compute the same output and receive the same update, so the symmetry is never broken)
- small uniform / Gaussian values
e.g. uniform random values in the range [0, 1]
What is the idea of Xavier initialization?
- calibrate the variances of the activations in the forward pass by initializing the weights with a zero-mean Gaussian
- its variance takes the number of input features into account
What is the idea of He initialization?
- a variant of Xavier initialization that is effective with activations such as ReLU, which are not zero-centered and zero out part of their input
- scales the weights by a factor that takes the number of input features into account → helps keep the variance of the activations roughly the same across layers (see the sketch below)
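A NumPy sketch of both schemes; Xavier has several common variants, the 1/n_in form is used here as an assumption:

```python
import numpy as np

def xavier_init(n_in: int, n_out: int) -> np.ndarray:
    """Zero-mean Gaussian with variance 1/n_in (one common Xavier/Glorot variant)."""
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

def he_init(n_in: int, n_out: int) -> np.ndarray:
    """Zero-mean Gaussian with variance 2/n_in, suited to ReLU activations."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

W1 = xavier_init(784, 256)
W2 = he_init(256, 10)
```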
What is transfer learning?
reuse models / use a pre-trained model on a new problem
- for a different task on the same data
- on different data for the same task
- on different data for a different task
How does transfer learning work?
- weight transfer (different task)
e.g. a model pre-trained on image classification → target model for object detection / segmentation (see the sketch below)
- transfer between modalities (different data type)
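A hypothetical weight-transfer sketch using a torchvision backbone (the model choice, the frozen layers, and the 5-class head are illustrative assumptions):

```python
import torch
from torch import nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")     # source model pre-trained on ImageNet
for p in backbone.parameters():
    p.requires_grad = False                             # freeze the transferred weights (optional)
backbone.fc = nn.Linear(backbone.fc.in_features, 5)     # new head for the target task (5 classes)
logits = backbone(torch.randn(1, 3, 224, 224))          # only the new head will be trained
```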
What are the benefits of transfer learning?
- weight transfer:
+ capitalizes on features learned by the source model
+ speeds up training on the target task
- transfer between modalities:
+ benefits from the representation learning achieved in one modality when training on a different modality
+ useful when labeled data is scarce in the target modality
What is multi-task learning (MTL)?
train a network simultaneously on multiple related tasks
What is hard parameter sharing in MTL?
- several hidden layers are shared between all tasks (usually the feature extraction layers)
- MTL of N tasks → reduces the chance of overfitting by an order of N (see the sketch below)
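A minimal PyTorch sketch of hard parameter sharing: one shared feature extractor and one head per task (the sizes and tasks are made up):

```python
import torch
from torch import nn

class MultiTaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(10, 64), nn.ReLU())  # hidden layers shared by all tasks
        self.head_a = nn.Linear(64, 3)   # task A: e.g. 3-class classification
        self.head_b = nn.Linear(64, 1)   # task B: e.g. regression

    def forward(self, x):
        features = self.shared(x)
        return self.head_a(features), self.head_b(features)

out_a, out_b = MultiTaskNet()(torch.randn(8, 10))
```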
What is soft parameter sharing?
- each task has its own model with its own parameters
- instead of forcing the parameters to be equal, the distance between the models' parameters is regularized as part of the loss function
- options: e.g. L2-norm, trace norm, ...
What are auxiliary tasks for?
- additional tasks trained alongside the original task
- to create a more stable network (more robust shared features)
e.g. facial landmark detection + learning subtly related tasks such as face pose, smile/not smile, glasses/no glasses, gender