Part 1 Flashcards

1
Q

Name a function that models the all-or-nothing response of biological neurons.

A

The threshold (Heaviside step) function.

2
Q

Why is the signum function not used in deep learning?

A

It is not differentiable at x = 0, and its derivative is 0 everywhere else
=> the derivative is either 0 or undefined → no learning would occur, i.e. the weight updates would be 0 → the network could never improve its performance on the training data.

3
Q

What is Dropout?

A
  • regularization technique
  • prevents overfitting
  • by randomly dropping out units (hidden and visible) in a NN during training
4
Q

What is the objective function of Rosenblatt’s perceptron?

A

Find weights that minimize the total distance of misclassified samples to the decision boundary.
Classification is based on the sign of the (signed) distance to the boundary.
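
One way to write this objective is the perceptron criterion (a sketch of the usual formulation, with the bias absorbed into w and M the set of misclassified samples):

```latex
E(\mathbf{w}) = -\sum_{n \in \mathcal{M}} y_n\, \mathbf{w}^\top \mathbf{x}_n ,
\qquad y_n \in \{-1, +1\}, \qquad \hat{y}(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^\top \mathbf{x})
```

Each misclassified sample has y_n w^T x_n < 0, so every term is positive and proportional to its distance from the boundary (up to the norm of w).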

5
Q

Why is it useful to learn a bias term in training?

A

It offsets the result by shifting the activation function and ensures a non-zero output even when the input is zero.
–> this added flexibility improves the model’s ability to fit the data and make more precise predictions.

6
Q

What is the task of Softmax function as the last layer in a neural network for a classification task?

A

It produces a probability distribution over the classes for each input.
By:
- producing non-negative outputs (exponentiation)
- rescaling the outputs so that they sum up to 1
= it is the normalized exponential function
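
A minimal NumPy sketch (subtracting the maximum before exponentiation is a common numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Normalized exponential: non-negative outputs that sum to 1."""
    z = z - np.max(z)          # shift for numerical stability (optional)
    e = np.exp(z)              # non-negative
    return e / e.sum()         # rescale so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10]
```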

7
Q

What are the advantages of making a neural network deeper?

A
  • Greater expressive power
  • model can scale to large and complex datasets
    Why?
    + more layers –> more diverse paths through which information can flow
    + exponential feature reuse; the network learns hierarchical representations of the data –> facilitates the extraction of increasingly abstract features
8
Q

What does backpropagation do?

A

computes all gradients required for the optimization of the network, by applying the chain rule backward from the loss through every layer.

9
Q

What is the exploding gradient problem?

A

The updates in earlier layers can become increasingly large: per-layer gradient factors greater than 1 amplify the gradient as it is propagated backward.
Combined with a learning rate that is too high –> positive feedback –> the loss grows without bound.

10
Q

What is the vanishing gradient problem?

A

The updates in earlier layers can be negligibly small: per-layer gradient factors smaller than 1 shrink the gradient as it is propagated backward –> negative feedback –> the gradient effectively vanishes and earlier layers barely learn.

11
Q

What is the standard loss function for classification?

A

Cross-entropy loss: assumes that the outputs can be interpreted as probabilities that the input belongs to each class
Specifically, it assumes that the data follow a Bernoulli (for binary classification) or Multinoulli/Categorical (for multi-class classification) distribution.
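
For reference, a sketch of the loss for one sample with C classes, one-hot target y, and predicted probabilities ŷ (e.g. softmax outputs):

```latex
L_{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c
\qquad\text{binary case: } L = -\,y \log \hat{y} - (1 - y)\log(1 - \hat{y})
```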

12
Q

What is the standard loss function for regression?

A

L2-loss: assumes that the residuals (i.e. differences between the true and predicted values) follow a Gaussian distribution.
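
Written out (a sketch; minimizing this is equivalent to maximizing the likelihood under Gaussian residuals with fixed variance):

```latex
L_{2} = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2
```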

13
Q

What is Batch Gradient Descent? BGD

A
  • steepest gradient descent
  • computes the gradient of the cost function w.r.t. the model parameters
  • using the entire training dataset in each iteration
14
Q

What is Stochastic (Online) Gradient Descent?

A
  • computes the gradient of the cost function w.r.t. the model parameters
  • using a single randomly chosen sample in each iteration
15
Q

What is Mini-Batch SGD?

A
  • computes the gradient of the cost function w.r.t. the model parameters
  • using B ≪ M random samples in each iteration (M = number of training samples) – see the sketch below
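
A minimal NumPy sketch contrasting the three variants on a toy linear model (the mean-squared-error loss, the toy data and the batch size B = 32 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy data, M = 1000
w = np.zeros(5)

def grad(w, Xb, yb):
    """Gradient of the mean squared error w.r.t. the weights."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

g_batch = grad(w, X, y)                          # BGD: entire training set
i = rng.integers(len(y))
g_sgd = grad(w, X[i:i + 1], y[i:i + 1])          # SGD: one random sample
idx = rng.choice(len(y), size=32, replace=False)
g_mini = grad(w, X[idx], y[idx])                 # mini-batch: B = 32 << M
```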
16
Q

What are the steps to train a neural network using the backpropagation algorithm and an optimizer like Stochastic Gradient Descent (SGD)?

A

- randomly initialize the weights and biases
- forward the input through the network and get the output
- compute the loss, i.e. the difference between the prediction and the target
- tune the weights and biases of each neuron to minimize the loss (backpropagate the gradients and take an SGD step)
- iterate until the weights are optimized
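
A minimal PyTorch-style sketch of these steps; the toy data, architecture and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

X, y = torch.randn(256, 10), torch.randint(0, 3, (256,))          # toy data
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)                 # weights start random
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):          # iterate until (roughly) converged
    opt.zero_grad()
    out = model(X)               # forward pass
    loss = loss_fn(out, y)       # difference between prediction and target
    loss.backward()              # backpropagation computes all gradients
    opt.step()                   # tune the weights/biases to reduce the loss
```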

17
Q

What is the idea of momentum-based learning?

A

Idea: accelerate in directions with persistent gradients.
- the parameter update is based on the current and past gradients, i.e. previous gradient directions are reused to speed up training and make it more robust against local minima.
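
A common form of the update (a sketch; μ is the momentum coefficient, η the learning rate, and the exact formulation varies between frameworks):

```latex
v_t = \mu\, v_{t-1} - \eta\, \nabla_{\theta} L(\theta_{t-1}),
\qquad
\theta_t = \theta_{t-1} + v_t
```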

18
Q

What is the purpose of the Momentum used in different optimizers?

A

It stabilizes training by computing a moving average over the previous gradients.

19
Q

What is the zero-centering problem?

A

The lack of zero-centered outputs when the sigmoid function is used as the activation function in training neural networks
–> causes a covariate shift in successive layers.

20
Q

how to solve the zero-centering problem?

A

Batch normalization which standardizes the inputs to each layer to have zero mean and unit variance, reducing the amount of internal covariate shift.
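
A minimal NumPy sketch of what a batch-normalization layer computes at training time (γ and β are learnable scale/shift parameters; the running statistics used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Standardize per feature, then scale and shift."""
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable rescaling

x = np.random.randn(64, 8) * 3 + 5            # inputs far from zero-centered
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1
```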

21
Q

What is the dying ReLUs problem?

A
  • a situation where a neuron gets stuck in a state where it outputs 0 for every input
  • the neuron’s weights get updated such that its pre-activation is negative for all inputs, so it outputs 0 → the gradient for that neuron during backpropagation is also 0 → no more updates
22
Q

What are the disadvantages of the powerful neural network with fully connected layers that motivate CNN?

A
  • size problem: too many trainable weights –> too expensive
  • pixels are a bad representation
    + highly correlated neighbouring pixels
    + scale dependent –> struggles with images of different sizes
    + sensitive to intensity variations
  • does not take the spatial relationships between pixels into account
23
Q

What are the advantages of CNN in comparison with fully connected neural networks?

A

- local connectivity
- weight sharing
- translational invariance (patterns are recognized irrespective of their position in the input)
- exploits the grid-like structure of images

24
Q

What decides the choice of function to apply to the output of CNN for the classification problems?

A
  • Multi-class classification: each instance belongs to exactly one class → the probabilities of the different classes should sum up to 1, with one probability significantly larger than the others → effectively deciding the class of the output => use the softmax function
  • Binary classification and multi-label classification: each instance can belong to more than one class → output an independent probability for each class => use the sigmoid function
25
Q

What are the four essential building blocks of Convolutional Neural Networks?

A
  • convolutional layer
  • activation function
  • pooling layer/ subsampling layer
  • Fully connected layer
26
Q

What is the function of the convolutional layer in CNN?

A

→ detects local features from the previous layer (local connectivity)
by sliding a set of filters (or kernels) across the input image.
Each filter is responsible for learning some local feature within the image.
FEATURE LEARNING –> produces one feature map per filter
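
For illustration, a small PyTorch sketch (the filter count and image size are arbitrary assumptions):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 32, 32)    # one RGB image, 32x32 pixels
fmap = conv(img)                   # 16 filters -> 16 feature maps
print(fmap.shape)                  # torch.Size([1, 16, 32, 32])
```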

27
Q

What is the function of the activation function in CNN?

A

provides nonlinearity so the network can learn complex patterns

28
Q

What is the function of the pooling layer?

A

→ compresses and aggregates information across spatial locations
- saves parameters, saves computation, reduces overfitting
- produces downsampled feature maps

29
Q

What is the function of the fully connected layer in CNN?

A

- traditionally used at the end of CNNs for classification tasks
- connects every neuron to every neuron in the previous and subsequent layers, allowing the network to make predictions based on the high-level features learned in earlier layers

30
Q

how can we do backpropagation through a convolutional layer?

A

By convolving the output gradients with the filter flipped horizontally and vertically; this yields the gradient w.r.t. the layer’s input.
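
A minimal NumPy/SciPy sketch checking this for the gradient w.r.t. the input of a "valid" cross-correlation (the usual CNN convolution); the toy sizes and the loss L = sum(Y) are assumptions:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

X, K = np.random.randn(5, 5), np.random.randn(3, 3)
Y = correlate2d(X, K, mode="valid")       # forward: CNN-style "convolution"

dY = np.ones_like(Y)                      # upstream gradient for L = Y.sum()
# Backward w.r.t. the input: full convolution of dY with the flipped filter
# (convolve2d flips the kernel horizontally and vertically internally).
dX = convolve2d(dY, K, mode="full")

# Numerical check via finite differences
eps, dX_num = 1e-6, np.zeros_like(X)
for i in range(5):
    for j in range(5):
        Xp = X.copy(); Xp[i, j] += eps
        dX_num[i, j] = (correlate2d(Xp, K, mode="valid").sum() - Y.sum()) / eps
print(np.allclose(dX, dX_num, atol=1e-4))  # True
```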

31
Q

What is the purpose of convolution with 1x1 filter?

A
  • bottleneck layer
  • flattens or merges channels so that the size of the network decreases –> fewer parameters, fewer computations
  • it also reduces overfitting
32
Q

What are the benefits of 1x1 convolution?

A
  1. 1x1 convolutions simply compute inner products across the channels at each spatial position
  2. a simple and efficient method to decrease the size of a network
  3. learns a dimensionality reduction, e.g. it can reduce redundancy in your feature maps (see the sketch below)
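
A short PyTorch sketch of a 1x1 bottleneck convolution merging 256 channels down to 64 (the channel counts and feature-map size are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)                 # 256 feature maps
bottleneck = nn.Conv2d(256, 64, kernel_size=1)  # inner product across channels
print(bottleneck(x).shape)                       # torch.Size([1, 64, 28, 28])
print(sum(p.numel() for p in bottleneck.parameters()))  # 256*64 + 64 = 16448
```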
33
Q

how do we make CNN architecture better?

A

replace the last, fully connected layer with
flatten + 1x1 (or NxN) convolution + global average pooling

34
Q

what is the bias-variance trade-off?

A

the balance between bias and variance in the performance of a ML model
- bias: systematic error from overly simplistic assumptions of the model
- variance: the model’s sensitivity to fluctuations in the training data
- simultaneously optimizing bias and variance is impossible in general; reducing one typically increases the other

35
Q

what does it mean by high/low bias/variance?

A

Low bias, high variance:
- the model is overly complex and captures noise
- overfitting: performs well on the training data, poorly on new, unseen data
High bias, low variance:
- the model is too simple → underfitting
- performs poorly on both training and test data
Balanced / sensible:
- captures the underlying patterns without being too influenced by noise
- generalizes well to new, unseen data

36
Q

What is model capacity?

A

The capacity of a model describes the variety of functions it can approximate.
It is related to the number of parameters.

37
Q

How does the number of independent training samples affect the loss on the training and test set?

A

- with a small training set: low training loss but high test loss → overfitting
- more training data reduces the variance → the test loss decreases
- the model capacity can then be increased, but it should match the size of the training set
- a model capacity that is far too high for the available data → severe overfitting

38
Q

How does the model capacity affect the loss on the training and test set?

A

Increasing model capacity → decreases both training and test loss
up to the overfitting point → beyond it, overfitting sets in and the test loss increases again
–> a decrease in bias is traded for an increase in variance

39
Q

Techniques to address overfitting in a NN

A
  1. augment data
  2. adapt architecture
  3. adapt the training process
  4. preprocessing
  5. regularizer (in loss function)
  6. dropout
  7. use a validation set and keep the parameters with minimum validation loss (early stopping)
40
Q

How do we use data augmentation to avoid overfitting?

A

Ensure that every transformation is one to which the label should be invariant, e.g. rotation.
1. random spatial transforms
2. pixel transformations (change resolution, add random noise, alter the pixel distribution)
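
A minimal torchvision-style sketch; the particular transforms and their parameters are illustrative choices:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # random spatial transform
    transforms.RandomHorizontalFlip(),        # label-invariant for most photos
    transforms.RandomResizedCrop(224),        # scale / crop variation
    transforms.ColorJitter(brightness=0.2),   # pixel-level transformation
    transforms.ToTensor(),
])
# aug_img = augment(pil_image)   # applied on the fly to each training image
```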

41
Q

What is the main idea of regularization in the loss function?

A

add a penalty term to the loss function

42
Q

What are ways for regularization in the loss function?

A
  • enforce small norm: L2 norm
  • enforce sparsity: L1 norm
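
A small PyTorch sketch adding such penalty terms to the data loss (the λ values are placeholders; in practice L2 is often applied via the optimizer's weight_decay argument):

```python
import torch

def regularized_loss(data_loss, params, lam_l1=0.0, lam_l2=1e-4):
    """Total loss = data term + lambda * penalty on the weight norms."""
    params = list(params)                          # allow iterating twice
    l1_pen = sum(p.abs().sum() for p in params)    # L1 norm -> enforces sparsity
    l2_pen = sum((p ** 2).sum() for p in params)   # squared L2 norm -> small weights
    return data_loss + lam_l1 * l1_pen + lam_l2 * l2_pen

# usage: loss = regularized_loss(loss_fn(out, y), model.parameters(), lam_l1=1e-5)
```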
43
Q

How do the weights behave when the network is trained with L1 and L2 regularization? in comparison with a network without regularization

A
  • L1 norm:
    + shrinks the weights differently than L2 (by a constant amount per step)
    + many weights become exactly 0, especially when lambda is large, i.e. sparse weights
  • L2 norm:
    + weight decay (proportional shrinkage)
    + small weights; more spread-out or diffuse weight vectors
44
Q

What is the purpose of data normalization?

A

standardize the range of the independent variables or features of the data
–> prevents features with large ranges from dominating the learning

45
Q

How should data normalization be used in training NN?

A
  • compute the normalization statistics on the training data ONLY
  • normalization of the input data
  • normalization within the network (e.g. batch normalization)
46
Q

What are some methods of data normalization?

A

- min/max scaling
- z-score / variance normalization
- zero-centering / mean subtraction
- batch normalization
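
A minimal NumPy sketch of z-score normalization with statistics computed on the training split only (the toy data and variable names are placeholders):

```python
import numpy as np

X_train = np.random.randn(100, 4) * 10 + 3     # toy data with arbitrary scale
X_test = np.random.randn(20, 4) * 10 + 3

mean = X_train.mean(axis=0)                    # statistics from training data ONLY
std = X_train.std(axis=0) + 1e-8

X_train_n = (X_train - mean) / std             # zero mean, unit variance per feature
X_test_n = (X_test - mean) / std               # reuse the training statistics
```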

47
Q

What is the benefit of using Batch normalization?

A
  • reduce Internal Covariate Shift
    i.e. It normalizes the distribution of the input for the layer that follows the Batch normalization layer
  • improve stability
48
Q

What is the internal covariate shift problem?

A

refers to the change in the distribution of network activations caused by adjustments in network parameters during training

49
Q

What are the reasons that leads to Internal Covariate Shift?

A
  1. ReLU is not zero-centered
  2. initialization and input distribution might not be normalized
  3. deeper nets –> amplified effect
50
Q

What is a self-normalizing neural network?

A
  • method to address the stability problem of SGD
  • SeLU + specific weight initialization
  • special form of dropout
    –> stable activations, stable training
51
Q

What is the aim of the dropout technique?

A
  • reduce co-adaptation (where different neurons in the network become highly dependent on each other during training –> less robust model) –> more independent features
52
Q

What is the idea of dropout?

A

Randomly drops/deactivates a fraction of the neurons in the network at each update cycle, i.e. they are ignored during the forward pass.
How (see the sketch below):
- set activations to 0 with probability (1 − p) –> the dropout effect must be compensated for, i.e. at test time multiply the activations by p
- DropConnect (dropping individual connections) is a less efficient alternative implementation
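
A minimal NumPy sketch of the classic (non-inverted) formulation described above, with keep probability p:

```python
import numpy as np

def dropout(a, p=0.8, train=True):
    """Drop activations with probability (1 - p) during training;
    scale by p at test time to compensate for the missing units."""
    if train:
        mask = np.random.rand(*a.shape) < p   # keep each unit with probability p
        return a * mask
    return a * p                              # test-time compensation

a = np.ones((2, 5))
print(dropout(a, p=0.8, train=True))    # ~20% of the entries zeroed out
print(dropout(a, p=0.8, train=False))   # all entries scaled by 0.8
```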

53
Q

How would initialization affect convex optimization problems?

A
  • it does not matter for the final result, i.e. gradient descent always reaches the global minimum
  • but a bad initialization –> slow convergence, more computational resources
54
Q

How would initialization affect non-convex optimization problems?

A
  • it does matter
  • a NN with non-linearities is in general non-convex
55
Q

How should biases be initialized?

A
  • simply initialized to 0
  • when using ReLU, a small positive constant, e.g. 0.1, can be better because of the dying-ReLU issue
56
Q

How should weights be initialized?

A
  • randomly
  • initializing all weights with 0 is bad (symmetry is never broken)
  • use small uniform / Gaussian values,
    e.g. uniform random in the range [0, 1]
57
Q

What is the idea of Xavier initialization?

A

-calibrate the variances for the forward pass by initializing with a zero-mean Gaussian
-takes the number of input features into account

58
Q

What is the idea of He initialization?

A

effective when used with activation functions that have a mean close to zero e.g. ReLU
- scale the weights by a factor that takes the number of input features into account –> helps keep the variance of the activations roughly the same across different layers.
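
A minimal NumPy sketch of both schemes for a fully connected layer (the exact variance formulas vary slightly between sources; the fan-in-only variants are shown):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # zero-mean Gaussian with variance calibrated to the number of inputs
    return np.random.randn(fan_out, fan_in) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out):
    # extra factor 2 compensates for ReLU zeroing half of the activations
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

W_tanh = xavier_init(256, 128)   # e.g. for tanh / sigmoid layers
W_relu = he_init(256, 128)       # e.g. for ReLU layers
```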

59
Q

What is transfer learning?

A

reuse models / use a pre-trained model on a new problem
- for a different task on the same data
- on different data for the same task
- on different data for a different task

60
Q

How does transfer learning work?

A
  • weight transfer (diff task)
    e.g. pre-trained model –> image classification –> target model: object detection/ segmentation
  • transfer between modalities (diff data type)
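
A minimal weight-transfer sketch in PyTorch / torchvision (the ResNet-18 backbone, the 10-class target task and freezing the whole backbone are illustrative assumptions):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")      # source model pre-trained on ImageNet
for p in model.parameters():
    p.requires_grad = False                     # freeze the transferred weights
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the target task
# only model.fc is then trained on the (possibly small) target dataset
```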
61
Q

What are the benefits of transfer learning?

A
  • weight transfer:
    + Capitalizes on features learned by the source model,
    + speeding up training on the target task.
  • transfer between modalities
    +benefit from the representation learning achieved in one modality when training on a different modality.
    +Useful when labeled data is scarce in the target modality.
62
Q

What is multi-task learning (MTL)?

A

train a network simultaneously on multiple related tasks

63
Q

What is hard parameter sharing in MTL?

A
  • several hidden layers are shared between all tasks (usually feature extraction layer)
  • MTL of N tasks –> reduce chance of overfitting by an order of N
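
A minimal PyTorch sketch of hard parameter sharing: one shared feature-extraction trunk and one small head per task (the layer sizes and the two tasks are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # shared by all tasks
        self.head_a = nn.Linear(32, 10)   # task A, e.g. classification
        self.head_b = nn.Linear(32, 1)    # task B, e.g. regression

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

out_a, out_b = HardSharingMTL()(torch.randn(8, 64))
# total loss = loss_a(out_a, y_a) + loss_b(out_b, y_b)
```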
64
Q

What is soft parameter sharing?

A

Each model has its own parameters.
Instead of forcing the parameters to be equal, the distance between the parameters of the different models is regularized as part of the loss function.
Options: e.g. L2-norm, trace norm, …

65
Q

What are auxiliary tasks for?

A
  • to create a more stable network
  • additional tasks learned alongside the original task,
    e.g. facial landmark detection + learning subtly related tasks such as face pose, smile/no smile, glasses/no glasses, gender