General Deep Learning and Machine Learning Flashcards

1
Q

What is the Vanishing Gradient Problem?

A

During back-propagation, the gradient for earlier layers depends on the product of the derivatives of the activation functions of all later layers. If those derivatives are all smaller than 1, the gradient for the earlier layers tends toward 0.
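
A schematic sketch of why this happens, assuming a plain chain of layers where layer i has weight w_i and activation derivative f'(z_i): by the chain rule, the gradient reaching the first layer is roughly proportional to a product with one term per later layer,

∂L/∂w_1 ∝ f'(z_1) · w_2 · f'(z_2) · w_3 · ... · w_n · f'(z_n)

With sigmoid activations, for example, each f'(z_i) is at most 0.25, so the product shrinks exponentially with depth and the earliest layers barely learn.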

2
Q

What is the most common activation unit nowadays and why?

A

ReLU, because it helps solve the vanishing gradient problem

3
Q

What are some ReLU variations?

A

-ELU (Exponential linear unit)
-Leaky ReLU (Gradient is not 0 for values smaller than 0)
-Swish (similar to ReLU but smoother; works well for very deep networks)
-Maxout
-PReLU
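
A minimal NumPy sketch of a few of these variants (the alpha values are common illustrative defaults, not fixed by the definitions):

import numpy as np

def relu(x):
    return np.maximum(0, x)                              # 0 for x < 0, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                 # small non-zero slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))   # smooth, saturates towards -alpha

def swish(x):
    return x / (1 + np.exp(-x))                          # x * sigmoid(x), a smoother ReLU-like curve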

4
Q

What does the Softmax function do?

A

It converts a group of input values (scores) into probabilities that sum to 1
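
A minimal NumPy sketch, assuming the input is a 1-D vector of raw scores:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()          # non-negative values that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # -> roughly [0.659, 0.242, 0.099]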

5
Q

True or False: You should always start with ReLU activation functions for all Neural networks

A

False; RNNs, for example, tend to perform well with tanh

6
Q

What are CNNs useful for?

A

For data that cannot easily be described as columns of features (such as images), since CNNs can find features regardless of where they appear in the input

7
Q

What is an example of a classic use for CNNs?

A

Image processing

8
Q

What are some examples of CNN layers and what do they do?

A

-Conv2D: Performs the actual convolution
-MaxPooling2D: Keeps the maximum value within each pooling window, downsampling the input
-Flatten: Converts the 2D feature maps into a 1D vector
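
A minimal Keras sketch combining these layers (the layer sizes and the 28x28x1 input shape are illustrative choices, not prescribed):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g. 28x28 grayscale images
    layers.Conv2D(32, (3, 3), activation='relu'),  # the actual convolution
    layers.MaxPooling2D((2, 2)),                   # keep the max of each 2x2 window
    layers.Flatten(),                              # 2D feature maps -> 1D vector
    layers.Dense(10, activation='softmax'),        # e.g. a 10-class output
])
model.summary()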

9
Q

What are some famous CNN models?

A

-LeNet5
-AlexNet
-GoogLeNet
-ResNet

10
Q

What are the uses of RNN models?

A

Predicting future behaviour based on past behaviour

11
Q

What are the existing RNN topologies?

A

-Sequence to Sequence: a sequence of values is used to predict another sequence of values
-Sequence to Vector: a sequence of values is used to predict a single vector
-Vector to Sequence: a vector is used to predict a sequence of values
-Encoder -> Decoder: sequence -> vector -> sequence
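
A minimal Keras sketch of the first two topologies; the 30x1 input shape and layer sizes are illustrative, and return_sequences controls whether the LSTM emits one output per time step or only the last one:

from tensorflow.keras import layers, models

# Sequence to Vector: read a whole sequence, emit a single prediction
seq_to_vec = models.Sequential([
    layers.Input(shape=(30, 1)),                 # 30 time steps, 1 feature each
    layers.LSTM(32),                             # returns only the final output
    layers.Dense(1),
])

# Sequence to Sequence: emit one prediction per time step
seq_to_seq = models.Sequential([
    layers.Input(shape=(30, 1)),
    layers.LSTM(32, return_sequences=True),      # one output per time step
    layers.TimeDistributed(layers.Dense(1)),
])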

12
Q

True or False: Performing back-propagation through time has no significant impact on training performance, and the number of time steps is irrelevant for training speed

A

False; training with many time steps makes an RNN similar to a very deep neural network

13
Q

What are some RNN cells?

A

LSTM cell: Maintains separate long-term and short-term memory states
GRU cell: A simplified LSTM that performs about as well

14
Q

What are some EC2/EMR instances appropriate for deep learning?

A

-P3
-P2
-G5
-G5g (also used for Android game streaming)
-Trn1 (optimized for training)
-Trn1n (more bandwidth than Trn1)
-Inf2 (Powered by AWS Inferentia, optimized for inference)

15
Q

What are the pros and cons of small and large learning rates?

A

Large learning rates train faster, but can overshoot the correct solution. Small learning rates don’t have that problem, but are slower.

16
Q

Should you use small or large batch sizes when training?

A

Small batch sizes, because they are less likely to get stuck in local minima, whereas large batch sizes can converge on the wrong solution at random

17
Q

What is the point of regularization techniques?

A

To prevent overfitting

18
Q

What are some techniques to prevent vanishing/exploding gradient?

A

-Using better activation functions (e.g. ReLU)
-Multi-level hierarchy (training multiple sub-models instead of one large model)
-Long short-term memory (LSTM)
-Residual networks (an ensemble of smaller networks)

19
Q

What is the difference between L1 and L2 regularization in mathematical terms?

A

In both cases, penalty terms are added to the loss function being minimized. In L1, the sum of the absolute values of the weights is added, while in L2 the sum of the squares of the weights is added.
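
In symbols, with λ as the regularization strength:

L1: total loss = loss + λ · Σ |w_i|
L2: total loss = loss + λ · Σ w_i²

A minimal Keras sketch of attaching such a penalty to one layer (the layer size and the 0.01 strength are illustrative):

from tensorflow.keras import layers, regularizers

# L2 penalty on this layer's weights; swap in regularizers.l1(0.01) for L1
dense = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))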

20
Q

What are the comparative advantages and disadvantages between L1 and L2 regularization?

A

L2 regularization is more computationally efficient and produces a dense network, only shrinking the weights of irrelevant features, while L1 is less efficient and produces sparse networks, driving some weights all the way to 0. The main advantage of L1 is that it can be used for feature selection, making it possible to remove irrelevant features entirely. If all features are relevant, however, L2 tends to perform better.

21
Q

What are some synonyms for Recall?

A

-Sensitivity
-True Positive rate
-Completeness

22
Q

What are some synonyms for precision?

A

-Positive predictive value
-Correct positives rate

23
Q

What is the True Negative rate?

A

It is the recall calculated for the negative class (also called specificity): TN / (TN + FP)

24
Q

How do you calculate F1?

A

(2 x Precision x Recall) / (Precision + Recall)
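
A quick worked example with made-up counts: with 8 true positives, 2 false positives and 4 false negatives, precision = 8/10 = 0.8 and recall = 8/12 ≈ 0.667, so F1 = (2 × 0.8 × 0.667) / (0.8 + 0.667) ≈ 0.727.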

25

Q

What is the ROC curve?

A

It is a plot of the True Positive Rate against the False Positive Rate at different classification thresholds. It helps explain how well the classifier can distinguish a positive from a negative.

26

Q

What is the AUC?

A

It is the area under the ROC curve. It gives the probability that, given two samples, one positive and one negative, the classifier ranks the positive one above the negative one.
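
A minimal scikit-learn sketch; the labels and scores below are made up for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]            # toy true labels
y_score = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
print(roc_auc_score(y_true, y_score))               # area under that curve, 0.75 here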

27

Q

What is Bagging?

A

It is an ensemble training practice where you train multiple versions of a model, each on a random sample of the training data drawn with replacement (a bootstrap sample).

28

Q

What is Boosting?

A

It is an ensemble training practice where you train new models based on the successes and failures of previous ones.

29

Q

Which one has a higher tendency to overfit, Bagging or Boosting?

A

Boosting

30

Q

What is the PR Curve?

A

A precision-recall curve: a plot of Precision against Recall at different classification thresholds.

31

Q

True or False: Neither Bagging nor Boosting can be parallelized

A

False; only Boosting cannot be parallelized, because each new model depends on the previous ones.