General Deep Learning and Machine Learning Flashcards

1
Q

What is the Vanishing Gradient Problem?

A

During back-propagation, the gradient for earlier layers depends on the product of the derivatives of the activation functions of all later layers. If those derivatives are all smaller than 1, the gradient for the earlier layers tends toward 0.
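
A schematic sketch of why this happens, assuming a plain chain of layers where layer i has weight w_i and activation derivative f'(z_i): by the chain rule, the gradient reaching the first layer is roughly proportional to a product with one term per later layer,

∂L/∂w_1 ∝ f'(z_1) · w_2 · f'(z_2) · w_3 · ... · w_n · f'(z_n)

With sigmoid activations, for example, each f'(z_i) is at most 0.25, so the product shrinks exponentially with depth and the earliest layers barely learn.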

2
Q

What is the most common activation unit nowadays and why?

A

ReLU, because it helps solve the vanishing gradient problem

3
Q

What are some ReLU variations?

A

-ELU (Exponential linear unit)
-Leaky ReLU (Gradient is not 0 for values smaller than 0)
-Swish (similar to ReLU but smoother; works well for very deep networks)
-Maxout
-PReLU
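
A minimal NumPy sketch of a few of these variants (the alpha values are common illustrative defaults, not fixed by the definitions):

import numpy as np

def relu(x):
    return np.maximum(0, x)                              # 0 for x < 0, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                 # small non-zero slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))   # smooth, saturates towards -alpha

def swish(x):
    return x / (1 + np.exp(-x))                          # x * sigmoid(x), a smoother ReLU-like curve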

4
Q

What does the Softmax function do?

A

It converts a group of input values (scores) into probabilities that sum to 1
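
A minimal NumPy sketch, assuming the input is a 1-D vector of raw scores:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()          # non-negative values that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # -> roughly [0.659, 0.242, 0.099]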

5
Q

True or False: You should always start with ReLU activation functions for all Neural networks

A

False; RNNs, for example, tend to perform well with tanh

6
Q

What are CNNs useful for?

A

For data that cannot easily be described as columns of features (such as images), since CNNs can find features regardless of where they appear in the input

7
Q

What is an example of a classic use for CNNs?

A

Image processing

8
Q

What are some examples of CNN layers and what do they do?

A

-Conv2D: Performs the actual convolution
-MaxPooling2D: Keeps the maximum value within each pooling window, downsampling the input
-Flatten: Converts the 2D feature maps into a 1D vector
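
A minimal Keras sketch combining these layers (the layer sizes and the 28x28x1 input shape are illustrative choices, not prescribed):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g. 28x28 grayscale images
    layers.Conv2D(32, (3, 3), activation='relu'),  # the actual convolution
    layers.MaxPooling2D((2, 2)),                   # keep the max of each 2x2 window
    layers.Flatten(),                              # 2D feature maps -> 1D vector
    layers.Dense(10, activation='softmax'),        # e.g. a 10-class output
])
model.summary()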

9
Q

What are some famous CNN models?

A

-LeNet5
-AlexNet
-GoogLeNet
-ResNet

10
Q

What are the uses of RNN models?

A

Predicting future behaviour based on past behaviour

11
Q

What are the existing RNN topologies?

A

-Sequence to Sequence: a sequence of values is used to predict another sequence of values
-Sequence to Vector: a sequence of values is used to predict a single vector
-Vector to Sequence: a vector is used to predict a sequence of values
-Encoder -> Decoder: sequence -> vector -> sequence
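
A minimal Keras sketch of the first two topologies; the 30x1 input shape and layer sizes are illustrative, and return_sequences controls whether the LSTM emits one output per time step or only the last one:

from tensorflow.keras import layers, models

# Sequence to Vector: read a whole sequence, emit a single prediction
seq_to_vec = models.Sequential([
    layers.Input(shape=(30, 1)),                 # 30 time steps, 1 feature each
    layers.LSTM(32),                             # returns only the final output
    layers.Dense(1),
])

# Sequence to Sequence: emit one prediction per time step
seq_to_seq = models.Sequential([
    layers.Input(shape=(30, 1)),
    layers.LSTM(32, return_sequences=True),      # one output per time step
    layers.TimeDistributed(layers.Dense(1)),
])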

12
Q

True or False: Performing back-propagation through time has no significant impact on training performance, and the number of time steps is irrelevant for training speed

A

False; training with many time steps makes an RNN similar to a very deep neural network

13
Q

What are some RNN cells?

A

LSTM cell: Maintains separate long-term and short-term memory states
GRU cell: A simplified LSTM that performs about as well

14
Q

What are some EC2/EMR instances appropriate for deep learning?

A

-P3
-P2
-G5
-G5g (also used for Android game streaming)
-Trn1 (optimized for training)
-Trn1n (more bandwidth than Trn1)
-Inf2 (Powered by AWS Inferentia, optimized for inference)

15
Q

What are the pros and cons of small and large learning rates?

A

Large learning rates train faster, but can overshoot the correct solution. Small learning rates don’t have that problem, but are slower.

16
Q

Should you use small or large batch sizes when training?

A

Small batch sizes, because they are less likely to get stuck in local minima, whereas large batch sizes can converge on the wrong solution at random

17
Q

What is the point of regularization techniques?

A

To prevent overfitting

18
Q

What are some techniques to prevent vanishing/exploding gradient?

A

-Using better activation functions (e.g. ReLU)
-Multi-level hierarchy (training multiple sub-models instead of one large model)
-Long short-term memory (LSTM)
-Residual networks (an ensemble of smaller networks)

19
Q

What is the difference between L1 and L2 regularization in mathematical terms?

A

In both cases, penalty terms are added to the loss function being minimized. In L1, the sum of the absolute values of the weights is added, while in L2 the sum of the squares of the weights is added.
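
In symbols, with λ as the regularization strength:

L1: total loss = loss + λ · Σ |w_i|
L2: total loss = loss + λ · Σ w_i²

A minimal Keras sketch of attaching such a penalty to one layer (the layer size and the 0.01 strength are illustrative):

from tensorflow.keras import layers, regularizers

# L2 penalty on this layer's weights; swap in regularizers.l1(0.01) for L1
dense = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))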

20
Q

What are the comparative advantages and disadvantages between L1 and L2 regularization?

A

L2 regularization is more computationally efficient and produces a dense network, only shrinking the weights of irrelevant features, while L1 is less efficient and produces sparse networks, driving some weights all the way to 0. The main advantage of L1 is that it can be used for feature selection, making it possible to remove irrelevant features entirely. If all features are relevant, however, L2 tends to perform better.

21
Q

What are some synonyms for Recall?

A

-Sensitivity
-True Positive rate
-Completeness

22
Q

What are some synonyms for precision?

A

-Positive predictive value
-Correct positives rate

23
Q

What is the True Negative rate?

A

It is the recall calculated for the negative class (also called specificity): TN / (TN + FP)

24
Q

How do you calculate F1?

A

(2 x Precision x Recall) / (Precision + Recall)
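
A quick worked example with made-up counts: with 8 true positives, 2 false positives and 4 false negatives, precision = 8/10 = 0.8 and recall = 8/12 ≈ 0.667, so F1 = (2 × 0.8 × 0.667) / (0.8 + 0.667) ≈ 0.727.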

25

Q

What is the ROC curve?

A

It is a plot of the True Positive Rate against the False Positive Rate at different classification thresholds. It helps explain how well the classifier can distinguish a positive from a negative.

26

Q

What is the AUC?

A

It is the area under the ROC curve. It gives the probability that, given two samples, one positive and one negative, the classifier ranks the positive one above the negative one.
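
A minimal scikit-learn sketch; the labels and scores below are made up for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]            # toy true labels
y_score = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
print(roc_auc_score(y_true, y_score))               # area under that curve, 0.75 here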

27

Q

What is Bagging?

A

It is an ensemble training practice where you train multiple versions of a model, each on a random sample of the training data drawn with replacement (a bootstrap sample).

28

Q

What is Boosting?

A

It is an ensemble training practice where you train new models based on the successes and failures of previous ones.

29

Q

Which one has a higher tendency to overfit, Bagging or Boosting?

A

Boosting

30

Q

What is the PR Curve?

A

A precision-recall curve: a plot of Precision against Recall at different classification thresholds.

31

Q

True or False: Neither Bagging nor Boosting can be parallelized

A

False; only Boosting cannot be parallelized, because each new model depends on the previous ones.