Deep Learning Flashcards

1
Q

Sigmoid Activation Function

A

Squashes numbers into the range [0, 1].

Advantage: it has a nice interpretation as the firing rate of a neuron.

Disadvantages: (1) saturated neurons kill the gradient, (2) the output is not zero-centered, leading to inefficient gradient updates, (3) the exponential is computationally expensive.

Very rarely used anymore.
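
A minimal NumPy sketch (function names are illustrative) of the sigmoid and its gradient, showing how the gradient vanishes for large |x| (the saturation problem above):

import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative is sigmoid(x) * (1 - sigmoid(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # ~0.25, the maximum
print(sigmoid_grad(10.0))  # ~4.5e-05, gradient has effectively vanished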

2
Q

Tanh(x) Activation Function

A

Squashes numbers into the range [-1, 1]. Zero-centered, but still kills gradients when saturated.

Generally better than sigmoid.
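
A short NumPy comparison (illustrative) of the two output ranges: tanh is zero-centered while sigmoid is not, but both saturate for large |x|:

import numpy as np

x = np.array([-5.0, 0.0, 5.0])
print(np.tanh(x))                # [-0.9999  0.  0.9999] -> zero-centered, range (-1, 1)
print(1.0 / (1.0 + np.exp(-x)))  # [ 0.0067  0.5 0.9933] -> always positive, range (0, 1)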

3
Q

ReLU (Rectified Linear Unit)

A

max(0, x)

Does not saturate in the positive region, is computationally efficient, and converges faster than sigmoid and tanh in practice.

Neurons can still die: if a neuron's input is always negative, its gradient is zero and it stops updating.
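
A minimal sketch of ReLU and its gradient, illustrating why a neuron whose inputs are always negative receives zero gradient (the dying-ReLU problem):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for x > 0, and 0 otherwise
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))       # [0.  0.  0.  1.5]
print(relu_grad(x))  # [0. 0. 0. 1.] -- no gradient flows through negative inputs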

4
Q

Leaky ReLU

A

Strictly better than ReLU: it solves the dying-neuron problem by giving negative inputs a small nonzero slope.

Does not saturate, is computationally efficient, and still converges faster than sigmoid and tanh.
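
A sketch of Leaky ReLU with a small negative slope (0.01 here is an assumed, commonly used default):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # the small slope alpha keeps a nonzero gradient for x < 0
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]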

5
Q

Preprocessing Data techniques

A

Zero-center the data (subtract the mean) and normalize to unit variance (normalization is often not necessary for images, since pixel values already share a common scale).
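
A minimal NumPy sketch (shapes are illustrative) of zero-centering and normalizing a data matrix per feature:

import numpy as np

X = np.random.randn(1000, 20) * 3.0 + 5.0    # fake dataset: 1000 samples, 20 features

X_centered = X - X.mean(axis=0)              # zero-center each feature
X_normalized = X_centered / X.std(axis=0)    # scale to unit variance (often skipped for images)

print(X_normalized.mean(axis=0).round(3))    # ~0 for every feature
print(X_normalized.std(axis=0).round(3))     # ~1 for every feature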

6
Q

Methods for model regularization

A

Dropout
Batch Norm
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth

Generally, start with batch norm and add some of the others if you see the model overfitting the training data.
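
A hypothetical PyTorch sketch (layer sizes are made up) of that pattern: batch norm paired with the conv layers from the start, dropout added only if overfitting shows up:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),          # start with batch norm
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),           # add dropout (or another regularizer) if the model overfits
    nn.Linear(64, 10),
)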

7
Q

Dropout

A

The term “dropout” refers to dropping out nodes (in the input and hidden layers) of a neural network. All forward and backward connections of a dropped node are temporarily removed, creating a thinned sub-network of the parent network. Nodes are dropped with a dropout probability p.

Generally, for the input layer the keep probability (1 - drop probability) is kept close to 1, with 0.8 suggested by the authors. For the hidden layers, the higher the drop probability, the sparser the model; a keep probability of 0.5 (dropping 50% of the nodes) is typically cited as optimal.
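
A minimal NumPy sketch of inverted dropout (an assumed, commonly used formulation): each activation is kept with probability keep_prob and the survivors are rescaled so the expected value matches test time:

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                      # no dropout at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob       # rescale so the expectation is unchanged

h = np.ones((2, 8))
print(dropout(h, keep_prob=0.5))  # roughly half the units zeroed, survivors scaled to 2.0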

8
Q

Batch Norm

A

Batch Normalization extends the concept of normalization from the input layer to the activations of each hidden layer throughout the network. By normalizing the activations of each layer, it helps alleviate the internal covariate shift problem, which can hinder convergence during training.
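
A minimal NumPy sketch of the training-time batch-norm transform for a fully connected layer (gamma, beta, and eps are the usual learnable scale, shift, and stability constant; running averages for test time are omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, features); normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

x = np.random.randn(32, 4) * 10 + 3
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1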

9
Q

Data augmentation

A

Data augmentation is the process of artificially generating new training examples from existing data, e.g. by randomly flipping, cropping, rotating, or color-jittering images.
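
A small NumPy sketch (illustrative only) of two common image augmentations, a random horizontal flip and a random crop:

import numpy as np

def random_flip(img):
    # img: (H, W, C); flip left-right with probability 0.5
    return img[:, ::-1, :] if np.random.rand() < 0.5 else img

def random_crop(img, size):
    h, w, _ = img.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

img = np.random.rand(32, 32, 3)                # fake image
augmented = random_crop(random_flip(img), 28)  # a new 28x28 training example
print(augmented.shape)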
