Neural network fundamentals flashcards

1
Q

Neural network representation

A
2
Q

Why is deep learning taking off only now

A
  • Larger amounts of data
  • Faster computation (specialised hardware such as GPUs)
  • Algorithms
    • The ReLU activation function trains faster than sigmoid (the sigmoid gradient becomes very small when the local field is large in magnitude)
3
Q

Describe Tanh() in terms of sigmoid

A

The tanh activation function is a rescaled and shifted version of the sigmoid, tanh(z) = 2 * sigmoid(2z) - 1, so its range is -1 to 1 instead of 0 to 1
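
The relationship between the two functions can be checked numerically. The sketch below assumes NumPy and uses the identity tanh(z) = 2 * sigmoid(2z) - 1; the helper name `sigmoid` and the test grid are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# tanh is a rescaled and shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0

# Largest deviation from NumPy's own tanh (should be at rounding-error level)
max_abs_diff = np.max(np.abs(tanh_from_sigmoid - np.tanh(z)))
```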

4
Q

Tanh activation vs sigmoid activation

A
  • Tanh usually works better in hidden layers because the mean of the layer's activations is closer to zero, which makes learning easier for the next layer
  • Only use sigmoid as the activation for the output layer when doing binary classification (0/1 output values)
  • A downside of both activation functions is that if z (the local field of the neuron) is very large in magnitude, the slope/gradient is very small, which makes gradient descent slow
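
To make the saturation point concrete, the sketch below evaluates the derivatives of sigmoid and tanh at a few values of z and compares them with ReLU. It assumes NumPy; the function names and the sample values of z are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, else 0."""
    return (z > 0).astype(float)

z = np.array([0.0, 2.0, 10.0])
grads = {
    "sigmoid": sigmoid_grad(z),  # peaks at 0.25 for z = 0, then shrinks
    "tanh": tanh_grad(z),        # peaks at 1.0 for z = 0, then shrinks
    "relu": relu_grad(z),        # stays at 1.0 for every positive z
}
```

For z = 10 both sigmoid and tanh have almost no gradient left, while ReLU still passes a gradient of 1.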
5
Q

Softmax function

A
  • Each output is between 0 and 1
  • The outputs of the layer add up to 1
  • Can be viewed as a vector of probabilities
  • Typically used only in the output layer, for multiclass classification
  • Combined with the cross-entropy loss
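
A minimal sketch of the softmax function, assuming NumPy and a column-per-example layout. Subtracting the column maximum is a common numerical-stability step and is an addition here, not something stated on the card.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; z has shape (num_classes, num_examples)."""
    shifted = z - np.max(z, axis=0, keepdims=True)  # stability: avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Two examples with three classes each (illustrative logits)
logits = np.array([[2.0, 0.0],
                   [1.0, 0.0],
                   [0.1, 0.0]])
probs = softmax(logits)  # each column is a probability vector
```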
6
Q

Softmax activation function:

Visualize a simple neural net with one softmax layer

A
  • Multiclass logistic regression
  • Linear decision boundaries separating classes
7
Q

Forward propagation

A
  • In the vectorized form we can feed the whole dataset X through the network at once
  • Each column of a layer's activation matrix corresponds to one input example
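
A sketch of vectorized forward propagation through one hidden layer, assuming NumPy. The layer sizes, the tanh hidden activation, the sigmoid output and the names `W1`, `b1`, `W2`, `b2` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 3, 4, 5              # input features, hidden units, examples
X = rng.normal(size=(n_x, m))      # each column is one input example

W1 = rng.normal(size=(n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(1, n_h)) * 0.01
b2 = np.zeros((1, 1))

# One matrix product per layer processes all m examples at once
Z1 = W1 @ X + b1                   # b1 is broadcast across the m columns
A1 = np.tanh(Z1)                   # column j holds the activations for example j
Z2 = W2 @ A1 + b2
A2 = 1.0 / (1.0 + np.exp(-Z2))     # sigmoid output, shape (1, m)
```

Column j of `A1` equals what a per-example loop would compute for `X[:, j]`, so the loop over examples disappears.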
8
Q

Vectorized forward propagation

A
9
Q

Intuition in deep learning

A
  • Earlier layers detect simpler functions
  • Later layers compose these functions to form more complex patterns (e.g. eyes, mouth)
  • The same applies to other domains, not just images
    • e.g. in sound, low-level functions can be audio tones, while later layers can represent complex forms such as phrases
10
Q

Circuit theory and deep learning

A
  • Computing the XOR of many inputs with a single hidden layer requires an exponentially large number of hidden units (right network)
  • Computing it with several layers uses far fewer neurons (left network)
11
Q

Optimization based learning

A
12
Q

Empirical risk

A
  • An approximation of the true expected loss based on training data.
  • Training data consists of a finite set of samples from the true distribution p(x,y), so it does not capture the distribution fully
13
Q

How to make a neural network generalize better on unseen data?

A
  • Adjust the loss to add regularization terms
  • Split data into training, validation, and test
14
Q

Optimization based learning

A
15
Q

Empirical risk vs generalization

A
16
Q

Using training data to formulate the loss

A
17
Q

Local optima in neural networks

A
18
Q

Problem of plateau

A
19
Q

Maximum likelihood estimation

A
  • A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
20
Q

One hot encoding

A
21
Q

When should we use the cross-entropy loss function?

A

For classification, when we use the softmax function on the output layer
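
As an illustration of how the loss is evaluated on softmax outputs, the sketch below one-hot encodes integer labels and computes the mean cross-entropy. It assumes NumPy; the helper names, the small `eps` guard against log(0) and the example probabilities are illustrative.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (num_classes, num_examples) matrix with a single 1 per column."""
    encoded = np.zeros((num_classes, labels.size))
    encoded[labels, np.arange(labels.size)] = 1.0
    return encoded

def cross_entropy(probs, targets):
    """Mean cross-entropy; both arguments have shape (num_classes, num_examples)."""
    eps = 1e-12  # assumed guard so that log(0) never occurs
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=0))

labels = np.array([0, 2])            # true classes of two examples
Y = one_hot(labels, num_classes=3)
P = np.array([[0.7, 0.1],            # softmax outputs, one column per example
              [0.2, 0.1],
              [0.1, 0.8]])
loss = cross_entropy(P, Y)
```

Because each target column has a single 1, only the predicted probability of the true class contributes to the loss.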

22
Q

One hot encoding and the cross entropy loss

A
23
Q

Broadcasting

A
  • Describes how NumPy treats arrays with different shapes during arithmetic operations: the smaller array is “broadcast” across the larger array so that they have compatible shapes
  • e.g. C = A + b where C_{ij} = A_{ij} + b_j
    • In other words, the vector b is added to each row of the matrix A.
    • This shorthand eliminates the need to define a matrix B with b copied into each row before doing the addition
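
The C = A + b example can be reproduced directly in NumPy. In this small sketch the array values are illustrative, and `np.tile` builds the explicit matrix B only to show what broadcasting saves.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([10.0, 20.0, 30.0])

C = A + b                          # b is broadcast across both rows of A
B = np.tile(b, (A.shape[0], 1))    # explicit copy of b into each row
C_explicit = A + B                 # same result without broadcasting
```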
24
Q

Batch vs mini-batch

A

Batch means that you use all your data to compute the gradient during one iteration. Mini-batch means you only take a subset of all your data during one iteration.

25
Q

Epoch

A
  • Single pass through the whole training set
  • For example, a dataset of 2000 examples divided into batches of 500 takes 4 iterations to complete 1 epoch (batch size 500, 4 iterations per epoch).
  • The number of batches equals the number of iterations in one epoch.
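
The arithmetic can be written as a small helper. Rounding up for a final, smaller batch is an assumption for dataset sizes that do not divide evenly; the function name is illustrative.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of mini-batches (and hence iterations) in one pass over the data."""
    return math.ceil(num_examples / batch_size)

n_iter = iterations_per_epoch(2000, 500)        # the example from the card
n_iter_batch = iterations_per_epoch(2000, 2000) # full-batch training
```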
26
Q

What does training progress look like for batch training and mini-batch training?

A
  • The loss should go down on every iteration for batch training
    • If it ever goes up then something is wrong (e.g. the learning rate is too high)
  • Mini-batch training is slightly noisier, but the overall trend is that the loss decreases as the number of iterations grows
27
Q

Stochastic gradient descent

A
  • Each data sample is its own batch (size of batch = 1)
  • Very noisy, but
  • Won’t ever completely converge to the minimum, but will jump around it
  • Loses the speed-up from vectorization
28
Q

Mini-batch gradient descent

A
  • Train on a subset of the data (a mini-batch) at each step rather than on the whole training set
  • Benefits (works best in practice compared with batch training or stochastic gradient descent)
    • See progress in gradient descent without processing the entire dataset
    • Converges quicker because each gradient step is computed on fewer samples and is therefore cheaper
    • Less likely to get stuck in a local minimum than batch training
    • Takes less main memory
  • Reshuffle the samples assigned to each mini-batch every epoch
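
A minimal sketch of a mini-batch training loop with reshuffling every epoch, assuming NumPy and a toy linear-regression problem. The data, learning rate, batch size and helper name `minibatches` are illustrative; rows are examples in this sketch.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    order = rng.permutation(X.shape[0])        # new random order on each call
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w                                 # noiseless targets, for illustration

w = np.zeros(2)
learning_rate = 0.1
for epoch in range(50):
    # one epoch = 200 / 20 = 10 iterations
    for X_b, y_b in minibatches(X, y, batch_size=20, rng=rng):
        grad = 2.0 * X_b.T @ (X_b @ w - y_b) / len(y_b)  # gradient of the mean squared error
        w -= learning_rate * grad
```

Each update touches only 20 rows, so progress is visible long before a full pass over the data completes.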
29
Q

Batch training

A
  • Use the whole dataset during each step of gradient descent
  • Slower than mini-batch or stochastic gradient descent
  • However, progress is more accurate and precise
30
Q

Visually explain

  • Batch training
  • Mini-batch training
  • Stochastic gradient descent
A
31
Q

Gradient descent with momentum. Describe the “momentum”

A
  • Compute an exponentially weighted average of the gradients
  • Each gradient descent step depends on the previous steps

On iteration t, compute dW and db on the current mini-batch:

v_{dW} = β v_{dW} + (1 - β) dW

v_{db} = β v_{db} + (1 - β) db

W = W - α v_{dW},  b = b - α v_{db}

  • The last line is a gradient descent step in which v_{dW} and v_{db} stand in for the gradients dW and db.
  • The vector v_{dW} is an exponentially weighted average of the recent gradients dW, which reduces the oscillations in dW

Hyperparameters: learning rate α and averaging coefficient β

β = 0.9: roughly an average over the last 10 gradients, since 1/(1 - β) = 10

β = 0.5: roughly an average over the last 2 gradients
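
The update rule above can be sketched for a single parameter array as follows. It assumes NumPy and an illustrative quadratic cost with very different curvature per coordinate; the function name and the hyperparameter values are assumptions for the example only.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.1, beta=0.9):
    """One step of gradient descent with momentum."""
    v_dW = beta * v_dW + (1.0 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v_dW                     # step along the averaged gradient
    return W, v_dW

# Illustrative cost J(W) = 0.5 * sum(scales * W**2), an elongated bowl
scales = np.array([1.0, 10.0])
W = np.array([5.0, 5.0])
v_dW = np.zeros_like(W)
for _ in range(300):
    dW = scales * W                          # gradient of J at the current W
    W, v_dW = momentum_step(W, dW, v_dW)
```

The bias b would be updated in the same way with its own running average.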

32
Q

RMSprop

A
  • Exponentially weighted average of the squared gradients
  • Allows a larger learning rate to be selected
33
Q

Gradient descent with Adam. Describe Adam

A
  • Adaptive moment estimation
  • Combines momentum and RMSprop
  • β1
    • Coefficient of the first-moment exponentially weighted average (average of the gradients)
    • Typical value: 0.9
  • β2
    • Coefficient of the second-moment exponentially weighted average (average of the squared gradients)
    • Typical value: 0.999
  • ε = 10^{-8}, used in order not to divide by zero
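
A sketch of one Adam update, assuming NumPy. The bias-correction terms follow the usual formulation of Adam and are not listed on the card; the default `beta2=0.999` is the value commonly used with Adam, and the learning rate and function name are illustrative assumptions.

```python
import numpy as np

def adam_step(W, dW, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * dW           # first moment (momentum part)
    v = beta2 * v + (1.0 - beta2) * dW ** 2      # second moment (RMSprop part)
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)  # eps avoids division by zero
    return W, m, v

W = np.array([1.0, -2.0])
m = np.zeros_like(W)
v = np.zeros_like(W)
dW = 2.0 * W                                     # gradient of the toy cost sum(W**2)
W_first, m_first, v_first = adam_step(W, dW, m, v, t=1)
```

On the first step the corrected ratio is close to the sign of the gradient, so every coordinate moves by roughly `alpha` regardless of the gradient's scale.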
34
Q

Learning rate decay

A
  • In the initial steps of learning we can use a larger learning rate
  • After a while, gradient descent will jump around a minimum, and we then want a smaller learning rate
  • Common schedules, where α_0 is the initial learning rate:

α = α_0 / (1 + decayRate * epochsRun)

α = 0.95^{epochsRun} * α_0 (exponential decay)

α is divided by 2 each epoch (discrete staircase)

  • The decay rate becomes another hyperparameter
  • Manual decay
    • Manually decrease the learning rate after gradient descent has been running for a while
    • Only works when training a small number of models
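
The three schedules can be written as plain functions. This is a sketch; the function names and the example values for the initial learning rate and decay rate are illustrative.

```python
def inverse_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    """alpha = base**epoch * alpha0."""
    return alpha0 * base ** epoch

def staircase_decay(alpha0, epoch):
    """Halve the learning rate after every epoch."""
    return alpha0 / 2 ** epoch

alpha0 = 0.2
# Learning rate for epochs 0..3 with decay_rate = 1: alpha0, alpha0/2, alpha0/3, alpha0/4
schedule = [inverse_decay(alpha0, decay_rate=1.0, epoch=e) for e in range(4)]
```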
35
Q

Train/dev/test set

A
  • Split the data into different parts: train/dev/test
  • This split is used to tune the hyperparameters (e.g. learning rate) of a model
    • Hyperparameters in a neural network
      • Number of layers
      • Number of neurons
      • Activation function
      • Learning rate
      • Momentum
  • If your dataset is small, then use classical proportions such as 70/20/10
  • If your dataset is big, then use modern (big data era) proportions such as 98/1/1
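
One possible way to produce such a split is to shuffle the example indices and slice them. The sketch assumes NumPy; the function name, the seed and the default 70/20/10 fractions are illustrative.

```python
import numpy as np

def split_indices(num_examples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return shuffled index arrays for the train, dev and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_examples)
    n_train = round(fractions[0] * num_examples)
    n_dev = round(fractions[1] * num_examples)
    # whatever remains after train and dev becomes the test set
    return order[:n_train], order[n_train:n_train + n_dev], order[n_train + n_dev:]

train_idx, dev_idx, test_idx = split_indices(1000)
```

For a large dataset the same helper could be called with `fractions=(0.98, 0.01, 0.01)`.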
36
Q

Bias/variance

A
  • Bias: Performance on the training data compared to optimal performance
  • Variance: difference between loss on training and validation data
  • High variance
    • Overfitting
    • Example (assume human error ≈ 0)
      • Train error: 1%
      • Dev error: 11%
  • High bias
    • Underfitting
    • Example (assume human error ≈ 0)
      • Train error: 15%
      • Dev error: 16%
  • Low bias/low variance
    • Best case
    • Train error matches dev error
    • Example (assume human error ≈ 0)
      • Train error: 0.5%
      • Dev error: 1%
  • High bias/High variance
    • Worst case
    • The model has many parameters and is very flexible, but still fits some samples badly
    • Example (assume human error ≈ 0 in cat/dog pictures)
      • Train error: 15%
      • Dev error: 30%
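
Using the definitions at the top of this card, the two quantities can be computed from the error rates. The helper name is illustrative, and the optimal (human-level) error is assumed to be 0 as in the examples.

```python
def diagnose(train_error, dev_error, optimal_error=0.0):
    """Return (bias, variance) from error rates given as fractions."""
    bias = train_error - optimal_error   # gap between training and optimal performance
    variance = dev_error - train_error   # gap between dev and training performance
    return bias, variance

# The high-variance example: train error 1%, dev error 11%
bias, variance = diagnose(train_error=0.01, dev_error=0.11)
```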
37
Q

Basic recipe for NN

A
  • If the problem is high bias, then getting more data is not going to help
  • The bias/variance trade-off is less of an issue in neural networks
38
Q

Regularization

A
  • The regularization strength is tuned as a hyperparameter
  • L1 regularization
    • The weight vector w becomes sparse (contains many zeros)
  • L2 regularization
    • Penalizes the squared Euclidean norm of the weights
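
L2 regularization is often called weight decay. The sketch below, assuming NumPy and a penalty of (lam / (2 * m)) * ||W||^2 added to the loss, shows that the regularized gradient step equals shrinking W by the factor (1 - alpha * lam / m) and then taking the unregularized step. The names `lam`, `alpha`, `m` and `dW_data` are illustrative.

```python
import numpy as np

def l2_step(W, dW_data, alpha, lam, m):
    """Gradient step on loss + (lam / (2 * m)) * ||W||^2."""
    dW = dW_data + (lam / m) * W          # the penalty adds (lam / m) * W to the gradient
    return W - alpha * dW

def decayed_step(W, dW_data, alpha, lam, m):
    """The same update, written as a shrink of W followed by the plain step."""
    return (1.0 - alpha * lam / m) * W - alpha * dW_data

W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW_data = np.array([[0.1, 0.2], [-0.3, 0.4]])   # gradient of the unregularized loss
step_a = l2_step(W, dW_data, alpha=0.1, lam=0.7, m=10)
step_b = decayed_step(W, dW_data, alpha=0.1, lam=0.7, m=10)
```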
39
Q

Show how regularization “decays the weight”

A
40
Q

Normalizing data

A
  • Subtract the mean from each sample so the data has zero mean

mu = (1/m) * Σ_{i=1}^{m} x_i

x := x - mu

  • Divide each sample by the standard deviation

sigma^2 = (1/m) * Σ_{i=1}^{m} x_i ** 2 (** 2 means elementwise squaring)

x := x / sigma

Get mu and sigma from the training data and use these same values to also normalize the test data.

  • Why normalize
    • The scale of each input feature might differ drastically
    • Differently scaled features make the gradient descent steps oscillate
    • More circular contours of the cost function mean that gradient descent needs fewer steps to converge
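
A short sketch of input normalization in which the statistics come from the training set only and are then reused for the test set. It assumes NumPy with rows as samples; the synthetic data and helper names are illustrative.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation of the training data."""
    mu = X_train.mean(axis=0)
    sigma = np.sqrt(((X_train - mu) ** 2).mean(axis=0))
    return mu, sigma

def normalize(X, mu, sigma):
    """Zero-center and rescale each feature."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales
X_train = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(500, 2))
X_test = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(100, 2))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # reuse the training statistics
```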
41
Q

Batch normalization

A
  • Makes hyperparameter search easier
  • Makes the network much more robust to the choice of hyperparameters
    • A much bigger range of hyperparameters works well
  • Applies the same normalization used for the inputs to all z values (inputs to neurons) at each layer
  • Later layers are more robust to changes in earlier layers
  • Allows each layer to learn more independently of the other layers
  • Reduces covariate shift
    • If the distribution of X changes, the model has to be retrained, even if the ground-truth mapping X => y remains unchanged
  • Batch norm at test time
    • Estimate the mean and standard deviation separately during training, not on the test set
    • e.g. with an exponentially weighted average (or any other averaging method) across mini-batches
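
A forward-pass-only sketch of batch normalization that keeps exponentially weighted running statistics for test time. It assumes NumPy; the class name, the `momentum` and `eps` values, and the learnable scale and shift (`gamma`, `beta`) are assumptions that follow the usual formulation rather than anything stated on the card.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_units, momentum=0.9, eps=1e-5):
        self.gamma = np.ones((num_units, 1))    # learnable scale (assumed)
        self.beta = np.zeros((num_units, 1))    # learnable shift (assumed)
        self.running_mean = np.zeros((num_units, 1))
        self.running_var = np.ones((num_units, 1))
        self.momentum = momentum
        self.eps = eps

    def forward(self, Z, training):
        """Z has shape (num_units, batch_size)."""
        if training:
            mean = Z.mean(axis=1, keepdims=True)
            var = Z.var(axis=1, keepdims=True)
            # exponentially weighted averages, kept for use at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        Z_norm = (Z - mean) / np.sqrt(var + self.eps)
        return self.gamma * Z_norm + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_units=3)
Z_train = rng.normal(loc=4.0, scale=2.0, size=(3, 64))
out_train = bn.forward(Z_train, training=True)
out_test = bn.forward(Z_train[:, :1], training=False)   # a single example at test time
```

At test time a single example can be normalized because the statistics no longer come from the current batch.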
42
Q

Explain informally how regularization reduces variance

A
43
Q

Give reasons why Image classification is so hard

A
  • Hidden parts
  • Deformable objects
    • Viewing objects from a weird angle
44
Q

Define gradient descent, and show how to update the weights

A
  • Computes the gradient of the loss and updates the weights in the opposite direction of the gradient.
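
The update can be sketched as w <- w - learning_rate * gradient. The example below applies it to an illustrative one-dimensional loss; the function name, learning rate and step count are assumptions.

```python
def gradient_descent(grad_fn, w0, learning_rate=0.1, num_steps=100):
    """Repeatedly step against the gradient of the loss."""
    w = w0
    for _ in range(num_steps):
        w = w - learning_rate * grad_fn(w)   # move opposite to the gradient
    return w

# Illustrative loss L(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```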
45
Q

Gradient descent. How does each step change the cost function?

A
46
Q

Forward mode differentiation

A
47
Q

Backward-mode differentiation

A
  • Also known as backprop
48
Q
A
49
Q
A
50
Q
A
51
Q

Exponentially weighted average

A

.

52
Q

Bias correction

A
  • Associated with exponentially weighted average
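
A sketch of an exponentially weighted average with optional bias correction, using the usual form v_t = beta * v_{t-1} + (1 - beta) * x_t and the correction v_t / (1 - beta^t). The function name and the constant input series are illustrative.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, optionally bias-corrected."""
    v = 0.0
    averages = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x
        # dividing by (1 - beta**t) compensates for starting from v = 0
        averages.append(v / (1.0 - beta ** t) if bias_correction else v)
    return averages

raw = ewa([10.0, 10.0, 10.0], bias_correction=False)
corrected = ewa([10.0, 10.0, 10.0], bias_correction=True)
```

Without correction the early averages are biased toward zero (the first value is 1.0 for a constant input of 10.0); with correction they match the input from the first step.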
53
Q

Backpropagation

A

In back-propagation, the gradient of the loss is computed with respect to all variables in the function such that parameters can be updated using gradient descent.

54
Q
A
  • The first is not correct, the number of paths would grow exponentially
  • The second one is correct
55
Q

Adam optimization

A

The first and fourth are correct

56
Q

Does the runtime for performing backpropagation depend on the number of samples?

A

Yes. We have to take the gradient of each term separately

57
Q
A

Quickly