DL Fundamentals Flashcards

1
Q

Representation learning

A

Engineering representations by hand is hard: it requires both technical and domain expertise

  • Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification (LeCun et al., 2015)
2
Q

List some activation functions

A

  • Sigmoid (logistic): g(z) = 1 / (1 + e^(-z))
  • Tanh: g(z) = tanh(z)
  • ReLU: g(z) = max(0, z)
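A minimal Python sketch of the logistic sigmoid together with two other standard activations, tanh and ReLU (function names are mine):

```python
import math

def sigmoid(z):
    """Logistic sigmoid: g(z) = 1 / (1 + e^(-z)); output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent; output in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: g(z) = max(0, z)."""
    return max(0.0, z)
```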

3
Q

Loss functions

A
  1. Squared error
  2. Log loss

4
Q

Squared error loss function

A

1/2 * (M(d) - t)^2
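In Python (function name is mine); the factor of 1/2 is there so the derivative with respect to the prediction is simply M(d) - t:

```python
def squared_error(prediction, target):
    """Squared error for one instance: 1/2 * (M(d) - t)^2."""
    return 0.5 * (prediction - target) ** 2
```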

5
Q

Log loss function

A

-(t * log(M(d)) + (1 - t) * log(1 - M(d)))
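In Python (the function name and the eps clamp guarding log(0) are my additions):

```python
import math

def log_loss(prediction, target, eps=1e-12):
    """Binary cross-entropy for one instance: -(t*log(p) + (1-t)*log(1-p))."""
    p = min(max(prediction, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
```

A confident correct prediction (p near t) gives a loss near 0; p = 0.5 gives log 2 for either target.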

6
Q

Difference between loss and cost function

A
  • Loss function = measure of the prediction error on a single training instance
  • Cost function = measure of the average prediction error across a set of training instances
  • Cost functions allow us to add in regularization
7
Q

Gradient Descent Algorithm

A
  1. Choose random weights
  2. Until convergence
    • Set all gradients to zero
    • For each training instance
      • Calculate the model output
      • Calculate loss
      • Update gradient sum for each weight and bias
    • Update weights and bias using the weight update rule
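The loop above, sketched in Python for a one-input linear model M(d) = w*d + b trained with squared error (the function name, learning rate, and fixed epoch count standing in for "until convergence" are all my assumptions):

```python
import random

def gradient_descent(data, targets, alpha=0.1, epochs=500):
    """Batch gradient descent for the linear model M(d) = w*d + b."""
    random.seed(0)
    w, b = random.random(), random.random()   # 1. choose random weights
    m = len(data)
    for _ in range(epochs):                   # 2. "until convergence" (fixed epochs here)
        dw, db = 0.0, 0.0                     # set all gradients to zero
        for d, t in zip(data, targets):       # for each training instance
            out = w * d + b                   # calculate the model output
            err = out - t                     # loss derivative of 1/2*(out - t)^2
            dw += err * d                     # update gradient sum for each weight...
            db += err                         # ...and bias
        w -= alpha * dw / m                   # weight update rule (averaged gradients)
        b -= alpha * db / m
    return w, b
```

Fitting points drawn from t = 2d + 1 should recover w ≈ 2, b ≈ 1.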
8
Q

Backpropagation Algorithm

A
  • Werbos (1974)
  • Not widely used until 1986
  1. Initialize the weights to small random values (e.g. scaled by fan-in)
  2. Feedforward phase - Feed input data through the network from the inputs to the outputs
  3. Update the training error for the network (based on target values for all output nodes)
  4. Error propagation phase - Feed error values back through the network, adjusting the weights along the way
  5. Repeat from 2 until the error values are sufficiently small or some other stopping condition
9
Q

Forward pass algo

A

Require: L, network depth
Require: W[l], l ∈ {1…L}, weight matrices for each layer
Require: b[l], l ∈ {1…L}, bias terms for each layer
Require: d, input descriptive features
Require: t, target features

a[0] = d
for l = 1 to L:
    z[l] = W[l]*a[l-1] + b[l]
    a[l] = g[l](z[l])

M(d) = a[L]
Calculate L(M(d), t)
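A runnable NumPy version of the algorithm above (function and variable names are mine; W, b, and g are lists holding each layer's weight matrix, bias vector, and activation function):

```python
import numpy as np

def forward_pass(W, b, g, d):
    """a[0] = d; for l = 1..L: z[l] = W[l] @ a[l-1] + b[l], a[l] = g[l](z[l])."""
    a, z = [d], [None]                  # a[0] = d; z has no layer-0 entry
    for W_l, b_l, g_l in zip(W, b, g):  # l = 1 to L
        z.append(W_l @ a[-1] + b_l)     # z[l] = W[l] * a[l-1] + b[l]
        a.append(g_l(z[-1]))            # a[l] = g[l](z[l])
    return a, z                         # M(d) = a[-1]; loss computed from a[-1] and t
```

For example, a single sigmoid unit with weights (0.5, -0.5), zero bias, and input (1, 1) outputs sigmoid(0) = 0.5.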
10
Q

Backward Propagation algo

A

Require: A forward pass of the network (the stored a[l] and z[l])
Require: L, network depth
Require: W[l], l ∈ {1…L}, weight matrices for each layer
Require: b[l], l ∈ {1…L}, bias terms for each layer
Require: t, target features

Calculate da[L]  # derivative of the loss function w.r.t. a[L]
for l = L to 1:
    dz[l] = da[l] * g[l]'(z[l])
    dW[l] = dz[l] * a[l-1]^T
    db[l] = dz[l]
    da[l-1] = W[l]^T * dz[l]
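The loop above in NumPy (function and variable names are mine; g_prime holds the activation derivatives, and daL is the loss derivative with respect to the final activation, e.g. a[L] - t for squared error):

```python
import numpy as np

def backward_pass(W, a, z, g_prime, daL):
    """Given the stored a[l], z[l] from a forward pass and daL,
    compute dW[l] and db[l] for l = L..1."""
    L = len(W)
    dW, db = [None] * L, [None] * L
    da = daL
    for l in range(L, 0, -1):               # l = L .. 1 (1-indexed layers)
        dz = da * g_prime[l - 1](z[l])      # dz[l] = da[l] * g[l]'(z[l])
        dW[l - 1] = np.outer(dz, a[l - 1])  # dW[l] = dz[l] * a[l-1]^T
        db[l - 1] = dz                      # db[l] = dz[l]
        da = W[l - 1].T @ dz                # da[l-1] = W[l]^T * dz[l]
    return dW, db
```

For a single sigmoid layer with squared error, g'(z) = σ(z)(1 - σ(z)), so at z = 0 with prediction 0.5 and target 1 the weight gradient is -0.5 · 0.25 · a[0].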
11
Q

Stochastic/Online GD

A
  • Choose random weights
  • Until convergence
    • Shuffle all training instances
    • For each training instance:
      • Perform f/w pass
      • Perform b/w pass
      • Update weights and biases using update rule
12
Q

Stochastic GD Update Rule

A
W[i] = W[i] - αdW[i]
b[i] = b[i] - αdb[i]
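The stochastic loop from the previous card combined with this per-instance update rule, for a one-input linear model M(d) = w*d + b with squared error (names, seed, and epoch count are my choices):

```python
import random

def sgd(data, targets, alpha=0.05, epochs=300):
    """Stochastic/online GD: shuffle each epoch, update after every instance."""
    random.seed(1)
    w, b = random.random(), random.random()   # choose random weights
    idx = list(range(len(data)))
    for _ in range(epochs):
        random.shuffle(idx)                   # shuffle all training instances
        for i in idx:
            err = (w * data[i] + b) - targets[i]  # f/w pass + squared-error derivative
            dw, db = err * data[i], err           # b/w pass for this single instance
            w -= alpha * dw                       # W[i] = W[i] - α dW[i]
            b -= alpha * db                       # b[i] = b[i] - α db[i]
    return w, b
```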
13
Q

GD Batch

A
  • Choose random weights
  • Until convergence
    • Set all gradient sums to 0
    • For each training instance:
      • Perform f/w pass
      • Perform b/w pass
      • Update gradient sum for each weight and bias term
    • Update weights and biases using update rule
14
Q

Batch GD Update rule

A
W[i] = W[i] - α(1/m)Σ(from j=1 to m) dW[i]_j
b[i] = b[i] - α(1/m)Σ(from j=1 to m) db[i]_j
15
Q

GD Mini-batch

A
  • Choose random weights
  • Until convergence
    • Divide the training set into mini-batches of size s
    • For each mini-batch D(mb)
      • Set all gradient sums = 0
      • For each training instance in D(mb)
        • Perform f/w pass
        • Perform b/w pass
        • Update gradient sum for each weight and bias term
      • Update weights and bias terms using the update rule
16
Q

GD Mini-batch update rule

A
W[i] = W[i] - α(1/s)Σ(from j=1 to s) dW[i]_j
b[i] = b[i] - α(1/s)Σ(from j=1 to s) db[i]_j
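The mini-batch loop from the previous card combined with this update rule, again for a one-input linear model with squared error (names and hyper-parameter values are my choices):

```python
import random

def minibatch_gd(data, targets, s=2, alpha=0.1, epochs=200):
    """Mini-batch GD: average the s per-instance gradients, update once per batch."""
    random.seed(2)
    w, b = random.random(), random.random()
    for _ in range(epochs):
        for start in range(0, len(data), s):     # divide into mini-batches of size s
            batch = range(start, min(start + s, len(data)))
            dw, db = 0.0, 0.0                    # set all gradient sums to 0
            for i in batch:                      # accumulate over the mini-batch
                err = (w * data[i] + b) - targets[i]
                dw += err * data[i]
                db += err
            w -= alpha * dw / len(batch)         # W[i] = W[i] - α(1/s) Σ dW[i]_j
            b -= alpha * db / len(batch)
    return w, b
```

Setting s = 1 recovers stochastic GD (without shuffling); s = len(data) recovers batch GD.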
17
Q

Stochastic GD - Advantages

A
  • Easy to implement
  • Fast learning

18
Q

Stochastic GD - Disadvantages

A

  • Noisy gradient signal
  • Computationally expensive

19
Q

Batch GD - Advantages

A
  • Computationally efficient
  • Stable gradient signal

20
Q

Batch GD - Disadvantages

A
  • Requires gradient accumulation
  • Premature convergence
  • Involves loading large datasets into memory, and thus can be slow
21
Q

Mini-batch GD - Advantages

A
  • Relatively computationally efficient
  • Does not require full datasets to be loaded into memory
  • Stable gradient signal
22
Q

Mini-batch GD - Disadvantages

A
  • Gradient accumulation
  • Another hyper-parameter: mini-batch size

23
Q

Talk about representation learning in the context of classification tasks

A
  • Higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations

LeCun, Bengio & Hinton Nature paper: a key aspect of deep learning is that layers of features are not designed by human engineers; they are learned from data using a general-purpose learning procedure