Exam Flashcards

1
Q

How can we create a classifier if we have a large dataset that is only partially labeled?

A
  1. Train an autoencoder on the unlabeled data; the latent vectors will represent learned features.
  2. Remove the decoder and train a classifier on the latent vectors from the encoder. The classifier and encoder can also be fine-tuned jointly.
2
Q

How can the L2 regularizer be interpreted in:

1) A MAP setting?
2) A geometric setting?
3) As noise?

A

1) As a Gaussian prior on the weights
2) As attenuation of the weights along directions with small eigenvalues of the Hessian
3) As Gaussian noise added to the input data, x’ = x + e where e is Gaussian.

3
Q

For sigmoid and tanh activations, how can we explain that small weights reduce the model capacity?

A

The activation functions operate more in their near-linear region, so the network behaves more like a linear model.

4
Q

What are the effects of these regularizers:

1) Early stopping
2) Dropout
3) Batch Normalization

A

1) Shrinks the weights, similar to L2 regularization
2) Acts as ensemble learning (an implicit ensemble of sub-networks)
3) Reduces internal covariate shift

5
Q

For a random variable x_n approximating a random variable x:

1) What does it mean that x_n is unbiased?
2) What does it mean that x_n is asymptotically unbiased?
3) What is the MSE in terms of bias and variance?
4) What does it mean that x_n is consistent?
5) What does it mean that x_n is efficient?

A

1) E[x_n - x] = 0
2) lim_{n -> inf} E[x_n - x] = 0
3) MSE = bias^2 + variance
4) lim_{n -> inf} P(|x_n - x| > e) = 0 for every e > 0
5) No other estimator has a lower MSE as n -> inf.

6
Q

What is the formula for the optimal parameters in linear regression?

A

w = (X^TX)^{-1}X^Ty
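A minimal NumPy sketch of this formula (the data X and y below are made up for illustration; in practice a least-squares solver is preferred over forming X^T X explicitly):

```python
import numpy as np

# Toy data: n = 100 samples, d = 3 features (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Closed-form solution of the normal equations: w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```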

7
Q

What is:

1) Estimation error or Generalization gap?
2) Approximation error?
3) Optimization error?

A

1) Error from using the empirical loss instead of the expected loss
2) Error from the restriction of the model class
3) Error from not finding the exact global optimum

8
Q

What is a sufficient condition for a point being a local minimum?

A

The gradient is zero and the Hessian is positive definite. (If the function is convex, a zero gradient alone suffices, and the point is in fact a global minimum.)

9
Q

What is the definition of:

1) A convex set?
2) A convex function?

A

1) For all x, y in A, lx + (1-l)y is also in A for every l in [0,1].
2) f(l*x_1 + (1-l)*x_2) <= l*f(x_1) + (1-l)*f(x_2) for all l in [0,1].
Or equivalently, if the epigraph is convex.

10
Q

If a function L has Hessian H and:

1) Strong convexity, so H > mI
2) Lipschitz gradient, so H < MI
3) we use a step size a = 2/(m+M)

What can we say about the convergence rate?

A

The convergence is linear with rate c = 1 - m/M, i.e. the error contracts by at least a factor c per iteration.

11
Q

What does it mean that something has linear convergence, and what is the overall complexity of getting the error smaller than e with a linear convergence rate?

A

Linear convergence means we need O(log(1/e)) steps to get L(x_k) - L(x*) < e.

With n samples per full-gradient step, the overall complexity is O(n log(1/e)).

12
Q

What is the complexity of getting the expected error smaller than e with one-sample stochastic gradient descent?

A

O(1/e) iterations (sublinear convergence); since each iteration touches a single sample, the total work is also O(1/e), independent of n.

13
Q

In stochastic gradient descent, what is the tradeoff between:

1) a small batch size vs. a large batch size
2) a small step size vs. a large step size

A

1) Small batch size:
- Cheaper iterations, but stalls at less accurate results.
- For non-convex optimization, can escape local minima more easily.
2) Small step size:
- Slower initial convergence, but a better final result.
- For non-convex optimization, harder to escape local minima.

14
Q

How can we decrease

1) Approximation error?
2) Estimation error?
3) Optimization error?

A

1) Use a more expressive model (for example, less regularization).
2) Use a less expressive model (for example, more regularization), or use a larger sample size.
3) Run more iterations.

15
Q

How does the number of samples needed to approximate a Lipschitz function to accuracy e increase with the number of dimensions, d, of the input?

A

O((1/e)^d)

16
Q

What are the ideas behind:

1) SGD with momentum
2) Adagrad
3) RMSProp (or AdaDelta)
4) Adam

A

1) Add a fraction of the previous update (an exponentially decaying sum of former gradients) at each step.
2) Divide each gradient component elementwise by the accumulated magnitude (root of the sum of squares) of all former gradients.
3) Same as Adagrad, but with a decaying average instead of a full sum.
4) A combination of 1) and 3).
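A rough NumPy sketch of one Adam step, showing how the momentum term of 1) combines with the RMSProp-style decaying average of 3); the hyperparameter defaults are the commonly used ones, not taken from the card:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t counts steps starting at 1)."""
    m = b1 * m + (1 - b1) * grad        # first moment: momentum over gradients
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: decaying avg of g^2
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```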

17
Q

How do you get the circulant matrix C(x) from a vector x?

A

By placing successive circular shifts of x in the rows (or columns) of C(x).

18
Q

How can a convolution be described by circulant matrices?

A

x *’ y = C(x)y, i.e. circular convolution with x is multiplication by the circulant matrix C(x).
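A small NumPy check of the last two cards, assuming *’ denotes circular convolution: C(x) is built from shifted copies of x (placed in the columns here; the row convention gives the transpose), and C(x)y matches the circular convolution computed via the FFT:

```python
import numpy as np

def circulant(x):
    """Circulant matrix whose k-th column is x circularly shifted by k.
    (scipy.linalg.circulant builds the same matrix.)"""
    n = len(x)
    return np.array([np.roll(x, k) for k in range(n)]).T

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, -1.0, 2.0, 0.0])

conv_matrix = circulant(x) @ y
# Circular convolution via the DFT (convolution theorem).
conv_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)).real

assert np.allclose(conv_matrix, conv_fft)
```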

19
Q

What are some important properties of circulant matrices?

A

1) Commutativity
2) Associativity
3) The product of circulant matrices is circulant

These can be proven by using that circulant matrices are equivalent to convolutions.

20
Q

What is a shift matrix, and name two important properties of shift matrices.

A

A shift matrix is an identity matrix with its rows shifted circularly by one position.

1) S has orthogonal eigenvectors
2) S^TS = SS^T = I

21
Q

What is the connection between circulant and shift matrices?

A

A matrix X is circulant iff it commutes with the shift matrix, SX = XS.

22
Q

Is convolution shift invariant or equivariant?

A

Equivariant: shifting the input and then convolving gives the same result as convolving and then shifting the output.

23
Q

What is the discrete fourier transform, and how is it connected to the shift matrix eigenvectors?

A

The discrete Fourier transform of a vector x is Vx, where V is the matrix with columns

v_k = (1/sqrt(n)) [1, e^(2*pi*i*k/n), ..., e^(2*pi*i*k*j/n), ..., e^(2*pi*i*k*(n-1)/n)]

V is the matrix of eigenvectors of the shift matrix.
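A quick NumPy sanity check of this connection (a sketch; the sign of the exponent is a convention that differs between texts):

```python
import numpy as np

n = 6
idx = np.arange(n)
# Fourier matrix: entry (j, k) is exp(2*pi*i*j*k/n) / sqrt(n).
V = np.exp(2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)

# Circular shift matrix: S @ x shifts the entries of x by one position.
S = np.roll(np.eye(n), 1, axis=0)

# The columns of V are eigenvectors of S, so V^H S V is diagonal.
D = V.conj().T @ S @ V
assert np.allclose(D, np.diag(np.diag(D)))
```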

24
Q

What is a physical interpretation of the Fourier basis?

A

It is the basis that minimizes the Dirichlet energy x^T Z x, where Z is the second-order difference matrix (the circulant matrix with shifted rows of [1, -2, 1]).

25
Q

What is the idea behind FFT?

A

Split the vector into its odd and even indices and perform a Fourier transform on each half separately. This can be done recursively and reduces the computational complexity from O(n^2) to O(n log n).
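A minimal recursive radix-2 sketch of that idea, assuming the length is a power of two (using NumPy's sign convention for the exponent):

```python
import numpy as np

def fft(x):
    """Cooley-Tukey FFT: split into even/odd indices and recurse."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.rand(8)
assert np.allclose(fft(x), np.fft.fft(x))
```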

26
Q

When are two matrices, A and B, jointly diagonalizable (diagonalized by the same set of eigenvectors)?

A

Iff they commute AB = BA

27
Q

How can we do 2D shifting on an nxn matrix A?

A

1) Stack the columns of A into a vector.
2) Extend the shift matrix to n^2 x n^2.
3) Multiply, and reshape the result back into a matrix.

28
Q

How can we do a 2D Fourier transform?

A

Apply the Fourier transform row-wise, then column-wise (the 2D transform is separable).

29
Q

What is the number of multiplications for a convolution if the input image is 128x128x3, the filter bank has size 5x5x10 (ten 5x5 filters spanning all 3 input channels), with no padding and stride 1?

A

Multiplications: 124 * 124 * 3 * 10 * 5 * 5 = 11,532,000 (the output has size 124x124x10).
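The same count spelled out in Python:

```python
h, w, c_in = 128, 128, 3   # input image
k, c_out = 5, 10           # 5x5 kernels, 10 output channels
h_out = h - k + 1          # 124 (no padding, stride 1)
w_out = w - k + 1          # 124

mults = h_out * w_out * k * k * c_in * c_out
print(mults)               # 11532000
```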

30
Q

How can we get approximate shift/ deformation invariance in convolutional networks?

A

Use pooling

31
Q

What do we mean by factorized convolutions?

A

Use several layers with small kernels (3x3) instead of one large layer.

32
Q

What types of unsupervised learning do we have?

A

1) Density estimation
2) Clustering
3) Feature learning
4) Dimensionality reduction
5) Data generation

33
Q

What is kernel density estimation?

A

It places a kernel (e.g. a Gaussian bump) on each data point and averages them, creating a smoothed histogram of the data points.
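A one-dimensional Gaussian KDE sketch in NumPy (the bandwidth h is a hyperparameter, chosen arbitrarily here):

```python
import numpy as np

def kde(query, samples, h=0.3):
    """Gaussian kernel density estimate: average a bump centred on each sample."""
    diffs = (query[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / h

samples = np.random.randn(200)        # made-up data
grid = np.linspace(-4, 4, 100)
density = kde(grid, samples)          # smoothed "histogram" over the grid
```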

34
Q

What is the general GAN architecture?

A

z -> (Gen) -> x’

x’ and x -> Disc -> Classification (Real/ Fake)

35
Q

What is the general cGAN architecture?

A

[z, c’] -> Gen -> x’

[x’, c’] and [x, c] -> Disc -> Classification (Real/ Fake)

36
Q

What is the GAN Loss?

A

For the discriminator (maximize):
sum log(D(x)) + log(1 - D(G(z)))
For the generator (non-saturating loss, maximize):
sum log(D(G(z)))
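A minimal NumPy version of the two objectives, written as losses to minimize (using the non-saturating generator loss; d_real and d_fake are illustrative names for the discriminator's outputs on real and generated batches):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Minimize -[log D(x) + log(1 - D(G(z)))], averaged over the batch."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: minimize -log D(G(z))."""
    return -np.mean(np.log(d_fake))
```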

37
Q

Describe the algorithm for training GANs.

A

1) Draw random batch from real samples
2) Draw random z batch and create fake samples G(z)
3) Do gradient descent on discriminator on both batches

4) Run discriminator on samples from generator
5) Do gradient descent on generator

38
Q

Name some architecture guidelines that are usually an advantage for GANs.

A

1) Use ReLU in the generator and leaky ReLU in the discriminator.
2) Use a fully convolutional network.
3) Use batch norm in all layers except the generator's output layer and the discriminator's input layer.
4) Use tanh on the generator output.
5) Replace pooling with strided and fractionally strided convolutions.

39
Q

What is an adversarial autoencoder?

A

A normal autoencoder where we also have a discriminator on the latent z. The discriminator can force the z-distribution to be similar to a pre-chosen distribution.

40
Q

How can we train a network for image-to-image translation (sketch to image, ...)?

A

Use a GAN. Feed the sketch to the generator, and feed pairs of [sketch, generated image] and [sketch, true image] to the discriminator.

41
Q

Regarding RNNs, name at least one example of:

1) Many to one problems
2) One to many problems
3) Many to Many problems

A

1) Sentiment analysis
2) Image captioning
3) Translation

42
Q

What is:

1) Lateral feedback?
2) Indirect feedback?
3) Direct feedback?

A

1) Feedback to other cells in the same layer
2) Feedback to cells in an earlier layer
3) Feedback from the output to the input of the same cell

43
Q

What is a Hopfield network?

A

A recurrent network with symmetric connections between all cells.

44
Q

How can we make it easier to train deep RNNs?

A

Batch normalization, dropout, bi-directional RNN.

45
Q

What is the best way to deal with vanishing gradients in RNNs?

A

Change from the basic RNN cell to an LSTM or GRU cell.

46
Q

In an LSTM, what is the purpose of:

1) The forget gate
2) The input gate
3) The output gate

A

1) f = sigmoid(W_f [h, x] + b_f) is multiplied elementwise with the previous cell state, so it can reduce the importance of elements of that state.
2) i = sigmoid(W_i [h, x] + b_i) is multiplied elementwise with the tanh of the candidate update, controlling what is added to the cell state.
3) o = sigmoid(W_o [h, x] + b_o) is multiplied elementwise with the tanh of the cell state, controlling what is sent to the output.

47
Q

What is the formula for the new LSTM cell state?

A

c = f * c’ + i * g, where * is elementwise multiplication, g = tanh(W_g [h’, x] + b_g), and ’ indicates the previous time step (so c’ is the previous cell state and h’ the previous hidden state).
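Putting the gate and cell-state formulas of the last two cards together, a minimal NumPy LSTM step (weight names and shapes are illustrative, not from the cards):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_g, b_f, b_i, b_o, b_g):
    z = np.concatenate([h_prev, x])   # [h, x]
    f = sigmoid(W_f @ z + b_f)        # forget gate
    i = sigmoid(W_i @ z + b_i)        # input gate
    o = sigmoid(W_o @ z + b_o)        # output gate
    g = np.tanh(W_g @ z + b_g)        # candidate cell update
    c = f * c_prev + i * g            # new cell state (elementwise)
    h = o * np.tanh(c)                # new hidden state / output
    return h, c
```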

48
Q

What are peephole connections?

A

connections from previous cell state to the input of all gates.

49
Q

What gates does the GRU have?

A

Reset gate, and update gate.

50
Q

Name two ways we can implement many to one.

A

We can just use the last output or we can use mean/max/sum over all outputs.

51
Q

When would we use CTC (Connectionist temporal classification)?

A

When we have an input where we don’t want to label each “frame”, but rather the whole input at once, for example audio-to-text.

52
Q

What is the CTC loss?

A

L = -ln sum Pr(p|X), summed over all paths p that collapse to the correct output.

53
Q

How is the CTC loss calculated?

A

Using the forward-backward algorithm. If a(s) denotes the total probability of all paths up to node s, and b(s) the total probability of all paths from s onward, then L = -ln sum a(s)b(s), summed over all nodes s on a correct path (at a fixed time step).
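For intuition, a brute-force version of this sum over paths for a tiny example; the forward-backward algorithm computes the same quantity efficiently (alphabet size, number of frames and the target below are made up):

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """CTC collapse: merge repeated symbols, then drop blanks."""
    merged = [p for p, _ in itertools.groupby(path)]
    return tuple(p for p in merged if p != blank)

def ctc_loss_bruteforce(probs, target, blank=0):
    """probs: (T, K) per-frame label probabilities; target: label sequence."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == tuple(target):
            total += np.prod([probs[t, path[t]] for t in range(T)])
    return -np.log(total)

# 4 frames, 3 symbols (0 = blank), target sequence [1, 2].
probs = np.random.dirichlet(np.ones(3), size=4)
print(ctc_loss_bruteforce(probs, [1, 2]))
```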

54
Q

What is the problem with choosing the “best” path in CTC, and how can it be solved?

A

Choosing the most probable path does not necessarily give the most probable output. This can be improved by beam search, prefix search decoding and constrained decoding.

55
Q

How can we solve problems with a high input dimension at each timestep and several timesteps?

A

CNN + RNN

56
Q

What is BPTT (backpropagation through time)?

A

In BPTT we run the forward phase for the entire sequence (over time) without updating the weights. We then calculate the loss of all outputs of the RNN and let the gradient propagate back to earlier states (through time).

57
Q

What is sequence representation learning with RNNs?

A

Use an RNN “encoder” to get a single output from a sequence, and an RNN “decoder” to turn that output into a sequence. One can also use “attention”: a weighted combination of the outputs of each encoder cell that is used as additional input to each decoder cell.

58
Q

In the taxonomy of generative models, how is the:

1) VAE classified?
2) GAN classified?

A

1) Generative model -> Explicit density -> approximate density -> Variational
2) Generative model -> Implicit density -> Direct

59
Q

What is the difference between a discriminative and a generative model?

A

A discriminative model attempts to estimate p(y|x), the probability of the target given the data.
A generative model attempts to estimate the joint distribution p(y, x), which can then be used to sample from.

Both models can be used for both regression and classification problems.

60
Q

What is the optimal discriminator value in GANs?

A

D*(x) = p_data(x) / (p_G(x) + p_data(x))