Exam Flashcards
How can we create a classifier if we have a large dataset that is only partially labeled?
- Train an autoencoder on the unlabeled data; the latent vector will represent features.
- Remove the decoder and train a classifier on the latent vectors from the encoder. The classifier and encoder can be trained jointly.
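A minimal two-phase sketch in PyTorch, assuming flattened 784-dim inputs, a 32-dim latent space, 10 classes, and hypothetical `unlabeled_loader`/`labeled_loader` data loaders:

```python
# Sketch only: dimensions and loader names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
classifier = nn.Linear(32, 10)

# Phase 1: train encoder + decoder on the UNLABELED data (reconstruction loss).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
for x in unlabeled_loader:               # hypothetical loader of unlabeled batches
    loss = F.mse_loss(decoder(encoder(x)), x)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: drop the decoder; train the classifier on the latent vectors,
# fine-tuning the encoder jointly, on the small LABELED subset.
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()))
for x, y in labeled_loader:              # hypothetical loader of labeled batches
    loss = F.cross_entropy(classifier(encoder(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
```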
How can the L2 regularizer be interpreted in:
1) A MAP setting?
2) A geometric setting?
3) As noise?
1) As a Gaussian prior on the weights
2) Attenuation of the weights along directions with small eigenvalues of the Hessian
3) Noise added to the input data, x' = x + e where e is Gaussian.
For sigmoid and tanh, how can we explain that small weights reduce the model capacity?
The activation functions will operate mostly in their (near-)linear region around zero, so the layers behave more like linear maps.
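A tiny NumPy check of the claim (interval chosen arbitrarily):

```python
# For small pre-activations (small weights), tanh is nearly the identity,
# so stacked layers collapse toward a single linear map.
import numpy as np

z = np.linspace(-0.1, 0.1, 5)
print(np.max(np.abs(np.tanh(z) - z)))   # ~3e-4: tanh(z) is close to z near zero
```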
What are the effects of these regularizers:
1) Early stopping
2) Dropout
3) Batch Normalization
1) Restricts the weights, similar to L2 regularization
2) Ensemble learning (an implicit ensemble of subnetworks)
3) Reduces internal covariate shift
For a random variable x_n approximating a random variable x:
1) What does it mean that x_n is unbiased?
2) What does it mean that x_n asymptotically unbiased?
3) What is the MSE in terms of bias and variance?
4) What does it mean that x_n is consistent?
5) What does it mean that x_n is efficient?
1) E[x_n - x] = 0
2) lim_{n -> inf} E[x_n - x] = 0
3) MSE = bias^2 + variance
4) lim_{n -> inf} P(|x_n - x| > e) = 0
5) No other estimator has a lower MSE as n -> inf.
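A quick Monte Carlo check of 3), using the sample mean of n = 5 draws from N(2, 1) as the estimator (a sketch; the constants are arbitrary):

```python
# MSE of the sample mean should equal bias^2 + variance (here about 1/n = 0.2).
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 2.0, 5, 200_000
est = rng.normal(mu, 1.0, size=(trials, n)).mean(axis=1)   # one x_n per trial
bias = est.mean() - mu
var = est.var()
mse = ((est - mu) ** 2).mean()
print(mse, bias**2 + var)               # both close to 0.2, and equal to each other
```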
What is the formula for the optimal parameters in linear regression?
w = (X^TX)^{-1}X^Ty
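A minimal NumPy sketch on toy data; `np.linalg.solve` on the normal equations is used instead of forming the inverse explicitly, which is the numerically preferable route:

```python
# Solve (X^T X) w = X^T y on synthetic data with known weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                             # close to [1.0, -2.0, 0.5]
```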
What is:
1) Estimation error or Generalization gap?
2) Approximation error?
3) Optimization error?
1) Error from using the empirical loss instead of the expected loss
2) Error from restricting the model class
3) Error from not finding the exact global optimum
What is a sufficient condition for a point being a local minimum?
The gradient is zero and the Hessian is positive definite. (Alternatively, if the function is convex, a zero gradient alone suffices, and the point is then a global minimum.)
What is the definition of:
1) A convex set?
2) A convex function?
1) For all x, y in A, lx + (1-l)y is also in A for l in [0,1].
2) f(lx_1 + (1-l)x_2) <= lf(x_1) + (1-l)f(x_2) for l in [0,1].
Or equivalently, if the epigraph is convex.
If a function L has Hessian H and:
1) Strong convexity, so H >= mI
2) Lipschitz gradient, so H <= MI
3) we use a step size a = 2/(m+M)
What can we say about the convergence rate?
Linear convergence with rate c = 1 - m/M, i.e., the error shrinks by at least a constant factor c per step.
What does it mean that something has linear convergence, and what is the overall complexity of getting the error smaller than e with a linear convergence rate?
Linear convergence means the error shrinks by a constant factor each iteration, so we need O(log(1/e)) steps to get L(x_k) - L(x*) < e.
With a full gradient costing O(n) per step, the overall complexity is O(n log(1/e)).
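A small sketch of linear convergence: gradient descent on a quadratic with Hessian eigenvalues m = 1 and M = 10 and step size a = 2/(m+M). The error norm contracts by the same constant factor every step (here (M-m)/(M+m) = 9/11, within the stated rate c = 1 - m/M):

```python
# Gradient descent on L(x) = 0.5 x^T H x; grad L(x) = H x.
import numpy as np

m, M = 1.0, 10.0
H = np.diag([m, M])                     # Hessian with eigenvalues m and M
a = 2.0 / (m + M)
x = np.array([1.0, 1.0])
prev = np.linalg.norm(x)
for k in range(5):
    x = x - a * (H @ x)                 # one gradient step
    cur = np.linalg.norm(x)
    print(cur / prev)                   # constant 9/11 = 0.818... each step
    prev = cur
```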
What is the complexity of getting the expected error smaller than e with one-sample stochastic gradient descent?
O(1/e): the convergence is sublinear, but each iteration costs O(1), independent of n.
In stochastic gradient descent, what is the tradeoff between:
1) small batch size vs. large batch size
2) small step size vs. large step size
1) Small batch size:
- Cheaper iterations, but stalls at less accurate results.
- For non-convex optimization, can escape local minima more easily.
2) Small step size:
- Slower initial convergence, but a better final result.
- For non-convex optimization, harder to escape local minima.
How can we decrease
1) Approximation error?
2) Estimation error?
3) Optimization error?
1) Use a more expressive model (for example, less regularization)
2) Use a less expressive model (for example, more regularization), or use a larger sample size
3) Run more iterations
How does the number of samples needed to approximate a Lipschitz function to e accuracy increase with the number of dimensions, d, of the input?
O((1/e)^d)
What are the ideas behind:
1) SGD with momentum
2) Adagrad
3) RMSProp (or AdaDelta)
4) Adam
1) Add a fraction of the former gradients at each step
2) Divide the gradient elementwise by the accumulated magnitude of all former gradients
3) Same as Adagrad, but with a decaying average.
4) Combination of 1) and 3), plus bias correction.
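A sketch of a single Adam step in NumPy, combining the momentum of 1) with the decaying squared-gradient average of 3), plus bias correction; `adam_step` is an illustrative helper and the defaults follow the Adam paper:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # decaying average of gradients (momentum)
    v = b2 * v + (1 - b2) * grad**2          # decaying average of squared gradients
    m_hat = m / (1 - b1**t)                  # bias correction (moments start at zero)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```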
How do you get the circulant matrix C(x) from a vector x?
By shifting x over the rows of C(x).
How can a convolution be described by circulant matrices?
x *’ y = C(x)y
What are some important properties of circulant matrices?
1) Commutativity
2) Associativity
3) The product of circulants is circulant
These can be proven by using that circulant matrices are equivalent to convolutions.
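A small NumPy sketch that builds C(x) from shifted copies of x, then checks the convolution identity and commutativity numerically:

```python
import numpy as np

def circulant(x):
    n = len(x)
    # Column k of C(x) is x cyclically shifted by k positions.
    return np.array([np.roll(x, k) for k in range(n)]).T

x, y = np.array([1., 2., 3., 4.]), np.array([0., 1., 0., 0.])
Cx, Cy = circulant(x), circulant(y)
print(Cx @ y)                            # circular convolution x * y -> [4, 1, 2, 3]
print(np.allclose(Cx @ Cy, Cy @ Cx))     # True: circulant matrices commute
```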
What is a shift matrix, and name 2 important properties of shift matrices.
A shift matrix is an identity matrix with its rows shifted cyclically.
1) S has orthogonal eigenvectors
2) S^TS = SS^T = I
What is the connection between circulant and shift matrices?
A matrix X is circulant iff it commutes with the shift matrix: SX = XS.
Is convolution shift invariant or equivariant?
Equivariant
What is the discrete Fourier transform, and how is it connected to the shift matrix eigenvectors?
The discrete Fourier transform of a vector x is Vx, where V is the matrix with columns
v_k = (1/sqrt(n)) [1, e^(2*pi*i*k/n), ..., e^(2*pi*i*k(n-1)/n)], i.e., entry j of v_k is (1/sqrt(n)) e^(2*pi*i*k*j/n).
V is the matrix of eigenvectors of the shift matrix.
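A numerical check in NumPy that the Fourier matrix V diagonalizes the cyclic shift matrix S:

```python
import numpy as np

n = 8
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
V = np.exp(2j * np.pi * j * k / n) / np.sqrt(n)   # V[j, k] = entry j of v_k
S = np.roll(np.eye(n), 1, axis=0)                 # cyclic shift matrix
D = V.conj().T @ S @ V                            # V is unitary, so this is V^{-1} S V
print(np.allclose(D, np.diag(np.diag(D))))        # True: D is diagonal
```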
What is a physical interpretation of the Fourier basis?
It is the basis that minimizes the Dirichlet energy x^T Z x, where Z is the second-order difference matrix (circulant with shifted rows of [1, -2, 1]).
What is the idea behind the FFT?
Split the vector into odd and even indices and perform a Fourier transform on each half separately. This can be done recursively and reduces the computational complexity from O(n^2) to O(n log n).
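A textbook radix-2 Cooley-Tukey sketch (power-of-two length only), verified against np.fft.fft; note that np.fft uses the e^(-2*pi*i...) sign convention:

```python
import numpy as np

def fft(x):
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[::2]), fft(x[1::2])              # recurse on even/odd indices
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

x = np.random.rand(16).astype(complex)
print(np.allclose(fft(x), np.fft.fft(x)))              # True
```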
When are two matrices, A and B, jointly diagonalizable (diagonalized by the same set of eigenvectors)?
Iff they commute: AB = BA.
How can we do 2D shifting on an n x n matrix A?
1) Stack the columns of A into a vector of length n^2
2) Extend the shift matrix to n^2 x n^2
3) Multiply, then reshape the result back into a matrix.
How can we do a 2D Fourier transform?
Apply the Fourier transform row-wise, then column-wise (the 2D transform is separable).
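A one-glance NumPy check of this separability:

```python
import numpy as np

A = np.random.rand(4, 4)
step1 = np.fft.fft(A, axis=1)                 # transform each row
step2 = np.fft.fft(step1, axis=0)             # then each column
print(np.allclose(step2, np.fft.fft2(A)))     # True
```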
What is the number of multiplications for a convolution if the input image is 128x128x3, with 10 filters of size 5x5 (spanning all 3 input channels), no padding and stride 1?
Multiplications: 124 * 124 * 3 * 10 * 5 * 5 = 11,532,000 (the output is 124x124x10, and each output value takes 5*5*3 multiplications).
How can we get approximate shift/deformation invariance in convolutional networks?
Use pooling
What do we mean by factorized convolutions?
Use several layers with small kernels (e.g., 3x3) instead of one layer with a large kernel.
What types of unsupervised learning do we have?
1) Density estimation
2) Clustering
3) Feature learning
4) Dimensionality reduction
5) Data generation
What is kernel density estimation?
It creates a smoothed histogram by placing a kernel (e.g., a Gaussian) on each datapoint and summing the contributions.
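A minimal Gaussian-KDE sketch in NumPy (the bandwidth h and the data are arbitrary choices):

```python
import numpy as np

def kde(query, data, h=0.3):
    # Average a Gaussian bump of width h centered on every datapoint.
    diffs = (query[:, None] - data[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

data = np.array([-1.0, -0.8, 0.0, 1.1, 1.3])
grid = np.linspace(-3, 3, 7)
print(kde(grid, data))                        # estimated density on the grid
```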
What is the general GAN architecture?
z -> (Gen) -> x’
x’ and x -> Disc -> Classification (Real/ Fake)
What is the general cGAN architecture?
[z, c’] -> Gen -> x’
[x’, c’] and [x, c] -> Disc -> Classification (Real/ Fake)
What is the GAN Loss?
For the discriminator (to maximize):
sum log(D(x)) + log(1 - D(G(z)))
For the generator (to maximize, non-saturating form):
sum log(D(G(z)))
Describe the algorithm for training GANs.
1) Draw random batch from real samples
2) Draw random z batch and create fake samples G(z)
3) Do gradient descent on discriminator on both batches
4) Run discriminator on samples from generator
5) Do gradient descent on generator
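A minimal PyTorch training-loop sketch matching steps 1-5; `gen`, `disc`, `real_loader`, and `z_dim` are hypothetical and must be defined elsewhere:

```python
import torch

opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)

for x_real in real_loader:                     # 1) batch of real samples
    z = torch.randn(x_real.size(0), z_dim)     # 2) random z batch -> fakes G(z)
    x_fake = gen(z)
    # 3) discriminator step on both batches (detach so G is not updated here)
    loss_d = -(torch.log(disc(x_real)).mean()
               + torch.log(1 - disc(x_fake.detach())).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 4-5) run D on generator samples and take a generator step
    #      (non-saturating loss: maximize log D(G(z)))
    loss_g = -torch.log(disc(x_fake)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```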
Name some architecture guidelines that are usually an advantage for GANs.
1) Use ReLU in the generator and Leaky ReLU in the discriminator
2) Use a fully convolutional network
3) Use batch norm in all layers except the generator's output layer and the discriminator's input layer
4) Use tanh on the generator output
5) Replace pooling with strided and fractionally strided convolutions
What is an adversarial autoencoder?
A normal autoencoder where we also have a discriminator on the latent z. The discriminator can force the z-distribution to match a pre-chosen distribution.
How can we train a network for image-to-image translation (sketch to image, …)?
Use a GAN. Feed the sketch to the generator, and feed pairs of [sketch, generated image] and [sketch, true image] to the discriminator.
Regarding RNNs, name at least one example of:
1) Many to one problems
2) One to many problems
3) Many to Many problems
1) Sentiment analysis
2) Image captioning
3) Translation
What is:
1) Lateral feedback?
2) Indirect feedback?
3) Direct feedback?
1) Feedback to a cell in the same layer
2) Feedback connection to a cell in an earlier layer
3) Feedback from the output to the input of the same cell
What is a Hopfield network?
A recurrent network with symmetric connections between all cells.
How can we make it easier to train deep RNNs?
Batch normalization, dropout, bi-directional RNN.
What is the best way to deal with vanishing gradients in RNNs?
Change from the basic RNN cell to LSTM or GRU.
In an LSTM, what is the purpose of:
1) The forget gate
2) The input gate
3) The output gate
1) f = sigmoid(W_f [h', x] + b_f) is multiplied elementwise with the cell state; it can reduce the importance of elements of the previous state.
2) i = sigmoid(W_i [h', x] + b_i) is multiplied elementwise with the tanh of the candidate input, controlling what is added to the cell state.
3) o = sigmoid(W_o [h', x] + b_o) is multiplied elementwise with the tanh of the cell state, controlling what is sent to the output.
What is the formula for the new cell state?
c = f * c' + i * g, where * is elementwise multiplication, g = tanh(W_g [h', x] + b_g), and ' indicates the previous time step.
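A one-step LSTM cell sketch in NumPy using exactly these gates; the W_*/b_* parameters (acting on the concatenated [h', x]) are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_g, b_g):
    hx = np.concatenate([h_prev, x])   # [h', x]
    f = sigmoid(W_f @ hx + b_f)        # forget gate: scales the old cell state
    i = sigmoid(W_i @ hx + b_i)        # input gate: scales the candidate update
    o = sigmoid(W_o @ hx + b_o)        # output gate: scales the emitted state
    g = np.tanh(W_g @ hx + b_g)        # candidate cell update
    c = f * c_prev + i * g             # new cell state (elementwise products)
    h = o * np.tanh(c)                 # new hidden state / output
    return h, c
```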
What are peephole connections?
Connections from the previous cell state to the inputs of all gates.
What gates does the GRU have?
Reset gate and update gate.
Name two ways we can implement many-to-one.
We can use just the last output, or the mean/max/sum over all outputs.
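Both readouts in a short PyTorch sketch (shapes are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)              # 4 sequences, 10 time steps, 8 features
out, _ = rnn(x)                        # out: (4, 10, 16), one vector per step
last = out[:, -1, :]                   # option 1: use only the last output
pooled = out.mean(dim=1)               # option 2: mean over all outputs
```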
When would we use CTC (Connectionist temporal classification)?
When we have an input where we don’t want to label each “frame”, but the whole input at once. For example, audio-to-text.
What is the CTC loss?
L = -ln sum_p Pr(p|X), summed over all paths p that collapse to the correct output.
How is the CTC loss calculated?
Using the forward-backward algorithm. If a(s) denotes the total probability of all paths reaching node s, and b(s) the total probability of all paths continuing from s, then L = -ln sum a(s)b(s) for all s in a correct path.
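For tiny inputs the CTC sum can be brute-forced, which makes the definition concrete; this sketch enumerates every frame-level path and keeps those that collapse to the target (the forward-backward algorithm computes the same sum efficiently):

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    merged = [k for k, _ in itertools.groupby(path)]   # merge repeated labels
    return tuple(k for k in merged if k != blank)      # then remove blanks

def ctc_loss(probs, target, blank=0):
    # probs: (T, K) per-frame label distributions; target: tuple of labels.
    T, K = probs.shape
    total = sum(np.prod(probs[np.arange(T), list(path)])
                for path in itertools.product(range(K), repeat=T)
                if collapse(path, blank) == target)
    return -np.log(total)

probs = np.full((3, 3), 1 / 3)          # 3 frames, uniform over {blank, a, b}
print(ctc_loss(probs, (1,)))            # -ln(6/27): six paths collapse to "a"
```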
What is the problem with choosing the “best” path in CTC, and how can it be solved?
Choosing the most probable path does not necessarily give the most probable output, since many paths can collapse to the same output. This can be improved by beam search, prefix search decoding, and constrained decoding.
How can we solve problems with a high input dimension at each timestep and several timesteps?
CNN + RNN
What is BPTT (backpropagation through time)?
In BPTT we run the forward phase for the entire sequence (over time) without updating the weights. We then calculate the loss over all outputs of the RNN and let the gradient propagate back to earlier states (through time).
What is sequence representation learning with RNNs?
Use an RNN “encoder” to get a single output from a sequence, and an RNN “decoder” to turn that output back into a sequence. We can also use “attention”: a combination of the outputs from each cell in the encoder that is used as additional input to each cell in the decoder.
In the taxonomy of generative models, how is the:
1) VAE classified?
2) GAN classified?
1) Generative model -> Explicit density -> approximate density -> Variational
2) Generative model -> Implicit density -> Direct
What is the difference between a discriminative and a generative model?
A discriminative model attempts to estimate p(y|x), the probability of the target given the data.
A generative model attempts to estimate the joint p(y, x). This distribution can then be sampled from.
Both models can be used for both regression and classification problems.
What is the optimal discriminator value in GANs?
D*(x) = p_data(x) / (p_G(x) + p_data(x))
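A short derivation sketch: pointwise in x, the discriminator maximizes f(D) = p_data(x) log D + p_G(x) log(1 - D); setting the derivative to zero gives the answer:

```latex
f'(D) = \frac{p_{\text{data}}(x)}{D} - \frac{p_G(x)}{1 - D} = 0
\quad\Longrightarrow\quad
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}
```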