Exam Flashcards
How can we create a classifier if we have a large dataset that is only partially labeled?
- Train an autoencoder on the unlabeled data; the latent vector will represent features.
- Remove the decoder and train a classifier on the latent vectors from the encoder. The classifier and encoder can then be fine-tuned jointly.
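The two-stage recipe above can be sketched end to end. A minimal numpy illustration (all data, architecture choices, and hyperparameters are made up for the demo): a linear autoencoder is fit on unlabeled data, then a logistic classifier is trained on the encoder's latent codes using only a small labeled subset.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 2

# Unlabeled data: two well-separated Gaussian clusters in d dimensions.
mu = np.zeros(d)
mu[0] = 4.0
X_unlab = np.vstack([rng.normal(+mu, 1.0, (250, d)),
                     rng.normal(-mu, 1.0, (250, d))])

# 1) Train a linear autoencoder on the unlabeled data by gradient descent.
W_enc = rng.normal(0, 0.1, (d, k))   # encoder: z = x @ W_enc
W_dec = rng.normal(0, 0.1, (k, d))   # decoder: x_hat = z @ W_dec
lr = 5e-3
for _ in range(300):
    Z = X_unlab @ W_enc
    err = Z @ W_dec - X_unlab                            # reconstruction error
    W_dec -= lr * Z.T @ err / len(X_unlab)
    W_enc -= lr * X_unlab.T @ (err @ W_dec.T) / len(X_unlab)

# 2) Drop the decoder; train a logistic classifier on the latent codes of a
#    small labeled subset (only 20 labels are used).
X_lab = np.vstack([rng.normal(+mu, 1.0, (10, d)),
                   rng.normal(-mu, 1.0, (10, d))])
y_lab = np.array([1] * 10 + [0] * 10)
Z_lab = X_lab @ W_enc
w = np.zeros(k)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(Z_lab @ w)))
    w -= 0.1 * Z_lab.T @ (p - y_lab) / len(y_lab)

# The learned latent features separate the classes on fresh data.
X_test = np.vstack([rng.normal(+mu, 1.0, (100, d)),
                    rng.normal(-mu, 1.0, (100, d))])
y_test = np.array([1] * 100 + [0] * 100)
acc = np.mean(((X_test @ W_enc @ w) > 0) == y_test)
print(f"test accuracy: {acc:.2f}")
```

A nonlinear encoder would normally be used; a linear one keeps the sketch short while still showing the pipeline.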
How can the L2 regularizer be interpreted in:
1) A MAP setting?
2) A geometric setting?
3) As noise?
1) As a Gaussian prior on the weights
2) Attenuation of the weight components along eigenvectors of the Hessian with small eigenvalues
3) Noise added to the input data, x' = x + e where e is Gaussian.
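Interpretation 3 can be checked numerically for a linear model: averaging the squared error over Gaussian input noise adds exactly sigma^2 ||w||^2 to the noise-free loss. A small Monte Carlo sketch (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)
sigma = 0.5

# Monte Carlo estimate of the loss when Gaussian noise corrupts the inputs.
trials = 20000
mc = 0.0
for _ in range(trials):
    E = rng.normal(0, sigma, size=(n, d))
    mc += np.mean(((X + E) @ w - y) ** 2)
mc /= trials

# Identity: E_e[((x+e)^T w - y)^2] = (x^T w - y)^2 + sigma^2 ||w||^2,
# i.e. input noise acts as an L2 penalty with strength sigma^2.
analytic = np.mean((X @ w - y) ** 2) + sigma**2 * np.sum(w**2)
print(mc, analytic)
```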
For a sigmoid and tanh, how can we explain that small weights reduce the model capacity?
With small weights the pre-activations stay near zero, so the activation functions operate mostly in their (near-)linear region and the network behaves more like a linear model.
What are the effects of these regularizers:
1) Early stopping
2) Dropout
3) Batch normalization
1) Constrains the weights, similar to L2 regularization
2) Acts as ensemble learning
3) Reduces internal covariate shift
For a random variable x_n approximating a random variable x:
1) What does it mean that x_n is unbiased?
2) What does it mean that x_n is asymptotically unbiased?
3) What is the MSE in terms of bias and variance?
4) What does it mean that x_n is consistent?
5) What does it mean that x_n is efficient?
1) E[x_n - x] = 0
2) lim n -> inf E[x_n - x] = 0
3) MSE = bias^2 + variance
4) lim n -> inf P(|x_n - x| > e) = 0 for all e > 0
5) No other estimator has a lower MSE as n -> inf.
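The decomposition in 3) can be verified empirically. A quick numpy check using a deliberately biased estimator (shrinking the sample mean; the setup is made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0
n, trials = 10, 100000

# A deliberately biased estimator: shrink the sample mean by a factor 0.9.
samples = rng.normal(true_mean, 1.0, size=(trials, n))
est = 0.9 * samples.mean(axis=1)

bias = est.mean() - true_mean        # approx. 0.9*2 - 2 = -0.2
var = est.var()                      # approx. 0.81 * 1/n
mse = np.mean((est - true_mean) ** 2)
print(mse, bias**2 + var)            # the two agree: MSE = bias^2 + variance
```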
What is the formula for the optimal parameters in linear regression?
w = (X^TX)^{-1}X^Ty
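A direct numpy check of the formula (data made up for the demo; `np.linalg.solve` is used instead of forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true          # noiseless targets, so the fit is exact

# w = (X^T X)^{-1} X^T y, solved without explicitly inverting X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                # recovers w_true
```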
What is:
1) Estimation error or Generalization gap?
2) Approximation error?
3) Optimization error?
1) Error from using the empirical loss instead of the expected loss
2) Error from restricting the model class
3) Error from not finding the exact global optimum
What is a sufficient condition for a point being a local minimum?
The gradient is zero and the Hessian is positive definite. (A positive semidefinite Hessian is only necessary; if the function is convex, a zero gradient alone is sufficient, and the point is even a global minimum.)
What is the definition of:
1) A convex set?
2) A convex function?
1) For x, y in A, lx + (1-l)y is also in A for all l in [0,1].
2) f(lx_1 + (1-l)x_2) <= lf(x_1) + (1-l)f(x_2) for all l in [0,1].
Or equivalently, if the epigraph is convex.
If a function L has hessian H and:
1) Strong convexity, so H >= mI
2) Lipschitz gradient, so H <= MI
3) we use a stepsize a = 2/(m+M)
What can we say about the convergence rate?
Linear convergence with rate c = (M - m)/(M + m) <= 1 - m/M, i.e. ||x_{k+1} - x*|| <= c ||x_k - x*||.
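On a quadratic whose Hessian eigenvalues are exactly m and M, the contraction factor c = (M - m)/(M + m) is attained exactly, which makes it easy to check numerically (values chosen for the demo):

```python
import numpy as np

# L(x) = 0.5 x^T H x with Hessian eigenvalues exactly m and M.
m, M = 1.0, 10.0
H = np.diag([m, M])
a = 2 / (m + M)          # stepsize from the card
c = (M - m) / (M + m)    # contraction factor per step

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - a * (H @ x)  # plain gradient step

# ||x_k|| shrinks by exactly c each step on this quadratic.
print(np.linalg.norm(x), c**50 * np.sqrt(2))
```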
What does it mean that something has linear convergence, and what is the overall complexity of getting the error below e with a linear convergence rate?
Linear convergence means the error shrinks by a constant factor each iteration, so we need O(log(1/e)) steps to get L(x_k) - L(x*) < e.
With each full-gradient step costing O(n), the total complexity is O(n log(1/e)).
What is the complexity of getting the expected error smaller than e with one-sample stochastic gradient descent?
O(1/e). SGD converges sublinearly, so it needs O(1/e) iterations, but each iteration costs only O(1) instead of O(n).
In stochastic gradient descent, what is the tradeoff between:
1) A small batch size vs. a large batch size?
2) A small step size vs. a large step size?
1) Small batch size:
- Cheaper iterations, but stalls at a less accurate result.
- For non-convex optimization, escapes local minima more easily.
2) Small step size:
- Slower initial convergence, but a better final result.
- For non-convex optimization, harder to escape local minima.
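The step-size tradeoff shows up already on a toy problem: one-sample SGD on a 1D quadratic with a constant step stalls at a noise floor proportional to the step size. A minimal numpy sketch (all constants arbitrary):

```python
import numpy as np

def sgd(step, iters, seed):
    """One-sample SGD on L(x) = 0.5 x^2: the stochastic gradient is the
    true gradient x plus unit Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = 5.0
    for _ in range(iters):
        x -= step * (x + rng.normal())
    return abs(x)

# Large step: fast initial progress but stalls at a high noise floor.
# Small step: slower start but a much better final result.
big = np.mean([sgd(0.5, 5000, s) for s in range(20)])
small = np.mean([sgd(0.01, 5000, s) for s in range(20)])
print(big, small)
```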
How can we decrease
1) Approximation error?
2) Estimation error?
3) Optimization error?
1) Use a more expressive model (for example, less regularization)
2) Use a less expressive model (for example, more regularization), or use a larger sample size
3) Run more iterations
How does the number of samples needed to approximate a Lipschitz function to accuracy e grow with the number of dimensions, d, of the input?
O((1/e)^d)
What are the ideas behind:
1) SGD with momentum
2) Adagrad
3) RMSProp (or AdaDelta)
4) Adam
1) Add a fraction of the previous gradients at each step
2) Divide the gradient elementwise by the magnitude of all accumulated past gradients
3) Same as Adagrad, but with a decaying average
4) A combination of 1 and 3
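The update rules fit in a few lines each. A minimal numpy sketch on a badly conditioned quadratic (the learning rates and decay factors are arbitrary demo choices, not recommended defaults):

```python
import numpy as np

# Update rules for SGD+momentum, RMSProp, and Adam on the badly
# conditioned quadratic L(x) = 0.5 x^T H x.
H = np.diag([1.0, 100.0])
grad = lambda x: H @ x

def momentum(x, steps=500, lr=0.01, beta=0.9):
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)        # 1) keep a fraction of past gradients
        x = x - lr * v
    return x

def rmsprop(x, steps=500, lr=0.01, beta=0.9, eps=1e-8):
    s = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        s = beta * s + (1 - beta) * g**2      # 3) decaying average of g^2
        x = x - lr * g / (np.sqrt(s) + eps)   # elementwise normalization
    return x

def adam(x, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m, s = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g             # momentum part (idea 1)
        s = b2 * s + (1 - b2) * g**2          # RMSProp part (idea 3)
        m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)  # bias correction
        x = x - lr * m_hat / (np.sqrt(s_hat) + eps)
    return x

norms = {f.__name__: np.linalg.norm(f(np.array([1.0, 1.0])))
         for f in (momentum, rmsprop, adam)}
print(norms)   # all end up close to the optimum at 0
```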
How do you get the circulant matrix C(x) from a vector x?
By cyclically shifting x along the rows of C(x).
How can a convolution be described by circulant matrices?
x *’ y = C(x)y
What are some important properties of circulant matrices?
1) Commutativity
2) Associativity
3) The product of circulant matrices is circulant
These can be proven using the fact that circulant matrices are equivalent to convolutions.
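These properties are easy to verify numerically. A numpy sketch (it adopts the common convention that the first column of C(x) is x, and uses the FFT to compute the circular convolution):

```python
import numpy as np

def circulant(x):
    """C(x): first column is x; column i is x cyclically shifted down i steps."""
    return np.column_stack([np.roll(x, i) for i in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, -1.0, 2.0, 0.0])
C, D = circulant(x), circulant(y)

# Circular convolution x *' y equals C(x) @ y (here computed via the FFT).
conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))
print(np.allclose(C @ y, conv))              # True

# Circulants commute, and their product is again circulant.
print(np.allclose(C @ D, D @ C))             # True
print(np.allclose(C @ D, circulant(C @ y)))  # True
```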
What is a shift matrix, and name two important properties of shift matrices.
A shift matrix is an identity matrix with its rows cyclically shifted by one.
1) S has orthogonal eigenvectors
2) S^TS = SS^T = I
What is the connection between circulant and shift matrices?
A matrix X is circulant iff it commutes with the shift matrix, SX = XS.
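A quick numerical check of both the shift-matrix properties and the commutation characterization (sizes arbitrary):

```python
import numpy as np

n = 5
# Shift matrix: the identity with its rows cyclically shifted by one,
# so S @ v moves every entry of v one position down (cyclically).
S = np.roll(np.eye(n), 1, axis=0)

x = np.arange(1.0, n + 1)
C = np.column_stack([np.roll(x, i) for i in range(n)])  # circulant C(x)

print(np.allclose(S.T @ S, np.eye(n)))   # S is orthogonal: S^T S = S S^T = I
print(np.allclose(S @ C, C @ S))         # circulant => commutes with S
```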
Is convolution shift invariant or equivariant?
Equivariant
What is the discrete Fourier transform, and how is it connected to the shift matrix eigenvectors?
The discrete Fourier transform of a vector x is Vx, where V is the matrix with columns
v_k = 1/sqrt(n) [1, e^(2 pi i k/n), ..., e^(2 pi i k j/n), ..., e^(2 pi i k (n-1)/n)]
V is the matrix of eigenvectors of the shift matrix.
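That V diagonalizes the shift matrix can be checked directly: conjugating S by the Fourier matrix yields a diagonal matrix whose entries have unit modulus. A numpy sketch (n = 8 is arbitrary):

```python
import numpy as np

n = 8
k = np.arange(n)
# Fourier matrix: column k has entries v_k[j] = exp(2*pi*i*j*k/n)/sqrt(n).
V = np.exp(2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

S = np.roll(np.eye(n), 1, axis=0)   # cyclic shift matrix

# The columns of V are eigenvectors of S, so V^H S V is diagonal with
# unit-modulus eigenvalues e^(-2*pi*i*k/n).
D = np.conj(V).T @ S @ V
off = np.max(np.abs(D - np.diag(np.diag(D))))
print(off, np.abs(np.diag(D)))
```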
What is a physical interpretation of the Fourier basis?
It is the basis that minimizes the Dirichlet energy, x^T Z x, where Z is the second-order difference matrix (circulant with shifted rows of [1, -2, 1]).