Deep Learning Flashcards
data for convolutional networks
grid-like topology (1D time series and 2D images)
distinguishing feature of convolutional networks
CNNs use convolution in place of general matrix multiplication in at least one layer
convolution function
integral of the product of two functions (after one is reversed and shifted)
(f * g)(t) = ∫ f(a)g(t-a) da
think of f as a measurement and g as a weighting function that emphasizes the most recent measurements
parts of convolution
main function: input, an n-dimensional array of data
weighting function: kernel, an n-dimensional array of learnable parameters
output: feature map
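A minimal sketch, assuming NumPy, of these pieces in the discrete 1D case: the input array plays the role of the measurement f, the kernel is the weighting function g, and the output is the feature map. The numbers are illustrative only.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # input: measurements over time
w = np.array([0.5, 0.3, 0.2])                 # kernel: w[0] weights the most recent value most

# np.convolve flips the kernel, matching (f * g)(t) = sum_a f(a) g(t - a) in discrete form
feature_map = np.convolve(x, w, mode="valid")
print(feature_map)  # approx. [1.3, 2.3, 3.3, 4.3]: each entry is a weighted sum of 3 neighboring inputs
```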
computational features of convolutional networks
- sparse interactions - kernel is usually much smaller than the input
- tied weights - the same set of kernel weights is applied at every position of the input
- equivariant to translation - if the input is translated, the convolution output is translated by the same amount. An event detector run over a time series will find the same event if it is shifted in time. (See the sketch below.)
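A minimal sketch, assuming NumPy, of translation equivariance: convolving a shifted copy of a signal produces the same feature map, shifted by the same amount. The edge-detector kernel and signal values are made up for illustration.

```python
import numpy as np

kernel = np.array([1.0, -1.0])                    # a tiny "edge detector"
signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
shifted = np.roll(signal, 2)                      # translate the input by 2 steps

out_original = np.convolve(signal, kernel, mode="valid")
out_shifted = np.convolve(shifted, kernel, mode="valid")

print(out_original)  # the rising edge is detected at one position...
print(out_shifted)   # ...and the same detection appears 2 positions later for the shifted input
```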
stacked convolutional layers
receptive fields of deeper units are larger (but indirect) compared to the receptive fields of shallower units
if layer 2 has a kernel width of 3, then each of its hidden units receives input from 3 input units.
if layer 3 also has a kernel width of 3, then each of its hidden units indirectly sees 5 inputs with stride-1 convolutions (or up to 9 if the layer-2 windows don't overlap, e.g., with stride 3); see the sketch below.
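A minimal sketch of the receptive-field arithmetic described above; receptive_field is a hypothetical helper, not a library function.

```python
def receptive_field(kernel_widths, strides=None):
    """Receptive field (in input units) of one unit at the top of a stack of convolutions."""
    strides = strides or [1] * len(kernel_widths)
    rf, jump = 1, 1
    for k, s in zip(kernel_widths, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input positions
        jump *= s             # strides compound for the layers above
    return rf

print(receptive_field([3, 3]))                   # 5: two stacked width-3 convs, stride 1
print(receptive_field([3, 3], strides=[3, 3]))   # 9: non-overlapping width-3 windows
```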
stages of a convolutional layer
- Convolution stage: several convolutions in parallel produce a set of linear activations
- Detector stage: Nonlinear function on linear activations
- Pooling stage: Replace output at some location with a summary statistic of nearby units
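A minimal sketch, assuming PyTorch, of the three stages chained together; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution stage: a set of linear activations
    nn.ReLU(),                                  # detector stage: elementwise nonlinearity
    nn.MaxPool2d(2),                            # pooling stage: max over 2x2 neighborhoods
)

x = torch.randn(1, 1, 28, 28)  # (batch, channels, height, width)
print(layer(x).shape)          # torch.Size([1, 8, 14, 14])
```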
pooling and translation
small changes in location won’t make big changes to the summary statistics in the regions that are pooled together
pooling makes network invariant to small translations
what convolution hard codes
the topology of the input (which inputs are neighbors on the grid)
(non-convolutional models would have to discover this topology during learning)
local connection (as opposed to convolution)
like a convolution with a kernel width (patch size) of n, except with no parameter sharing.
each unit has a receptive field of n, but the incoming weights don't have to be the same in every receptive field.
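A rough parameter-count comparison (illustrative numbers only) of a convolutional layer, a locally connected layer, and a fully connected layer mapping n inputs to n outputs:

```python
n, k = 1000, 3  # n inputs -> n outputs, kernel/patch width k (ignoring biases and edge effects)

convolutional = k            # one shared kernel for every position
locally_connected = n * k    # a separate k-wide weight vector per output unit
fully_connected = n * n      # every output connected to every input

print(convolutional, locally_connected, fully_connected)  # 3 3000 1000000
```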
iterated pixel labelling
suppose a convolution step provides a label for each pixel. repeatedly applying the convolution to the labels creates a recurrent convolutional network.
repeated convolutional layers with shared weights across layers is a kind of recurrent network.
why convolutional networks can handle different input sizes
the same kernel is simply applied more or fewer times, so convolution works on any input size; to get a fixed-size output (e.g., for a classifier), pool over regions whose size scales with the input size (see the sketch below).
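A minimal sketch, assuming PyTorch, of this idea: the same convolution kernel runs over inputs of different sizes, and an adaptive pooling layer (regions scale with input size) yields a fixed-size output.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # same kernel, any input size
pool = nn.AdaptiveAvgPool2d((4, 4))                # pooling regions scale with the input size

for h, w in [(32, 32), (64, 48)]:
    x = torch.randn(1, 3, h, w)
    print(pool(conv(x)).shape)  # torch.Size([1, 16, 4, 4]) for both input sizes
```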
convolutions for 2D audio (spectrograms)
convolution over the time axis: invariant to shifts in time
convolution over the frequency axis: invariant to shifts in frequency (e.g., the same sound at a different pitch)
primary visual cortex
- V1 has a 2D structure matching the 2D structure of the retinal image
- Simple cells inspired detectors in CNNs and respond to features in small localized receptive fields
- Complex cells inspired pooling units. They also respond to features but are invariant to small changes in input position.
- Inferotemporal cortex responds like the last layers of a CNN
differences between human vision and convolutional networks
- Human vision is low resolution outside of the fovea; CNNs process the whole image at full resolution
- Vision integrates with other senses
- Top down processing happens in human system
- Human neurons likely have different activation and pooling functions
regularization
- Modifications to training regime to prevent overfitting
- Trading increased training error for reduced test error
- Trading increased bias for reduced variance
dataset augmentation strategies
- Adding transformations to training input (e.g., translating images a few pixels)
- Adding random noise to input data
- Model has to find solutions that are insensitive to small perturbations of the input
- Not just a narrow local minimum but a flat plateau
- Adversarial training
- Create inputs near real examples that the network will probably misclassify, then train on them with the correct labels
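A minimal sketch, assuming NumPy, of the first two augmentation strategies (small translations plus additive noise); augment and its parameters are illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, max_shift=2, noise_std=0.05):
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, (dy, dx), axis=(0, 1))            # translate a few pixels
    return shifted + rng.normal(0.0, noise_std, image.shape)   # add small random noise

image = np.zeros((28, 28))   # stand-in for a training image
augmented = augment(image)   # a slightly different example that keeps the same label
```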
noise robustness
- Adding noise to hidden units or weights
- Noise on weights captures uncertainty about parameter estimates
- Adding noise to output units
- Assume some fraction of labels are wrong (label smoothing), so the model doesn't overfit to bad training data
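A minimal sketch, assuming NumPy, of noise on the output targets via label smoothing; smooth_labels and the choice of eps are illustrative.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    k = one_hot.shape[-1]                   # number of classes
    return one_hot * (1.0 - eps) + eps / k  # soften the hard 0/1 targets

print(smooth_labels(np.array([0.0, 1.0, 0.0])))  # approx. [0.033, 0.933, 0.033]
```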
semi-supervised training
- Use both labeled P(x,y) and unlabeled P(x) data to estimate P(y|x)
- Want to learn a latent representation
- Have the generative model share representation parameters with the discriminative model
- Like having a prior that the structure of P(x) is connected to the structure of P(y|x)
multi-task learning
- Have the model do different kinds of tasks
- Assume there is a set of factors that account for variance in input and that these factors are shared by different tasks. (Each task uses a subset of these factors.)
- Shared parts of the model should learn good values because they generalize across tasks
- Common architecture:
  - Input layer
  - Shared representation layers
  - Task-specific representation layers
  - Output layers
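A minimal sketch, assuming PyTorch, of that common architecture: a shared trunk feeding two hypothetical task-specific heads; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared representation layers
        self.head_a = nn.Linear(hidden, n_classes)  # task A head (e.g., classification)
        self.head_b = nn.Linear(hidden, 1)          # task B head (e.g., regression)

    def forward(self, x):
        h = self.shared(x)  # factors shared across tasks
        return self.head_a(h), self.head_b(h)

model = MultiTaskNet()
logits_a, output_b = model(torch.randn(8, 64))
```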
early stopping
- As you overfit a model, training error continues to decrease while validation error starts to rise.
- So just stop training at the minimum of the validation error.
- Think of the number of training steps as a hyperparameter.
- Similar to L2 regularization, but instead of training several models to find the best L2 coefficient, we find the optimal number of steps during a single training run.
- Requires holding out extra data for validation.
- Alternatively, remember the number of steps to the minimum and retrain on training+validation data, stopping after that optimal number of steps.
- Restricts the model to a smaller volume of parameter space
- With learning rate R, the parameters can only move about R*n_steps from their initialization (see the sketch below)
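A minimal sketch of an early-stopping loop with a patience counter; train_step and val_error are hypothetical stand-ins for real training and validation code.

```python
def train_step(params):        # hypothetical: one pass over the training data
    return params

def val_error(params, step):   # hypothetical: error on the held-out validation set
    return (step - 30) ** 2    # toy curve that bottoms out at step 30

params, best_err, best_step, patience, waited = {}, float("inf"), 0, 5, 0
for step in range(200):
    params = train_step(params)
    err = val_error(params, step)
    if err < best_err:
        best_err, best_step, waited = err, step, 0  # checkpoint the best parameters here
    else:
        waited += 1
        if waited >= patience:                      # validation error stopped improving
            break

print(best_step)  # 30: the learned "number of training steps" hyperparameter
```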
parameter tying and sharing
- Sharing: Force groups of parameters within a model to be equal
- Tying: Penalize a model's parameters for deviating from the corresponding parameters of another model (a soft constraint, e.g., an L2 penalty on the difference)
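A minimal sketch, assuming NumPy, of a tying penalty as an L2 term on the difference between two models' corresponding weights; tying_penalty and lam are illustrative.

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.01):
    # L2 penalty on the difference between corresponding parameters, added to the loss
    return lam * np.sum((w_a - w_b) ** 2)

print(tying_penalty(np.ones(4), np.zeros(4)))  # 0.04
```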
sparse representations (regularization)
- Penalize the activations of the hidden units (e.g., an L1 penalty on the representation)
- Many of the elements of the (hidden) representation are zero or close to zero.
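A minimal sketch, assuming NumPy, of a sparsity penalty on the hidden activations; the penalty weight is illustrative.

```python
import numpy as np

def sparsity_penalty(hidden_activations, lam=0.001):
    return lam * np.sum(np.abs(hidden_activations))  # L1 term added to the training loss

h = np.array([0.0, 2.5, 0.0, 0.1, 0.0])  # a sparse representation: mostly zeros
print(sparsity_penalty(h))               # 0.0026
```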
bagging
- Bootstrap aggregating
- Bootstrap sample k new training datasets and train k new models
- On average about two thirds (≈63%) of the original examples appear in each resampled dataset
- Neural network ensembles typically use 5-10 models
- Training many more large networks quickly becomes impractical
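A minimal sketch, assuming NumPy, of bootstrap sampling for bagging, showing that roughly 63% of the original examples appear in each resample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5  # training set size, number of ensemble members

for _ in range(k):
    idx = rng.integers(0, n, size=n)          # sample n examples with replacement
    print(round(len(np.unique(idx)) / n, 2))  # ~0.63 of the examples appear in each resample
```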