Neural network fundamentals flashcards
Neural network representation


Why is deep learning taking off only now?
- Large amounts of data
- Faster computation (specialised GPUs)
- Better algorithms
- ReLU activation function trains faster than sigmoid (sigmoid's gradient shrinks for large local fields)
Describe Tanh() in terms of sigmoid
Tanh is a shifted and rescaled version of sigmoid with range -1 to 1: tanh(z) = 2·sigmoid(2z) - 1
Tanh activation vs sigmoid activation
- Tanh is usually superior because the mean of the layer's activations is closer to zero, which makes learning easier for the next layer
- Only use sigmoid as the activation of the output layer, when doing binary classification (0/1 output values)
- A downside of both activation functions is that if z (the local field of the neuron) is very large, the slope/gradient is very small, making gradient descent slow
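A minimal numpy sketch (illustrative, not from the cards) comparing both activations and their saturating gradients:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
s = sigmoid(z)
print(np.tanh(z))         # tanh(z) = 2*sigmoid(2z) - 1, range (-1, 1)
print(s * (1 - s))        # sigmoid'(z) = s(1-s): nearly 0 at |z| = 10
print(1 - np.tanh(z)**2)  # tanh'(z) = 1 - tanh(z)^2: saturates the same way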
Softmax function
- Each output is between 0 and 1
- The outputs sum to 1
- Can be viewed as a vector of probabilities
- Only used in the output layer, for multiclass classification
- Usually combined with the cross-entropy loss
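A minimal numpy sketch of softmax (the max-subtraction trick is a standard numerical-stability detail, not from the card):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow; result unchanged
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # each entry is in (0, 1) and the entries sum to 1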

Softmax activation function:
Visualize a simple neural net with 1 softmax layer
- Equivalent to multiclass logistic regression
- Linear decision boundaries separating the classes

Forward propagation
- In the vectorized form we can feed the whole dataset X at once
- Each column of a layer's activation matrix corresponds to one input sample

Vectorized forward propagation
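A minimal vectorized forward-pass sketch for an assumed 2-layer network (the names W1, b1, W2, b2 and the shapes are illustrative):

import numpy as np

def forward(X, W1, b1, W2, b2):
    # X has shape (n_features, m): each column is one input sample
    Z1 = W1 @ X + b1                 # broadcasting adds b1 to every column
    A1 = np.tanh(Z1)                 # hidden layer activation
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))   # sigmoid output for binary classification
    return A2                        # shape (1, m): one output per sample

X = np.random.rand(3, 5)
W1, b1 = np.random.rand(4, 3), np.zeros((4, 1))
W2, b2 = np.random.rand(1, 4), np.zeros((1, 1))
print(forward(X, W1, b1, W2, b2).shape)  # (1, 5)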


Intuition in deep learning
- Earlier layers detect simpler features
- Later layers compose these features into more complex patterns (e.g. eyes, mouth)
- The same applies to other domains, not just images
- E.g. in sound, low-level features can be tones, while later layers represent complex forms such as words or phrases

Circuit theory and deep learning
- Computing XOR with one hidden layer requires an exponentially large layer (right network)
- Computing it with several layers uses far fewer neurons (left image)

Optimization-based learning

Empirical risk
- An approximation of the true expected loss, computed on the training data.
- The training data is a finite set of samples from the true distribution p(x,y); it does not capture the distribution fully
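In symbols (notation assumed, not from the card), with m training samples:

\hat{R}(\theta) = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta),\, y^{(i)}\big) \;\approx\; \mathbb{E}_{(x,y)\sim p}\big[L(f(x;\theta),\, y)\big]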

How to make a neural network generalize better on unseen data?
- Adjust the loss by adding regularization terms
- Split the data into training, validation, and test sets
Optimization-based learning

Empirical risk vs generalization

Using training data to formulate the loss

Local optima in neural networks

Problem of plateau

Maximum likelihood estimation
- A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
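In symbols (notation assumed), for i.i.d. samples x^(1), ..., x^(m):

\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(x^{(i)} \mid \theta\big) = \arg\max_{\theta} \sum_{i=1}^{m} \log p\big(x^{(i)} \mid \theta\big)

Maximizing the log-likelihood gives the same argmax and turns the product into a sum.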

One-hot encoding

When should we use the cross-entropy loss function?
For classification, when the output layer uses the softmax function
One-hot encoding and the cross-entropy loss
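A minimal numpy sketch (helper names are illustrative) combining one-hot labels with the cross-entropy loss; with one-hot labels only the log-probability of the true class contributes per sample:

import numpy as np

def one_hot(y, num_classes):
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), y] = 1.0   # put a 1 in the column of the true class
    return Y

def cross_entropy(Y, P, eps=1e-12):
    # Y: one-hot labels, P: softmax outputs; eps guards against log(0)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

Y = one_hot(np.array([0, 2]), num_classes=3)
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
print(cross_entropy(Y, P))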

Broadcasting
- Describes how numpy treats arrays with different shapes during arithmetic operations: the smaller array is "broadcast" across the larger array so that they have compatible shapes
- E.g. C = A + b, where C_ij = A_ij + b_j
- In other words, the vector b is added to each row of the matrix A
- This shorthand eliminates the need to build a matrix B with b copied into each row before doing the addition
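A minimal runnable example of the C = A + b case above:

import numpy as np

A = np.arange(6).reshape(2, 3)  # shape (2, 3)
b = np.array([10, 20, 30])      # shape (3,)
C = A + b                       # b is broadcast across each row of A
print(C)                        # [[10 21 32], [13 24 35]]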
Batch vs mini-batch
Batch means you use all of your data to compute the gradient in one iteration. Mini-batch means you only use a subset of the data in each iteration.
Epoch
- A single pass through the whole training set
- E.g. dividing a dataset of 2000 examples into batches of 500 gives 4 iterations per epoch (batch size 500, 4 iterations for 1 complete epoch)
- The number of batches equals the number of iterations per epoch
What does the training progress look like for batch training and mini-batch training?

- For batch training the loss should always go down
- If it ever goes up, something is wrong (e.g. the learning rate is too high)
- Mini-batch training is slightly noisier, but the overall trend is that the loss decreases with the number of iterations

Stochastic gradient descent
- Each data sample is its own batch (batch size = 1)
- Very noisy updates
- Will never completely converge to the minimum, but jumps around it
- Loses the speed-up from vectorization
Mini-batch gradient descent
- Train on a subset of the data (a mini-batch) rather than the whole training set
- Benefits (works best in practice compared to batch or stochastic training):
- You see progress from gradient descent without processing the entire dataset
- Converges quicker because it takes many cheaper gradient steps per epoch
- Less likely to get stuck in a local minimum than batch training
- Uses less main memory
- Reshuffle the samples into new mini-batches each epoch (see the sketch after this list)
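A minimal mini-batch loop sketch (params and compute_grads are assumed placeholders for the model's parameters and gradient function):

import numpy as np

def minibatch_gd(X, y, params, compute_grads, lr=0.01, batch_size=64, epochs=10):
    m = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(m)            # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]   # one mini-batch of indices
            grads = compute_grads(params, X[idx], y[idx])
            for k in params:
                params[k] -= lr * grads[k]         # one gradient step per mini-batch
    return params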
Batch training
- Use the whole dataset during each step of gradient descent
- Slower than mini-batch or stochastic training
- However, the progress is more accurate and precise
Visually explain
- Batch training
- Mini-batch training
- Stochastic gradient descent

Gradient descent with momentum. Describe the “momentum”
- Compute an exponentially weighted average of the gradients
- Each gradient descent step depends on the previous steps
On iteration t, compute dW, db on the current mini-batch:
vdW = β·vdW + (1-β)·dW
vdb = β·vdb + (1-β)·db
W = W - α·vdW , b = b - α·vdb
- The last line is an ordinary gradient descent step with vdW used in place of the gradient
- The vector vdW is an exponentially weighted average of the recent gradients dW, which reduces the oscillations in dW
Hyperparameters: learning rate α and the exponentially-weighted-average coefficient β
β = 0.9
Averages over roughly the last 10 gradients (≈ 1/(1-β))
β = 0.5
Averages over roughly the last 2 gradients
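A minimal sketch of one momentum step matching the formulas above (dW, db are the current mini-batch gradients; numpy arrays assumed):

def momentum_step(W, b, dW, db, vdW, vdb, beta=0.9, lr=0.01):
    vdW = beta * vdW + (1 - beta) * dW   # exponentially weighted average of dW
    vdb = beta * vdb + (1 - beta) * db   # ... and of db
    W = W - lr * vdW                     # step along the smoothed gradient
    b = b - lr * vdb
    return W, b, vdW, vdb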

RMSprop
- Keeps an exponentially weighted average of the squared gradients and divides each update by its square root
- Damps oscillating directions, which allows selecting a larger learning rate
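A minimal sketch of one RMSprop step, shown for W only (the beta default is an assumption):

import numpy as np

def rmsprop_step(W, dW, sdW, beta=0.9, lr=0.01, eps=1e-8):
    sdW = beta * sdW + (1 - beta) * dW**2     # average of squared gradients
    W = W - lr * dW / (np.sqrt(sdW) + eps)    # damps directions that oscillate
    return W, sdW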

Gradient descent with Adam. Describe Adam
- Adaptive moment estimation
- Combines momentum and RMSprop
- β1
- First-moment coefficient (exponentially weighted average of the gradients)
- Typically 0.9
- β2
- Second-moment coefficient (exponentially weighted average of the squared gradients)
- Typically 0.999
- ε = 10^-8, used to avoid dividing by zero
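A minimal sketch of one Adam step for a single parameter W (t is the iteration count, starting at 1, needed for bias correction):

import numpy as np

def adam_step(W, dW, vdW, sdW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    vdW = beta1 * vdW + (1 - beta1) * dW       # first moment (momentum)
    sdW = beta2 * sdW + (1 - beta2) * dW**2    # second moment (RMSprop)
    v_hat = vdW / (1 - beta1**t)               # bias correction: the averages
    s_hat = sdW / (1 - beta2**t)               # start at 0 and need boosting
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, vdW, sdW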

Learning rate decay
- In the initial steps of learning we can use a larger learning rate
- After a while gradient descent will jump around a minimum, so we then want a smaller learning rate
- With α0 the initial learning rate and epochNum the number of epochs run:
α = α0 / (1 + decayRate·epochNum)
α = 0.95^epochNum · α0 (exponential decay)
α = α0 halved each epoch (discrete staircase)
- The decay rate becomes another hyperparameter
- Manual decay
- Manually decrease the learning rate while gradient descent is running
- Only works when training a small number of models
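The three schedules from the card, as a small sketch (alpha0 is the initial learning rate):

def inverse_decay(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    return base**epoch * alpha0

def staircase_decay(alpha0, epoch):
    return alpha0 / 2**epoch   # halve the learning rate each epoch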

Train/dev/test set
- Split the data into train/dev/test parts
- The dev set is used to tune the hyperparameters (e.g. learning rate) of a model
- Hyperparameters in a neural network
- # Layers
- # Neurons
- Activation function
- Learning rate
- Momentum
- Choosing the split proportions
- If your data is small, use the classical proportions 70/20/10
- If your data is big, use modern (big-data era) proportions like 98/1/1
Bias/variance
- Bias: performance on the training data compared to optimal performance
- Variance: the difference between the loss on training and validation data
- High variance
- Overfitting
- Example (assume human error ≈ 0)
- Train error: 1%
- Dev error: 11%
- High bias
- Underfitting
- Example (assume human error ≈ 0)
- Train error: 15%
- Dev error: 16%
- Low bias/low variance
- Best case
- Train error matches dev error
- Example (assume human error ≈ 0)
- Train error: 0.5%
- Dev error: 1%
- High bias/High variance
- Worst case
- The model has many parameters and is flexible, yet still fits parts of the training data badly
- Example (assume human error ≈ 0 in cat/dog pictures)
- Train error: 15%
- Dev error: 30%

Basic recipe for NN
- High bias (underfitting)? Try a bigger network or training longer; getting more data is not going to help
- High variance (overfitting)? Try more data or regularization
- There is less of a bias/variance tradeoff in NNs: bias and variance can be reduced largely independently

Regularization
- The regularization strength is tuned as a hyperparameter
- L1 regularization
- The weight vector w becomes sparse (contains many zeros)
- L2 regularization
- Penalizes the (squared) Euclidean norm of the weights

Show how regularization “decays the weight”
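A sketch of the usual L2 derivation (λ is the regularization strength, m the number of samples):

J = J_{\text{orig}} + \frac{\lambda}{2m}\lVert W\rVert^2 \quad\Rightarrow\quad dW = dW_{\text{orig}} + \frac{\lambda}{m}W

W := W - \alpha\,dW = \left(1 - \frac{\alpha\lambda}{m}\right)W - \alpha\,dW_{\text{orig}}

Since (1 - αλ/m) < 1, every update multiplies the weights by a factor slightly below one, i.e. it "decays" them.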

Normalizing data
- Subtract the mean from each sample:
mu = 1/m · Σ_{i=1..m} x_i
x := x - mu
- Divide each sample by the standard deviation:
sigma^2 = 1/m · Σ_{i=1..m} x_i**2 (**2 means elementwise squaring)
x := x / sigma
Get mu and sigma from the training data and use these same values to also normalize the test data.
- Why normalize?
- The scales of the input features might differ drastically
- The different scaling of the features makes gradient descent steps oscillate
- More circular contours mean gradient descent needs fewer steps to converge
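A minimal numpy sketch (the random data is assumed, just for illustration):

import numpy as np

X_train = np.random.rand(100, 3) * [1, 100, 1000]  # features on very different scales
X_test = np.random.rand(20, 3) * [1, 100, 1000]

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma   # reuse the training-set mu and sigma on test data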

Batch normalization
- Makes hyperparameter search easier
- Makes the network much more robust to the choice of parameters
- A much bigger range of hyperparameters works well
- Applies the same normalization process used on the inputs to the z values (the inputs to the neurons) at each layer
- Later layers become more robust to changes in earlier layers
- Allows each layer to learn somewhat independently of the other layers
- Reduces covariate shift
- If the distribution of X changes we have to re-train the learning algorithm; this is true even if the ground-truth mapping X => y remains unchanged
- Batch norm at test time
- Use a separate estimate of the mean and variance obtained during training, not from the test set
- E.g. an exponentially weighted average (or any other averaging method) over the mini-batches
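A minimal sketch of the batch-norm forward pass at training time (gamma and beta are the learned scale and shift; the layout with samples along axis 1 is an assumption):

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)      # per-neuron mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)      # per-neuron variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta            # learned scale/shift can undo the normalization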

Explain informally how regularization reduces variance
- Penalizing large weights pushes many weights toward zero, which makes the network effectively simpler (closer to linear), so it fits the noise in the training data less

Give reasons why image classification is so hard
- Occluded/hidden parts
- Deformable objects
- Viewing objects from a weird angle
Define gradient descent, and show how to update the weights
- Compute the gradient of the loss and update the weights in the opposite direction of the gradient.
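A minimal sketch of one update step (params and grads as dicts of numpy arrays is an assumed layout):

def gradient_descent_step(params, grads, lr=0.01):
    # Move every parameter a small step against its gradient
    return {k: params[k] - lr * grads[k] for k in params}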

Gradient descent. How does each step change the cost function?

Forward mode differentiation

Backward-mode differentiation
- Also known as backprop

Exponentially weighted average

Bias correction
- Associated with the exponentially weighted average: early values of v_t are biased toward 0 because v_0 = 0, so divide by (1 - β^t)
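A minimal sketch showing the average with and without bias correction (numpy assumed):

import numpy as np

def ewa(xs, beta=0.9, bias_correct=True):
    v, out = 0.0, []
    for t, x in enumerate(xs, start=1):
        v = beta * v + (1 - beta) * x          # v_t = beta*v_{t-1} + (1-beta)*x_t
        out.append(v / (1 - beta**t) if bias_correct else v)
    return np.array(out)

print(ewa([1.0, 1.0, 1.0], bias_correct=False))  # starts near 0.1, far below the data
print(ewa([1.0, 1.0, 1.0]))                      # corrected: stays at 1.0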

Backpropagation
In backpropagation, the gradient of the loss is computed with respect to all variables in the function so that the parameters can be updated using gradient descent.

- The first is not correct: the number of paths would grow exponentially
- The second one is correct
Adam optimization

The first and fourth are correct
Does the runtime of backpropagation depend on the number of samples?
Yes: the loss is a sum over the samples, so we have to take the gradient of each term separately
