Supervised learning and Convolutional Networks Flashcards
What does SGD stand for? Why do we use this approach and not full-batch gradient descent? Mention at least 3 variants of SGD
SGD = stochastic gradient descent. Computing the gradient over the entire training set at every step is too computationally expensive and not necessary; the noise from mini-batches can even help exploration of the loss landscape. Variants:
- SGD with momentum
- RMSprop
- Adam
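A minimal NumPy sketch of the three update rules; the hyperparameter values (lr, beta, eps) are illustrative defaults, not canonical:

```python
import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    # Accumulate a decaying moving average of past gradients.
    v = beta * v + grad
    return w - lr * v, v

def rmsprop(w, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    # Scale each step by a running average of squared gradients.
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Combine momentum (m) with RMSprop-style scaling (v); t starts at 1.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)  # bias correction for the zero-initialized averages
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```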
Mention three strategies for architecture search
- Random search (see the sketch after this list)
- Genetic algorithms
- Reinforcement learning
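A toy random-search sketch for the first strategy; the search space, candidate encoding, and `evaluate` function are all hypothetical placeholders:

```python
import random

# Hypothetical search space: each candidate is a small CNN description.
SEARCH_SPACE = {
    "num_layers": [2, 4, 6],
    "filters": [16, 32, 64],
    "kernel_size": [3, 5],
}

def sample_candidate():
    # Draw one random architecture from the search space.
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def random_search(evaluate, budget=20):
    # evaluate(candidate) -> validation score; assumed to train the model.
    best, best_score = None, float("-inf")
    for _ in range(budget):
        cand = sample_candidate()
        score = evaluate(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```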
Mention two learned neural architecture search networks and explain their difference.
NAS1 & NAS2.
NAS1 fixes the form of every layer (conv2d → batchnorm → relu) and searches over the conv2d parameters and over which earlier layers feed each layer (skip connections).
NAS2 fixes the overall architecture (one template per dataset, e.g. CIFAR-10/ImageNet) and instead learns the internal structure of two reusable cells, a “normal” cell and a “reduction” cell.
Roughly speaking, the two fix and search the inverse of one another: NAS1 fixes the per-layer form and searches the global wiring, while NAS2 fixes the global wiring and searches the per-cell structure.
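A toy sketch of the NAS1-style search space described above, sampling per-layer conv parameters and skip-connection inputs; the parameter ranges are illustrative assumptions:

```python
import random

def sample_nas1_layer(layer_idx):
    # Each layer has a fixed form (conv2d -> batchnorm -> relu);
    # only the conv2d parameters and the skip inputs are searched.
    return {
        "filters": random.choice([24, 36, 48, 64]),
        "kernel": random.choice([1, 3, 5, 7]),
        "stride": random.choice([1, 2]),
        # Any subset of earlier layers may feed this one (skip connections).
        "inputs": [i for i in range(layer_idx) if random.random() < 0.5],
    }

def sample_nas1_architecture(depth=6):
    return [sample_nas1_layer(i) for i in range(depth)]
```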
What is dilation rate?
The spacing between the input elements that a convolution filter samples: a dilation rate of d means the kernel taps are d elements apart in the feature map, enlarging the receptive field without adding parameters.
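A minimal PyTorch sketch; the input shape is illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # (batch, channels, H, W)

conv = nn.Conv2d(1, 1, kernel_size=3, dilation=1)     # ordinary 3x3 conv
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # taps 2 apart: 5x5 receptive field

print(conv(x).shape)     # torch.Size([1, 1, 30, 30])
print(dilated(x).shape)  # torch.Size([1, 1, 28, 28])
```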
What is another word for L2 regularization and why do we use it?
Weight decay (in linear regression: ridge regression). We use it to penalize large weights, which limits model complexity and reduces overfitting.
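A minimal sketch of the penalty term; the regularization strength is illustrative. In PyTorch the same effect comes from the optimizer's `weight_decay` argument:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
mse = nn.MSELoss()
lam = 1e-4  # illustrative regularization strength

def l2_regularized_loss(pred, target):
    # L2 penalty: lambda * sum of squared weights, added to the data loss.
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    return mse(pred, target) + lam * penalty

# Equivalent in practice:
# torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```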
Mention ways to augment text and image data
Image: mirroring (horizontal flip), random crop, scaling, aspect-ratio changes, lighting/color jitter
Text: synonym replacement, back-translation (e.g. via Google Translate)
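A minimal sketch of the image-side augmentations with torchvision; the parameter values are illustrative:

```python
from torchvision import transforms

# Compose the augmentations mentioned above.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # mirroring
    transforms.RandomResizedCrop(224,                  # random crop + scale
                                 scale=(0.8, 1.0),
                                 ratio=(0.75, 1.33)),  # aspect ratio
    transforms.ColorJitter(brightness=0.4,             # lighting
                           contrast=0.4,
                           saturation=0.4),
    transforms.ToTensor(),
])
```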