Deep Learning Flashcards
data for convolutional networks
grid-like topology (1D time series and 2D images)
distinguishing feature of convolutional networks
CNNs use convolution in place of general matrix multiplication in at least one layer
convolution function
integral of the product of two functions (after one is reversed and shifted)
(f * g)(t) = ∫ f(a)g(t-a) da
think of f as a measurement and g as a weighting function that emphasizes the most recent measurements
parts of convolution
main function: input, an n-dimensional array of data
weighting function: kernel, an n-dimensional array of learnable parameters
output: feature map
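A minimal sketch, assuming NumPy, of these pieces in the discrete 1D case: the input array plays the role of the measurement f, the kernel is the weighting function g, and the output is the feature map. The numbers are illustrative only.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # input: measurements over time
w = np.array([0.5, 0.3, 0.2])                 # kernel: w[0] weights the most recent value most

# np.convolve flips the kernel, matching (f * g)(t) = sum_a f(a) g(t - a) in discrete form
feature_map = np.convolve(x, w, mode="valid")
print(feature_map)  # approx. [1.3, 2.3, 3.3, 4.3]: each entry is a weighted sum of 3 neighboring inputs
```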
computational features of convolutional networks
- sparse interactions - kernel is usually much smaller than the input
- tied weights - the same set of kernel weights is applied at every position of the input
- equivariant to translation - if the input is translated, the convolution output is translated by the same amount. An event detector run over a time series will find the same event if it is shifted in time. (See the sketch below.)
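A minimal sketch, assuming NumPy, of translation equivariance: convolving a shifted copy of a signal produces the same feature map, shifted by the same amount. The edge-detector kernel and signal values are made up for illustration.

```python
import numpy as np

kernel = np.array([1.0, -1.0])                    # a tiny "edge detector"
signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
shifted = np.roll(signal, 2)                      # translate the input by 2 steps

out_original = np.convolve(signal, kernel, mode="valid")
out_shifted = np.convolve(shifted, kernel, mode="valid")

print(out_original)  # the rising edge is detected at one position...
print(out_shifted)   # ...and the same detection appears 2 positions later for the shifted input
```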
stacked convolutional layers
receptive fields of deeper units are larger (but indirect) compared to the receptive fields of shallower units
if layer 2 has a kernel width of 3, then each of its hidden units receives input from 3 input units.
if layer 3 also has a kernel width of 3, then each of its hidden units indirectly sees 5 inputs with stride-1 convolutions (or up to 9 if the layer-2 windows don't overlap, e.g., with stride 3); see the sketch below.
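A minimal sketch of the receptive-field arithmetic described above; receptive_field is a hypothetical helper, not a library function.

```python
def receptive_field(kernel_widths, strides=None):
    """Receptive field (in input units) of one unit at the top of a stack of convolutions."""
    strides = strides or [1] * len(kernel_widths)
    rf, jump = 1, 1
    for k, s in zip(kernel_widths, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input positions
        jump *= s             # strides compound for the layers above
    return rf

print(receptive_field([3, 3]))                   # 5: two stacked width-3 convs, stride 1
print(receptive_field([3, 3], strides=[3, 3]))   # 9: non-overlapping width-3 windows
```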
stages of a convolutional layer
- Convolution stage: several convolutions in parallel produce a set of linear activations
- Detector stage: Nonlinear function on linear activations
- Pooling stage: Replace output at some location with a summary statistic of nearby units
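A minimal sketch, assuming PyTorch, of the three stages chained together; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution stage: a set of linear activations
    nn.ReLU(),                                  # detector stage: elementwise nonlinearity
    nn.MaxPool2d(2),                            # pooling stage: max over 2x2 neighborhoods
)

x = torch.randn(1, 1, 28, 28)  # (batch, channels, height, width)
print(layer(x).shape)          # torch.Size([1, 8, 14, 14])
```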
pooling and translation
small changes in location won’t make big changes to the summary statistics in the regions that are pooled together
pooling makes network invariant to small translations
what convolution hard codes
the topology of the input (which inputs are neighbors on the grid)
(non-convolutional models would have to discover this topology during learning)
local connection (as opposed to convolution)
like a convolution with a kernel width (patch size) of n, except with no parameter sharing.
each unit has a receptive field of n, but the incoming weights don't have to be the same in every receptive field.
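A rough parameter-count comparison (illustrative numbers only) of a convolutional layer, a locally connected layer, and a fully connected layer mapping n inputs to n outputs:

```python
n, k = 1000, 3  # n inputs -> n outputs, kernel/patch width k (ignoring biases and edge effects)

convolutional = k            # one shared kernel for every position
locally_connected = n * k    # a separate k-wide weight vector per output unit
fully_connected = n * n      # every output connected to every input

print(convolutional, locally_connected, fully_connected)  # 3 3000 1000000
```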
iterated pixel labelling
suppose a convolution step provides a label for each pixel. repeatedly applying the convolution to the labels creates a recurrent convolutional network.
repeated convolutional layers with shared weights across layers is a kind of recurrent network.
why convolutional networks can handle different input sizes
the same kernel is simply applied more or fewer times, so convolution works on any input size; to get a fixed-size output (e.g., for a classifier), pool over regions whose size scales with the input size (see the sketch below).
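A minimal sketch, assuming PyTorch, of this idea: the same convolution kernel runs over inputs of different sizes, and an adaptive pooling layer (regions scale with input size) yields a fixed-size output.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # same kernel, any input size
pool = nn.AdaptiveAvgPool2d((4, 4))                # pooling regions scale with the input size

for h, w in [(32, 32), (64, 48)]:
    x = torch.randn(1, 3, h, w)
    print(pool(conv(x)).shape)  # torch.Size([1, 16, 4, 4]) for both input sizes
```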
convolutions for 2D audio (spectrograms)
convolution over the time axis: invariant to shifts in time
convolution over the frequency axis: invariant to shifts in frequency (e.g., the same sound at a different pitch)
primary visual cortex
- V1 has a 2D structure matching the 2D structure of the retinal image
- Simple cells inspired detectors in CNNs and respond to features in small localized receptive fields
- Complex cells inspired pooling units. They also respond to features but are invariant to small changes in input position.
- Inferotemporal cortex responds like the last layers of a CNN
differences between human vision and convolutional networks
- Human vision is low resolution outside of the fovea; CNNs process the whole image at full resolution
- Vision integrates with other senses
- Top down processing happens in human system
- Human neurons likely have different activation and pooling functions
regularization
- Modifications to training regime to prevent overfitting
- Trading increased training error for reduced test error
- Trading increased bias for reduced variance
dataset augmentation strategies
- Adding transformations to training input (e.g., translating images a few pixels)
- Adding random noise to input data
- Model has to find solutions that are insensitive to small perturbations of the input
- Not just a narrow local minimum but a flat plateau
- Adversarial training
- Create inputs near real examples that the network will probably misclassify, then train on them with the correct labels
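A minimal sketch, assuming NumPy, of the first two augmentation strategies (small translations plus additive noise); augment and its parameters are illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, max_shift=2, noise_std=0.05):
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, (dy, dx), axis=(0, 1))            # translate a few pixels
    return shifted + rng.normal(0.0, noise_std, image.shape)   # add small random noise

image = np.zeros((28, 28))   # stand-in for a training image
augmented = augment(image)   # a slightly different example that keeps the same label
```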
noise robustness
- Adding noise to hidden units or weights
- Noise on weights captures uncertainty about parameter estimates
- Adding noise to output units
- Assume some fraction of labels are wrong (label smoothing), so the model doesn't overfit to bad training data
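A minimal sketch, assuming NumPy, of noise on the output targets via label smoothing; smooth_labels and the choice of eps are illustrative.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    k = one_hot.shape[-1]                   # number of classes
    return one_hot * (1.0 - eps) + eps / k  # soften the hard 0/1 targets

print(smooth_labels(np.array([0.0, 1.0, 0.0])))  # approx. [0.033, 0.933, 0.033]
```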
semi-supervised training
- Use both labeled P(x,y) and unlabeled P(x) data to estimate P(y|x)
- Want to learn a latent representation
- Have the generative model share representation parameters with the discriminative model
- Like having a prior that the structure of P(x) is connected to the structure of P(y|x)
multi-task learning
- Have the model do different kinds of tasks
- Assume there is a set of factors that account for variance in input and that these factors are shared by different tasks. (Each task uses a subset of these factors.)
- Shared parts of the model should learn good values because they generalize across tasks
- Common architecture:
  - Input layer
  - Shared representation layers
  - Task-specific representation layers
  - Output layers
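A minimal sketch, assuming PyTorch, of that common architecture: a shared trunk feeding two hypothetical task-specific heads; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared representation layers
        self.head_a = nn.Linear(hidden, n_classes)  # task A head (e.g., classification)
        self.head_b = nn.Linear(hidden, 1)          # task B head (e.g., regression)

    def forward(self, x):
        h = self.shared(x)  # factors shared across tasks
        return self.head_a(h), self.head_b(h)

model = MultiTaskNet()
logits_a, output_b = model(torch.randn(8, 64))
```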
early stopping
- As you overfit a model, training error continues to decrease while validation error starts to rise.
- So just stop training at the minimum of the validation error.
- Think of the number of training steps as a hyperparameter.
- Similar to L2 regularization, but instead of training several models to find the best L2 coefficient, we find the optimal number of steps during a single training run.
- Requires holding out extra data for validation.
- Alternatively, remember the number of steps to the minimum and retrain on training+validation data, stopping after that optimal number of steps.
- Restricts the model to a smaller volume of parameter space
- With learning rate R, the parameters can only move about R*n_steps from their initialization (see the sketch below)
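A minimal sketch of an early-stopping loop with a patience counter; train_step and val_error are hypothetical stand-ins for real training and validation code.

```python
def train_step(params):        # hypothetical: one pass over the training data
    return params

def val_error(params, step):   # hypothetical: error on the held-out validation set
    return (step - 30) ** 2    # toy curve that bottoms out at step 30

params, best_err, best_step, patience, waited = {}, float("inf"), 0, 5, 0
for step in range(200):
    params = train_step(params)
    err = val_error(params, step)
    if err < best_err:
        best_err, best_step, waited = err, step, 0  # checkpoint the best parameters here
    else:
        waited += 1
        if waited >= patience:                      # validation error stopped improving
            break

print(best_step)  # 30: the learned "number of training steps" hyperparameter
```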
parameter tying and sharing
- Sharing: Force groups of parameters within a model to be equal
- Tying: Penalize a model's parameters for deviating from the corresponding parameters of another model (a soft constraint, e.g., an L2 penalty on the difference)
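A minimal sketch, assuming NumPy, of a tying penalty as an L2 term on the difference between two models' corresponding weights; tying_penalty and lam are illustrative.

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.01):
    # L2 penalty on the difference between corresponding parameters, added to the loss
    return lam * np.sum((w_a - w_b) ** 2)

print(tying_penalty(np.ones(4), np.zeros(4)))  # 0.04
```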
sparse representations (regularization)
- Penalize the activations of the hidden units (e.g., an L1 penalty on the representation)
- Many of the elements of the (hidden) representation are zero or close to zero.
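A minimal sketch, assuming NumPy, of a sparsity penalty on the hidden activations; the penalty weight is illustrative.

```python
import numpy as np

def sparsity_penalty(hidden_activations, lam=0.001):
    return lam * np.sum(np.abs(hidden_activations))  # L1 term added to the training loss

h = np.array([0.0, 2.5, 0.0, 0.1, 0.0])  # a sparse representation: mostly zeros
print(sparsity_penalty(h))               # 0.0026
```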
bagging
- Bootstrap aggregating
- Bootstrap sample k new training datasets and train k new models
- On average about two thirds (≈63%) of the original examples appear in each resampled dataset
- Neural network ensembles typically use 5-10 models
- Training many more large networks quickly becomes impractical
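A minimal sketch, assuming NumPy, of bootstrap sampling for bagging, showing that roughly 63% of the original examples appear in each resample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5  # training set size, number of ensemble members

for _ in range(k):
    idx = rng.integers(0, n, size=n)          # sample n examples with replacement
    print(round(len(np.unique(idx)) / n, 2))  # ~0.63 of the examples appear in each resample
```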