Autoencoders Flashcards
What are the variants of autoencoders?
– Plain AE (a linear AE is closely related to PCA) – Denoising – Sparse – Variational (VAE) – Convolutional, recurrent
Are autoencoders data-specific? What does that mean?
Yes. For example, an autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees, because the features it learns would be face-specific.
What are the 3 most important components of an autoencoder?
An encoding function, a decoding function, and a loss function.
The loss function measures the distance between the original data and its reconstruction from the compressed representation, i.e. how much information is lost (hence “loss” function).
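A minimal sketch of these three components (PyTorch, not part of the original card); the 784-dimensional input, 32-dimensional code, and MSE loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: an encoding function, a decoding function, and a loss."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoding function: compress the input into a small code
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoding function: reconstruct the input from the code
        self.decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)                          # dummy batch of flattened images in [0, 1]
reconstruction = model(x)
# Loss function: distance between the input and its reconstruction
loss = nn.functional.mse_loss(reconstruction, x)
loss.backward()
```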
Give a few applications of an autoencoder
It can be used for data compression, dimensionality reduction,
visualization, anomaly detection, and data generation(!)
How can an autoencoder be used for dimensionality reduction?
For 2D visualization specifically, t-SNE (pronounced “tee-snee”) is probably the best algorithm around, but it typically requires relatively low-dimensional data.
So a good strategy for visualizing similarity relationships in high-dimensional data is to start by using an autoencoder to compress your data into a low-dimensional space (e.g. 32-dimensional), then use t-SNE for mapping the compressed data to a 2D plane.
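A rough sketch of this two-step pipeline, assuming PyTorch and scikit-learn; the untrained 784→32 encoder here stands in for the trained encoder of an autoencoder:

```python
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

# In practice `encoder` would be the trained encoder part of an autoencoder;
# an untrained one is used here so the snippet runs on its own.
encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
data = torch.rand(500, 784)                                # placeholder high-dimensional data

with torch.no_grad():
    codes = encoder(data).numpy()                          # step 1: compress to 32 dimensions
embedding_2d = TSNE(n_components=2).fit_transform(codes)   # step 2: map 32-D codes to a 2-D plane
print(embedding_2d.shape)                                  # (500, 2), ready for a scatter plot
```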
What’s the analog for decoder and encoder in bigger networks?
The same structure carries over to bigger networks, with the encoder and decoder parts being convolutional or recurrent networks.
How does a denoising autoencoder work?
Given a set of “clean” images, add noise to them and
train an autoencoder on:
– Input: a noisy image
– Output: the corresponding clean image
• In this way the network “learns” what is essential in
an image and what is not!
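An illustrative training loop for this setup (PyTorch assumed); the tiny model, the Gaussian noise level of 0.3, and the step count are placeholder choices:

```python
import torch
import torch.nn as nn

# Illustrative denoising setup: `model` is any autoencoder; `clean` is a batch of
# clean images flattened to (batch, 784) with values in [0, 1].
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(64, 784)

for _ in range(100):                                             # illustrative number of steps
    noisy = (clean + 0.3 * torch.randn_like(clean)).clamp(0, 1)  # input: a noisy image
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy), clean)           # target: the clean image
    loss.backward()
    optimizer.step()
```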
What’s the key idea behind a sparse autoencoder?
The bottleneck layer computes some “essential features”
that are needed for input reconstruction.
• Usually, only a few features are really important;
the rest represent noise.
• => impose some constraints on the bottleneck layer:
e.g., “on average, activations are close to 0”.
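One possible way to impose such a constraint is an L1 activity penalty on the bottleneck activations, sketched below (PyTorch assumed; the 64-unit bottleneck and the penalty weight are illustrative, and a KL-based sparsity penalty is a common alternative):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
sparsity_weight = 1e-3                       # illustrative coefficient

x = torch.rand(32, 784)
code = encoder(x)                            # bottleneck activations
reconstruction = decoder(code)
recon_loss = nn.functional.mse_loss(reconstruction, x)
sparsity_penalty = code.abs().mean()         # pushes activations toward 0 on average
loss = recon_loss + sparsity_weight * sparsity_penalty
loss.backward()
```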
What is the Kullback-Leibler divergence?
• => Kullback-Leibler divergence: measures how different one probability
distribution is from another (a non-symmetric “distance” between distributions);
used as a part of the loss function.
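A small numeric illustration of the discrete form KL(P ‖ Q) = Σ P(i) log(P(i)/Q(i)); the distributions here are made up:

```python
import torch

# KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete distributions
p = torch.tensor([0.1, 0.4, 0.5])
q = torch.tensor([0.3, 0.3, 0.4])
kl_pq = torch.sum(p * torch.log(p / q))
print(kl_pq.item())                 # >= 0; equals 0 only when p == q

# KL divergence is not symmetric, so it is not a true distance metric
kl_qp = torch.sum(q * torch.log(q / p))
print(kl_qp.item())                 # generally differs from kl_pq
```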
What’s the key idea behind variational autoencoders?
The bottleneck layer represents a latent space:
vectors of features that are needed for reconstruction
• “to reconstruct an output we can slightly vary the values
of these essential features”
• Assume that the latent features are “normally distributed”
=> Extend the bottleneck layer to include “means” and “std’s”.
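A sketch of such a bottleneck with the usual reparameterization trick and the closed-form KL term toward a standard normal (PyTorch assumed; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Bottleneck that outputs means and (log-)variances for the latent features."""
    def __init__(self, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, latent_dim)        # "means"
        self.log_var = nn.Linear(hidden_dim, latent_dim)   # log-variances ("std's")

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)               # reparameterization: sample latent features
        # KL term pushes the latent distribution toward a standard normal
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
        return z, kl

bottleneck = VAEBottleneck()
h = torch.randn(8, 128)     # stand-in for features produced by an encoder
z, kl = bottleneck(h)       # z feeds the decoder; kl is added to the reconstruction loss
```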
What’s a disadvantage of discriminative models?
Discriminative models have several key limitations
• Can’t model P(X), i.e. the probability of seeing a certain image
• Thus, can’t sample from P(X), i.e. can’t generate new images
What’s a key advantage of generative models over discriminative ones?
Generative models (in general):
• Can model P(X), the probability of seeing a certain image
• Can generate new images
How do GANs work?
- Generator: generates fake samples and tries to fool the Discriminator
- Discriminator: tries to distinguish between real and fake samples
- Train them against each other
- Repeat this and we get a better Generator and a better Discriminator
The Discriminator is trying to maximize its reward and the Generator is trying to minimize the Discriminator’s reward (i.e. maximize the Discriminator’s loss)
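A compact sketch of this adversarial training loop (PyTorch assumed; the MLP Generator/Discriminator, learning rates, and placeholder “real” batch are illustrative):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, data_dim) * 2 - 1              # placeholder "real" batch in [-1, 1]
for _ in range(100):                                 # illustrative number of steps
    # Discriminator step: real -> label 1, fake -> label 0
    fake = generator(torch.randn(64, latent_dim)).detach()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the Discriminator into predicting "real"
    fake = generator(torch.randn(64, latent_dim))
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```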
What are the key ideas behind Deep Convolutional GANs?
Key ideas:
- Replace FC hidden layers with Convolutions
- Generator: Fractional-Strided convolutions
- Use Batch Normalization after each layer
- Inside the Generator:
  - Use ReLU for hidden layers
  - Use Tanh for the output layer
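A toy DCGAN-style generator illustrating these choices (PyTorch assumed; channel counts and the 16x16 output size are arbitrary):

```python
import torch
import torch.nn as nn

# DCGAN-style generator: fractional-strided (transposed) convolutions,
# BatchNorm after each hidden layer, ReLU inside, Tanh at the output.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 4x4 -> 8x8
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),     # 8x8 -> 16x16
    nn.Tanh(),
)

z = torch.randn(16, 100, 1, 1)     # random noise reshaped as a 1x1 "image"
images = generator(z)              # shape: (16, 1, 16, 16)
```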
Why should we use GANs?
• Sampling (or generation) is straightforward.
• Training doesn’t involve Maximum Likelihood estimation.
• Robust to Overfitting since Generator never sees the training data.
• Empirically, GANs are good at capturing the modes of the distribution.
What are some problems with GANs?
- Probability Distribution is Implicit
  - Not straightforward to compute P(X).
  - Thus Vanilla GANs are only good for Sampling/Generation.
- Training is Hard
  - Non-Convergence
  - Mode-Collapse
How do GANs work?
• GANs are generative models that are implemented using two
stochastic neural network modules: Generator and Discriminator.
• Generator tries to generate samples from random noise as input
• Discriminator tries to distinguish the Generator’s samples from
samples of the real data distribution.
• Both networks are trained adversarially (in tandem): the Generator tries to
fool the Discriminator, which in turn tries not to be fooled. In this process,
both models become better at their respective tasks.
How does a Laplacian Pyramid of Adversarial Networks (LAPGAN) work?
Generate high-resolution images by using a hierarchical system of GANs – iteratively increase image resolution and quality.
• A top-level generator G generates the base (lowest-resolution) image I from random noise input z.
• The remaining generators G_k iteratively generate a difference image h_k conditioned on the previous, smaller image l_k.
• This difference image is added to an up-scaled version of the previous, smaller image.
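A sketch of one refinement step of the pyramid (PyTorch assumed; `generate_residual` is a hypothetical stand-in for a conditional generator G_k):

```python
import torch
import torch.nn.functional as F

# One LAPGAN refinement step (sketch): up-scale the previous, smaller image
# and add a generated difference (residual) image at the larger resolution.
def generate_residual(upscaled, noise):
    # Placeholder for a conditional generator G_k(noise, upscaled)
    return torch.zeros_like(upscaled)

small = torch.rand(1, 3, 16, 16)                            # previous-level image l_k
upscaled = F.interpolate(small, scale_factor=2, mode="bilinear", align_corners=False)
noise = torch.randn(1, 1, 32, 32)
h = generate_residual(upscaled, noise)                      # difference image h_k
larger = upscaled + h                                       # refined 32x32 image
```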
Example of a Coupled GAN (CoGAN)
Learning a joint distribution of multi-domain images, e.g. the same face with different features: hair color, eyes, etc.
• Uses GANs to learn the joint distribution with samples drawn only from
the marginal distributions.
• Direct applications in domain adaptation and image translation.
What are some advanced GAN extensions?
- Coupled GAN
- LAPGAN – Laplacian Pyramid of Adversarial Networks
- Adversarially Learned Inference
How do conditional GANs work?
• Differentiating feature: uses an identity-preservation optimization with an
auxiliary network to get a better approximation of the latent code (z*) for an
input image.
• The latent code is then conditioned on a discrete (one-hot) embedding of age
categories.
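A sketch of the conditioning step only, i.e. concatenating a latent code with a one-hot age embedding before a conditional generator (PyTorch assumed; the dimensions, `z_star`, and the toy generator are illustrative):

```python
import torch
import torch.nn as nn

# Conditioning a latent code on a discrete (one-hot) age category.
latent_dim, num_age_groups = 50, 6
z_star = torch.randn(8, latent_dim)                          # approximated latent codes z*
ages = torch.randint(0, num_age_groups, (8,))                # target age category per image
age_one_hot = nn.functional.one_hot(ages, num_age_groups).float()
conditioned = torch.cat([z_star, age_one_hot], dim=1)        # condition z* on the age embedding

generator = nn.Sequential(nn.Linear(latent_dim + num_age_groups, 784), nn.Tanh())
aged_faces = generator(conditioned)                          # generated (flattened) images
```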
What are 2 problems with GANs and how do we solve them?
Problems:
- Non-Convergence
- Mode-Collapse
Solutions:
- Mini-Batch GANs
- Supervision with labels
Both deep learning in general and GANs use SGD; what’s the difference?
DL: SGD has convergence guarantees (under certain conditions). Problem: With non-convexity, we might converge to local optima.
GANs: SGD was not designed to find the Nash equilibrium of a game. Problem: We might not converge to the Nash equilibrium at all.