Quiz 5 Flashcards
MDP: State
The possible scenarios of the world the agent can be in.
MDP: Actions
Set of actions the agent can take based on its state
MDP: Environment
- Environment produces a state which the agent can perceive
- Gives rewards to agent for actions it takes
- Environment may be unknown, non-linear, stochastic and complex
Dynamic programming methods for solving MDPs
Bellman Optimality Equation - Update value matrix at each iteration by applying the Bellman equation until convergence.
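A minimal value-iteration sketch, assuming a tabular model where P[s][a] is a list of (prob, next_state, reward) tuples (names are illustrative):

```python
import numpy as np

def value_iteration(P, num_states, num_actions, gamma=0.99, tol=1e-6):
    """Tabular value iteration: apply the Bellman optimality backup until convergence.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    """
    V = np.zeros(num_states)
    while True:
        V_new = np.zeros(num_states)
        for s in range(num_states):
            # Bellman optimality: take the best action's expected return
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(num_actions)
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```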
RL: Why is setting the data-gathering policy to be the same as the greedy training policy a bad idea?
- A greedy policy has little incentive to explore less-rewarding states that may lead to higher reward later
- Breaks the IID assumption (consecutive samples are highly correlated)
State value function (V-function)
“Expected discounted sum of rewards from state s”
State-action value function (Q-value)
“Expected cumulative reward upon taking action a in state s”
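In symbols, with discount factor γ:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s \right]
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a \right]
```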
RL: 4 challenges of RL
- Evaluative feedback - need trial and error to find the right action
- Delayed feedback - actions may not lead to immediate reward
- Non-stationary - Data distribution of visited states changes when policy changes
- Fleeting nature of time and online data
RL: Components of DQN
- Experience replay
- Epsilon greedy search
- Q-update
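A minimal epsilon-greedy selection sketch (epsilon and q_values are illustrative names):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (explore), otherwise the best-valued one (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```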
MDP: Model
The transition function: given a state s and an action a, the probability that the agent ends up in a new state s′.
MDP: Policy
A mapping from each state to an action. RL attempts to find the optimal policy, which maximizes the expected reward.
MDP: Markovian property
Only the present matters: the next state depends only on the current state and action, not on the history.
Bellman’s Equation
The true utility of a state is its immediate reward plus all discounted future rewards (utility)
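As an equation (U = utility/value, R = reward, T = transition model, γ = discount factor):

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
```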
Difference between value iteration and policy iteration
VI: Finds optimal value functions + policy extraction (just once)
PI: Policy evaluation + policy improvement (repeated)
Experience replay
Agent keeps memory bank that stores past experience. Instead of using immediate experience, sample from memory buffer.
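A minimal replay-buffer sketch (capacity and batch_size are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and samples random minibatches from them."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```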
REINFORCE (policy gradient)
- Define parameterized policy
- Generate trajectories by running the policy, collecting states, actions, and rewards
- Compute objective function (expected sum of rewards over all time steps)
- Compute the gradient of the objective with respect to the policy parameters
- Update policy params
- Repeat until convergence
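A minimal REINFORCE update sketch in PyTorch, assuming a policy_net that outputs action logits and precomputed discounted returns (all names illustrative):

```python
import torch

def reinforce_update(policy_net, optimizer, states, actions, returns):
    """One REINFORCE step: raise the log-probability of actions in proportion to their returns.

    states: (T, state_dim), actions: (T,) long tensor, returns: (T,) discounted returns.
    """
    logits = policy_net(states)                        # parameterized policy
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    loss = -(log_probs * returns).sum()                # negate: optimizer minimizes, we want ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```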
Drawbacks of policy gradients
Coarse rewards: the whole trajectory's return is applied to every action, so credit can't be assigned to the subset of actions that were actually good or bad.
How does experience replay solve problem of correlated data
By randomly sampling from the replay buffer, the training data becomes less correlated. This helps to stabilize and accelerate the learning process.
Diff between Q-learning and Deep Q-Networks
How Q-values are represented.
Q-learning uses a table indexed by discrete states and actions.
DQN uses NN to approximate Q-values.
VI: Time complexity per iteration
O(|S|^2 |A|): each of the |S| states evaluates |A| actions, and each action sums over |S| possible next states.
VI / Q-learning - How do they differ in how they perform updates?
Q loops over actions as well as states
Why do policy iteration?
The policy often converges faster than the value function does.
Deep Q-learning - What 2 things to do for stability during learning
- Freeze Q_old and update Q_new parameters
- Set Q_old <- Q_new at regular intervals
Loss for Deep Q-learning
MSE Loss
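A minimal sketch of the target and loss, assuming q_new is the online network and q_old the frozen target network (names illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_new, q_old, states, actions, rewards, next_states, dones, gamma=0.99):
    """MSE between Q_new(s, a) and the frozen target r + gamma * max_a' Q_old(s', a')."""
    q_sa = q_new(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                               # no gradients through the frozen target
        target = rewards + gamma * q_old(next_states).max(dim=1).values * (1 - dones)
    return F.mse_loss(q_sa, target)
```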
Dependency of value/policy iteration
Must know transition and reward functions
2 strategies if transition and reward function unknown
- Estimate transition / reward function.
- Estimate Q-values from data (DQNs, etc)
What 2 components of trad RL does policy gradient not require?
- Environment model
- Reward function
Policy gradient: likelihood ratio policy gradient
increases the (log) probability of the trajectories with high reward and decreases the (log) probability of the trajectories with low reward
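In symbols, where R(τ) is the total reward of trajectory τ:

```latex
\nabla_{\theta} J(\theta)
= \mathbb{E}_{\tau \sim \pi_{\theta}}\big[ R(\tau)\, \nabla_{\theta} \log \pi_{\theta}(\tau) \big]
= \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[ R(\tau) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \Big]
```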
Key difference between TD Learning and SARSA
TD: Action in next state can be any action. Update is based on expected value over all possible next actions.
SARSA: Action in the next state is one actually taken in the environment. Update is based on the Q-value of the action actually chosen.
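One common way to write the two updates (here the off-policy TD target uses Q-learning's max; taking an expectation over next actions instead gives Expected SARSA):

```latex
\text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]
\text{SARSA:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma\, Q(s',a') - Q(s,a)\big],\ a' \text{ is the action actually taken}
```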
Define Q-value
An estimate of the expected future reward for taking a given action in a given state.
Define few-shot learning
Build models and feature representations that will generalize or transfer to a new set of categories where we only have 1 to 5 examples per category.
Define semi-supervised learning
Train an initial model on labeled data, use it to predict labels for unlabeled data, and feed the high-confidence predictions back into the training set.
Benefit of doing semi-supervised learning with DL
SSL can be done in one pipeline, end-to-end.
How is SSL done end-to-end in DL (type of data)
Labeled and unlabeled examples are both included in each batch.
How to get loss from unlabeled examples?
Create two augmentations (weak/strong). Use the prediction on the weakly augmented view as a pseudo-label for the strongly augmented view, compute a loss between them, and backpropagate.
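A minimal sketch of this consistency loss (FixMatch-style; weak_aug, strong_aug, and the confidence threshold are assumptions, not the exact course recipe):

```python
import torch
import torch.nn.functional as F

def unlabeled_loss(model, batch, weak_aug, strong_aug, threshold=0.95):
    """Pseudo-label from the weak view, train the strong view to match it (confident samples only)."""
    with torch.no_grad():
        weak_probs = F.softmax(model(weak_aug(batch)), dim=1)
        confidence, pseudo_labels = weak_probs.max(dim=1)
        mask = (confidence >= threshold).float()        # ignore low-confidence pseudo-labels
    strong_logits = model(strong_aug(batch))
    per_sample = F.cross_entropy(strong_logits, pseudo_labels, reduction="none")
    return (per_sample * mask).mean()
```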
Few-shot: What substitute layer is shown to be effective compared to fully connected?
Cosine layer
Few-shot: Define N-Way K-Shot Task
A task with N classes (ways) and only K examples (shots) per class, with K small.
Few-shot: Why does cosine similarity work better than fully connected?
Scale invariant: it only depends on the angle between vectors, while an FC layer can be overly sensitive to feature magnitudes.
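A minimal cosine-classifier sketch to replace the final fully connected layer (the scale/temperature parameter is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Scores each class by the cosine similarity between the feature and a class weight vector."""
    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, features):
        # Normalize features and weights so only the angle matters, not the magnitude
        return self.scale * F.linear(F.normalize(features, dim=1),
                                     F.normalize(self.weight, dim=1))
```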
Define meta-learning
Set up a set of smaller tasks (with train/test data) that prepares the learner for the new task it will see in actual test.
Ways to define meta-learner inspired by trad ML
KNN - Matching networks
Gaussian - Prototypical networks
Gradient descent - Meta-learner LSTM
Ways to define meta-learner inspired by black-box DL
MANN
SNAIL
Define autoencoders
Encoder/decoder architecture that compresses input into low-dimensional embedding then upsamples back to original image.
What’s the point of autoencoders
Embedding can be useful for downstream tasks.
No need for labels.
Examples of autoencoder tasks
- Jigsaw
- Colorise
- Rotation
Meta-learn: Key idea of MAML
Just learn a good parameter initialization (via SGD) that can be adapted quickly to new tasks.
Meta-learn: How MAML works
For each small batch of tasks (train/test), adapt with a few gradient steps (e.g., 4-10), predict, and backprop to update the shared parameters. On a new task, just do normal gradient descent starting from the learned parameters as a "smart" initialization.
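A rough first-order sketch of one MAML-style meta-update (this drops full MAML's second-order terms; task_batch, inner_lr, and inner_steps are illustrative):

```python
import copy
import torch

def fomaml_step(meta_model, meta_optimizer, task_batch, loss_fn, inner_lr=0.01, inner_steps=5):
    """Adapt a copy of the model on each task's train split, then use the adapted
    test-split gradients as the meta-gradient for the shared initialization."""
    meta_optimizer.zero_grad()
    for support_x, support_y, query_x, query_y in task_batch:
        learner = copy.deepcopy(meta_model)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                    # inner-loop adaptation
            inner_opt.zero_grad()
            loss_fn(learner(support_x), support_y).backward()
            inner_opt.step()
        query_loss = loss_fn(learner(query_x), query_y)
        grads = torch.autograd.grad(query_loss, list(learner.parameters()))
        for p, g in zip(meta_model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_optimizer.step()                               # update the shared "smart" initialization
```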
How to generate labels from unlabeled data using clustering
In parallel: a randomly initialized CNN and K-means
1. The CNN predicts labels
2. K-means clusters the features and turns the cluster assignments into pseudo-labels
3. Compute the loss between the two and backprop
Autoencoder: Jigsaw - loss function
cross-entropy
Autoencoder: Rotation
cross-entropy
Autoencoder: Colorization
MSE
Instance discrimination - Inputs and outputs
Input: + / - examples
Output: Model that can discriminate between classes
Instance discrimination: How is loss measured
Contrastive loss
Instance discrimination: Define contrastive loss
Based on the similarity (dot product) between the two augmentations of the positive example, contrasted against its similarity to the negative augmentations.
Instance discrimination: Inputs and outputs
Input:
- Positive example: 2 augmentations
- Negative example: 1 augmentation
Output:
- Contrastive loss
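A minimal contrastive (InfoNCE-style) loss sketch for one positive pair against a set of negatives (the temperature is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Pull the two augmentations of the same image together; push negatives away.

    anchor, positive: (D,) embeddings of the two positive augmentations.
    negatives: (N, D) embeddings of negative augmentations.
    """
    anchor = F.normalize(anchor, dim=0)
    positive = F.normalize(positive, dim=0)
    negatives = F.normalize(negatives, dim=1)
    pos_sim = anchor @ positive / temperature           # similarity to the positive
    neg_sim = negatives @ anchor / temperature          # similarities to the negatives
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])
    # Cross-entropy with the positive at index 0 is the InfoNCE objective
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```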
Instance discrimination: Point of using momentum encoders
Slow down how quickly the encoder changes (its weights are updated as a moving average), which stabilizes learning.
Generative model - key idea
Use maximum likelihood on an unlabeled dataset to learn a model of the data distribution.
3 types of generative models
- Tractable density
- Variational
- Direct
GM: Tractable density
Simplify joint distribution and learn those params
GM: Variational
Learn distributions that approximate the true joint distribution
GM: Direct
Learn to generate samples from data distribution without modeling it.
GAN: 2 types of models
Generator + Discriminator
GAN: loss function (e.g., real vs. fake image classification)
cross-entropy (real or fake?)
GAN: Generator objective
Fool the discriminator: minimize 1 - D(G(z)), i.e., make generated samples get classified as real.
GAN: Discriminator objective
- Maximize D(x): classify real images as real
- Maximize 1 - D(G(z)): avoid classifying fake images as real
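The combined minimax objective (x is real data, z is the generator's noise input):

```latex
\min_{G} \max_{D}\;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```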
GAN: Why does the max-max game work better than min-max?
The generator's objective function doesn't have good gradient properties: log(1 - D(G(z))) saturates when the discriminator confidently rejects samples, so the generator instead maximizes log D(G(z)).
Variational autoencoders (VAE) - what assumption does it require
Gaussian distributions (for the latent prior and the approximate posterior)
VAE - Why can’t we calculate maximum likelihood directly
The likelihood contains an intractable integral over the latent variable z.
VAE: What is the alternative to calculating maximum likelihood
variational lower bound
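The variational (evidence) lower bound that is maximized instead:

```latex
\log p_{\theta}(x) \;\geq\;
\mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
- D_{\mathrm{KL}}\big(q_{\phi}(z \mid x)\,\|\,p(z)\big)
```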
GAN: Example of training failing
Generator learns to memorize and output samples of your training data.
VAE: Output of encoder
Mu and sigma - the parameters of a distribution over the latent variable z
VAE: How is mu/sigma from encoder used
Sample a latent z from that distribution and feed it to the decoder.
VAE: Output of decoder
Mu and sigma of original data’s distribution (X)
VAE: How is mu/sigma from decoder used
Sample from it to generate a reconstruction of the original input X.
VAE: reparameterization trick
Moves the sampling step outside the computation graph, so gradients can flow back through mu and sigma to the encoder.
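A minimal sketch, assuming the encoder outputs mu and the log-variance (a common parameterization):

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps: the randomness lives in eps, so gradients flow back through mu and sigma."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)         # sampling happens outside the differentiable path
    return mu + std * eps
```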