Quiz 5 Flashcards
MDP: State
The possible situations (configurations of the world) the agent can be in.
MDP: Actions
The set of actions the agent can take, given its current state.
MDP: Environment
- Produces the state that the agent perceives
- Gives rewards to the agent for the actions it takes
- May be unknown, non-linear, stochastic, and complex
Dynamic programming methods for solving MDPs
Bellman Optimality Equation - update the value table at each iteration by applying the Bellman equation until convergence (value iteration).
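A minimal sketch of that update loop (illustrative only; it assumes the MDP is given as hypothetical `P[s][a]` lists of `(prob, next_state)` pairs and a reward vector `R`):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s] is the reward in state s."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Bellman optimality update: reward plus best discounted expected next value.
            V_new[s] = R[s] + gamma * max(
                sum(p * V[s2] for p, s2 in P[s][a]) for a in range(len(P[s]))
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```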
RL: Why is setting the data-gathering policy to be the same as the greedy training policy a bad idea?
- A greedy policy has little incentive to explore less immediately rewarding states that may lead to higher long-term reward
- Breaks the IID assumption on the training data
State value function (V-function)
“Expected discounted sum of rewards from state s”
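As a formula (standard notation; γ is the discount factor, r_t the reward at step t, and the expectation is over trajectories generated by the policy π):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s\right]$$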
State-action value function (Q-value)
“Expected cumulative reward upon taking action a in state s”
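In the same notation:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a\right]$$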
RL: 4 challenges of RL
- Evaluative feedback - need trial and error to find the right action
- Delayed feedback - actions may not lead to immediate reward
- Non-stationary - Data distribution of visited states changes when policy changes
- Fleeting nature of time and online data
RL: Components of DQN
- Experience replay
- Epsilon-greedy exploration
- Q-update
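A rough sketch of the epsilon-greedy selection and the Q-update in PyTorch (hedged: the `q_net`/`q_target` networks, the optimizer, and the sampled `batch` tensors are assumed; the replay buffer itself is sketched under the Experience replay card below):

```python
import random
import torch
import torch.nn.functional as F

def select_action(q_net, state, epsilon, n_actions):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def q_update(q_net, q_target, optimizer, batch, gamma=0.99):
    # batch is a tuple of tensors sampled from the replay buffer.
    states, actions, rewards, next_states, dones = batch
    # Current estimate Q(s, a) for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Q-update target: r + gamma * max_a' Q_target(s', a'), with Q_target held fixed.
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_target(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```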
MDP: Model
The transition function: given a state s and an action a, the probability P(s′ | s, a) that the agent ends up in the next state s′.
MDP: Policy
A mapping that gives an action for each state the agent can be in. RL attempts to find the optimal policy, which maximizes the expected cumulative reward.
MDP: Markovian property
Only the present matters: the next state depends only on the current state and action, not on the full history.
Bellman’s Equation
The true utility of a state is its immediate reward plus all discounted future rewards (utility)
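One standard way to write this (assuming reward R(s), discount γ, and transition model P):

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$$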
Difference between value iteration and policy iteration
VI: Finds optimal value functions + policy extraction (just once)
PI: Policy evaluation + policy improvement (repeated)
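For comparison, a compact sketch of the policy-iteration loop, using the same hypothetical `P`/`R` MDP representation as the value-iteration sketch above:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s] is the reward in state s."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: iterate the Bellman expectation equation for the fixed policy.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([
                R[s] + gamma * sum(p * V[s2] for p, s2 in P[s][policy[s]])
                for s in range(n_states)
            ])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to the evaluated values.
        new_policy = np.array([
            int(np.argmax([sum(p * V[s2] for p, s2 in P[s][a]) for a in range(len(P[s]))]))
            for s in range(n_states)
        ])
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```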
Experience replay
The agent keeps a memory bank (replay buffer) that stores past experiences. Instead of learning only from the most recent experience, it samples from the memory buffer.
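A minimal replay-buffer sketch (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```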
REINFORCE (policy gradient)
- Define parameterized policy
- Generate trajectories by running the policy, collecting states, actions, and rewards
- Compute the objective function (expected sum of rewards over all time steps)
- Compute the gradient of the objective with respect to the policy parameters
- Update policy params
- Repeat until convergence
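A rough single-update sketch of these steps in PyTorch (hedged: the `policy` network, the optimizer, and an `env` with the classic Gym-style `reset()`/`step()` interface are assumed, and actions are discrete):

```python
import torch

def reinforce_update(policy, optimizer, env, gamma=0.99):
    # 1. Generate a trajectory by running the current parameterized policy.
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # 2. Compute the discounted return from each time step.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # 3. Objective is E[sum_t log pi(a_t | s_t) * G_t]; minimize its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    # 4. Update the policy parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```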
Drawbacks of policy gradients
Coarse rewards: credit cannot be assigned to the subset of actions within a trajectory that were actually good or bad.
How does experience replay solve problem of correlated data
By randomly sampling from the replay buffer, the training data becomes less correlated. This helps to stabilize and accelerate the learning process.
Difference between Q-learning and Deep Q-Networks
How Q-values are represented.
Q-learning uses a table indexed by discrete states and actions.
DQN uses NN to approximate Q-values.
VI: Time complexity per iteration
O(|S|^2 |A|) - for each of the |S| states, maximize over |A| actions, each requiring a sum over up to |S| successor states.
VI / Q-learning - how do they differ in how they perform updates?
Q-learning's updates loop over actions as well as states (a value is maintained for every state-action pair).
Why do policy iteration?
The policy typically converges in fewer iterations than the value function does.
Deep Q-learning - What 2 things to do for stability during learning
- Freeze Q_old and update Q_new parameters
- Set Q_old <- Q_new at regular intervals
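In PyTorch, that periodic sync is typically a single parameter copy (a minimal sketch; the tiny `nn.Linear` here stands in for a real Q-network):

```python
import copy
import torch.nn as nn

q_new = nn.Linear(4, 2)        # stand-in for the online Q-network
q_old = copy.deepcopy(q_new)   # frozen target network, not updated by the optimizer

# Every C steps, copy the online network's parameters into the frozen target.
q_old.load_state_dict(q_new.state_dict())
```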
Loss for Deep Q-learning
MSE loss between the predicted Q-value and the target r + γ max_a′ Q_old(s′, a′).
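Written out (the standard DQN objective; θ are the parameters of Q_new, and Q_old is held fixed):

$$L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{new}}(s, a; \theta)\big)^{2}\Big]$$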
Dependency of value/policy iteration
Must know transition and reward functions
2 strategies if transition and reward function unknown
- Estimate transition / reward function.
- Estimate Q-values from data (DQNs, etc)
What 2 components of traditional RL does policy gradient not require?
- Environment model
- Reward function
Policy gradient: likelihood ratio policy gradient
Increases the (log) probability of trajectories with high reward and decreases the (log) probability of trajectories with low reward.
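In equation form (the standard likelihood-ratio / REINFORCE gradient, where τ is a trajectory and R(τ) its total reward):

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\left(\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\right) R(\tau)\right]$$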
Key difference between TD Learning and SARSA
TD: The action in the next state can be any action; the update is based on the expected value over all possible next actions.
SARSA: The action in the next state is the one actually taken in the environment; the update is based on the Q-value of that action.
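As update rules (hedged: the "TD" side is written here as the Q-learning update with a max over next actions; if the card's "expected value over all possible next actions" is meant literally, that corresponds to Expected SARSA, which averages under the policy instead of taking the max):

$$\text{Q-learning:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big]$$

$$\text{SARSA:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big]$$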