DL-10 - Reinforcement learning Flashcards
DL-10 - Reinforcement learning
What type of data do you have in supervised learning?
Pairs of (x, y).
DL-10 - Reinforcement learning
What type of data do you have in unsupervised learning?
Only x, no label.
DL-10 - Reinforcement learning
What type of data do you have in reinforcement learning?
State-action pairs, together with the rewards they produce.
DL-10 - Reinforcement learning
What is the goal of supervised learning?
Learning a mapping from x -> y.
DL-10 - Reinforcement learning
What is the goal of unsupervised learning?
Learn an underlying structure in the data.
DL-10 - Reinforcement learning
What is the goal of reinforcement learning?
Maximizing future reward over many time steps.
DL-10 - Reinforcement learning
How do children learn from interactions?
By receiving positive/negative rewards that they learn from. (See image)
DL-10 - Reinforcement learning
What is reinforcement learning about?
Learning in a dynamic environment, where the learner/model can decide which actions to try.
DL-10 - Reinforcement learning
What is a model called in reinforcement learning?
They are typically called agents.
DL-10 - Reinforcement learning
What is the meta-model of reinforcement learning?
- Take actions that affect the environment.
- Observe the changes to the environment.
(See image)
DL-10 - Reinforcement learning
What is an environment?
The dynamic and interactive context in which an agent learns and takes actions.
DL-10 - Reinforcement learning
What is an episode?
A sequence of actions that ends in a terminal state.
DL-10 - Reinforcement learning
What is the formula for total reward?
(See image)
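The image is not shown here, but a standard form consistent with the discounted-reward card below is:

    R_t = \sum_{i=t}^{\infty} r_i

i.e. the sum of all rewards received from time step t onward.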
DL-10 - Reinforcement learning
What formula is this? (See image)
Total reward.
DL-10 - Reinforcement learning
What formula is this? (See image)
Discounted reward.
DL-10 - Reinforcement learning
What is the formula for discounted reward?
(See image)
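The image is not shown here; the usual form, assuming a discount factor 0 < \gamma < 1, is:

    R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i

so rewards further in the future are weighted less.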
DL-10 - Reinforcement learning
How does the agent affect the environment?
Through its actions.
DL-10 - Reinforcement learning
What does the agent observe from the environment? (2)
- State changes
- Rewards
DL-10 - Reinforcement learning
What does the Q-function do?
It captures the expected total future reward for an action a_t taken in a given state s_t.
DL-10 - Reinforcement learning
What is the name of the function that does the following?
"It captures the expected total future reward for an action a_t taken in a given state s_t."
It's named the "Q-function".
DL-10 - Reinforcement learning
What's the formula for the Q-function?
(See image)
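The image is not shown here; the standard definition, in terms of the (discounted) total reward R_t from the earlier cards, is:

    Q(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_t \right]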
DL-10 - Reinforcement learning
What is a policy?
The agent needs a policy, π(s), to infer the best action to take at state s.
DL-10 - Reinforcement learning
What is π(s)?
The policy function that evaluates the state s.
DL-10 - Reinforcement learning
What's the name of the function that evaluates a state s to decide on the best action to take?
It's called the policy function, written π(s).
DL-10 - Reinforcement learning
What is the RL strategy?
The RL strategy is that the policy chooses an action that maximizes future reward.
DL-10 - Reinforcement learning
What is the formula for the RL strategy?
(See image)
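The image is not shown here; the standard statement is that the policy picks the action with the highest Q-value:

    \pi^*(s) = \arg\max_a Q(s, a)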
DL-10 - Reinforcement learning
What formula is this? (See image)
The RL strategy.
DL-10 - Reinforcement learning
What are the major classes of RL algorithms?
- Value learning
- Policy learning
DL-10 - Reinforcement learning
How does value learning work?
(See image)
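The image is not shown here; in outline, value learning first learns the Q-function and then derives the policy from it:

    \text{find } Q(s, a), \quad \text{then act via } a = \arg\max_a Q(s, a)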
DL-10 - Reinforcement learning
What type of RL algorithm is this?
Value learning
DL-10 - Reinforcement learning
How does policy learning work?
(See image)
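The image is not shown here; in outline, policy learning learns the policy directly and acts by sampling from it:

    \text{find } \pi(s), \quad \text{then sample } a \sim \pi(s)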
DL-10 - Reinforcement learning
What type of algorithm is this?
Policy learning.
DL-10 - Reinforcement learning
When would you use value learning?
When your input space is limited.
DL-10 - Reinforcement learning
When is value learning a better choice than policy learning?
Value learning is better when the environment is deterministic and the value function can be easily determined.
DL-10 - Reinforcement learning
When is value learning a bad choice?
Value learning is a bad choice when the state space is too large or continuous.
DL-10 - Reinforcement learning
When is policy learning a better choice than value learning?
Policy learning is better when the optimal policy is easier to find than the optimal value function.
DL-10 - Reinforcement learning
What class of RL algorithm is Q-learning?
Q-learning is a value-based learning algorithm.
DL-10 - Reinforcement learning
What does Q-learning try to do? (I.e. What choices will it make)
Perform the sequence of actions that will eventually lead to the maximum total reward, because it knows the expected rewards of each action at each step.
DL-10 - Reinforcement learning
What is this function? (See image)
The Q-function in Q-learning.
DL-10 - Reinforcement learning
What is the formula for the Q-function in Q-learning?
(See image)
DL-10 - Reinforcement learning
What starting values do we use for Q-values in Q-learning?
Arbitrary assumptions for Q-values, but they will be learned over time.
DL-10 - Reinforcement learning
What is the Bellman equation used for?
It's used to update Q-values in Q-learning.
DL-10 - Reinforcement learning
What is the Bellman equation (formula)?
(See image)
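The image is not shown here; the update rule usually given under this name, with learning rate \alpha and discount factor \gamma, is:

    Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]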
DL-10 - Reinforcement learning
In the Bellman equation, what is alpha?
Learning rate (or step size)
DL-10 - Reinforcement learning
In Q-learning, what is a Q-table?
A mapping between state-action pairs and Q-values.
DL-10 - Reinforcement learning
When is the Q-table updated?
After each step.
DL-10 - Reinforcement learning
When does updating the Q-table end?
When an episode is done.
DL-10 - Reinforcement learning
How is the Q-table initialized?
With zeroes.
DL-10 - Reinforcement learning
What is the Q-table used for?
The Q-table is used as a reference: for a given state, the agent views all possible actions and selects the one with the maximum Q-value.
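A minimal Python sketch of this lookup, assuming a small discrete environment (sizes and names here are illustrative, not from the source):

    import numpy as np

    n_states, n_actions = 16, 4          # hypothetical sizes for a small grid world
    Q = np.zeros((n_states, n_actions))  # Q-table initialized with zeroes

    def greedy_action(state):
        # View all action values for the given state and select the max
        return int(np.argmax(Q[state]))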
DL-10 - Reinforcement learning
What are the modes the agent uses when interacting with the environment? (2)
- Exploration
- Exploitation
DL-10 - Reinforcement learning
What is exploration in RL?
Trying something new. It improves knowledge about each action, hopefully leading to a long-term benefit.
DL-10 - Reinforcement learning
What is exploitation in RL?
Choosing the greedy action to get the most reward, by exploiting the agent's current Q-value estimates.
DL-10 - Reinforcement learning
What is epsilon-greedy action selection?
A way of balancing exploration and exploitation.
DL-10 - Reinforcement learning
What is the formula for epsilon-greedy action selection?
(See image)
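The image is not shown here; the standard rule is:

    a_t = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \varepsilon \end{cases}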
DL-10 - Reinforcement learning
What are the challenges in Q-learning? (2)
- Large memory table, can exceed resources available.
- Unrealistically high time use for exploration, has to explore every state-action pair.
DL-10 - Reinforcement learning
What is DQN short for?
Deep Q-network / Deep Q-learning Network
DL-10 - Reinforcement learning
What is a solution to the problems with Q-learning?
Deep Q-learning using neural networks.
DL-10 - Reinforcement learning
What does a deep Q-network do?
Approximates Q-values with a neural network.
DL-10 - Reinforcement learning
Describe what the architecture of a DQN network looks like.
(See image)
DL-10 - Reinforcement learning
What is the formula for Q-loss?
(See image)
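The image is not shown here; a standard DQN loss, with prediction-network parameters \theta and target-network parameters \theta^-, is the squared error between the target and the predicted Q-value:

    L = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]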
DL-10 - Reinforcement learning
What are some problems with deep Q-learning? (2)
- non-stationary or unstable target
- updates are correlated
DL-10 - Reinforcement learning
What are some solutions to DQN problems? (2)
- Use two networks - prediction and target (see image)
- Experience replay
DL-10 - Reinforcement learning
How are the target/prediction networks trained in DQN?
Parameters are copied from the prediction network to the target network every C iterations.
DL-10 - Reinforcement learning
What is experience replay?
A buffer of past experiences is used to stabilize training, by decorrelating the training examples in each batch used to update the NN.
DL-10 - Reinforcement learning
How is the experience replay buffer created?
(See image)
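The image is not shown here; a minimal Python sketch of such a buffer (capacity and names are illustrative, not from the source):

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size store of (s, a, r, s', done) transitions."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random sampling decorrelates consecutive transitions
            return random.sample(list(self.buffer), batch_size)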
DL-10 - Reinforcement learning
Describe the full schematic of using DQNs.
(See image)
DL-10 - Reinforcement learning
List the DQN steps. (7)
1) At state s, select an action a using an epsilon-greedy policy.
2) Perform the action and move to a new state s'.
3) Store transition in the replay buffer.
4) Sample random batches from replay buffer, calculate the loss.
5) Optimization (e.g. gradient descent) for prediction network.
6) After C iterations, copy prediction network params to target network.
7) Repeat for M episodes. (A code sketch of these steps follows below.)
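A compact Python/PyTorch sketch of steps 1 and 4-6, purely illustrative (network sizes, hyperparameters, and variable names are assumptions, not from the source):

    import copy, random
    import torch
    import torch.nn as nn

    STATE_DIM, N_ACTIONS = 4, 2                 # hypothetical environment dimensions
    GAMMA, EPS, C = 0.99, 0.1, 100              # discount, exploration rate, sync interval

    prediction_net = nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    target_net = copy.deepcopy(prediction_net)  # frozen copy provides a stable target
    optimizer = torch.optim.Adam(prediction_net.parameters(), lr=1e-3)

    def select_action(state):                   # step 1: epsilon-greedy selection
        if random.random() < EPS:
            return random.randrange(N_ACTIONS)
        with torch.no_grad():
            return prediction_net(state).argmax().item()

    def train_step(batch, step):
        # step 4: batch of transitions sampled from the replay buffer
        states, actions, rewards, next_states, dones = batch
        q = prediction_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                   # target computed with the frozen network
            target = rewards + GAMMA * target_net(next_states).max(1).values * (1 - dones)
        loss = nn.functional.mse_loss(q, target)   # the Q-loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                        # step 5: optimize the prediction network
        if step % C == 0:                       # step 6: copy params every C iterations
            target_net.load_state_dict(prediction_net.state_dict())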
DL-10 - Reinforcement learning
What are the downsides of Q-learning? (2)
- Complexity - only tractable for small, discrete action spaces.
- Flexibility - Cannot learn stochastic policies.
DL-10 - Reinforcement learning
What is policy learning?
Directly optimizing the policy π(s).
DL-10 - Reinforcement learning
How do you interpret the output of the policy function pi(s)?
The policy outputs P(a|s): the probability that taking action a in state s will lead to the highest reward. The agent selects its next action by sampling from this distribution.
DL-10 - Reinforcement learning
What is the advantage of policy learning?
It's not constrained to a discrete action space. We can parameterize probability distributions however we like, either discrete or continuous.
DL-10 - Reinforcement learning
What is PG short for?
Policy gradient
DL-10 - Reinforcement learning
What are the outputs of PG (2)?
The mean and the variance of the action distribution (e.g. a Gaussian over a continuous action), as separate outputs.
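A minimal PyTorch sketch of a policy network with separate mean and variance heads; here the variance comes from a log-std head (a common trick to keep it positive), and all names and sizes are illustrative:

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, state_dim=4, action_dim=1):  # hypothetical dimensions
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
            self.mean_head = nn.Linear(64, action_dim)     # outputs the mean
            self.log_std_head = nn.Linear(64, action_dim)  # outputs log std (so variance > 0)

        def forward(self, state):
            h = self.body(state)
            dist = torch.distributions.Normal(self.mean_head(h), self.log_std_head(h).exp())
            action = dist.sample()                         # a ~ pi(a|s)
            return action, dist.log_prob(action)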
DL-10 - Reinforcement learning
What are some limitations of using RL in the real world?
You cannot run a lot of exploratory policies in real life.
E.g. a car colliding near people is unacceptable.
DL-10 - Reinforcement learning
How do we get around the limitations of using RL in the real world?
Simulate the environment virtually before deploying to the real world.
DL-10 - Reinforcement learning
What are some problems with RL simulators?
Many simulators are not realistic enough to facilitate transfer from the virtual world to the real world.