DL-10 - Reinforcement learning Flashcards
DL-10 - Reinforcement learning
What type of data do you have in supervised learning?
Pairs of (x, y).
DL-10 - Reinforcement learning
What type of data do you have in unsupervised learning?
Only x, no label.
DL-10 - Reinforcement learning
What type of data do you have in reinforcement learning?
State-action pairs, together with the rewards they produce.
DL-10 - Reinforcement learning
What is the goal of supervised learning?
Learning a mapping from x -> y.
DL-10 - Reinforcement learning
What is the goal of unsupervised learning?
Learn an underlying structure in the data.
DL-10 - Reinforcement learning
What is the goal of reinforcement learning?
Maximizing future reward over many time steps.
DL-10 - Reinforcement learning
How do children learn from interactions?
By receiving positive/negative rewards that they learn from. (See image)
DL-10 - Reinforcement learning
What is reinforcement learning about?
Learning in a dynamic environment, where the learner/model can decide which actions to try.
DL-10 - Reinforcement learning
What is a model called in reinforcement learning?
They are typically called agents.
DL-10 - Reinforcement learning
What is the meta-model of reinforcement learning?
- Take actions that affect the environment.
- Observe the changes to the environment.
(See image)
DL-10 - Reinforcement learning
What is an environment?
The dynamic and interactive context in which an agent learns and takes actions.
DL-10 - Reinforcement learning
What is an episode?
A sequence of actions that ends in a terminal state.
DL-10 - Reinforcement learning
What is the formula for total reward?
(See image)
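The image is not shown here, but a standard form consistent with the discounted-reward card below is:

    R_t = \sum_{i=t}^{\infty} r_i

i.e. the sum of all rewards received from time step t onward.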
DL-10 - Reinforcement learning
What formula is this? (See image)
Total reward.
DL-10 - Reinforcement learning
What formula is this? (See image)
Discounted reward.
DL-10 - Reinforcement learning
What is the formula for discounted reward?
(See image)
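The image is not shown here; the usual form, assuming a discount factor 0 < \gamma < 1, is:

    R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i

so rewards further in the future are weighted less.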
DL-10 - Reinforcement learning
How does the agent affect the environment?
Through its actions.
DL-10 - Reinforcement learning
What does the agent observe from the environment? (2)
- State changes
- Rewards
DL-10 - Reinforcement learning
What does the Q-function do?
It captures the expected total future reward for an action a_t taken in a given state s_t.
DL-10 - Reinforcement learning
What is the name of the function that does the following?
"It captures the expected total future reward for an action a_t taken in a given state s_t."
It's named the "Q-function".
DL-10 - Reinforcement learning
What's the formula for the Q-function?
(See image)
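The image is not shown here; the standard definition, in terms of the (discounted) total reward R_t from the earlier cards, is:

    Q(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_t \right]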
DL-10 - Reinforcement learning
What is a policy?
The agent needs a policy, π(s), to infer the best action to take at state s.
DL-10 - Reinforcement learning
What is π(s)?
The policy function that evaluates the state s.
DL-10 - Reinforcement learning
What's the name of the function that evaluates a state s to decide on the best action to take?
It's called the policy function, written π(s).
DL-10 - Reinforcement learning
What is the RL strategy?
The RL strategy is that the policy chooses an action that maximizes future reward.
DL-10 - Reinforcement learning
What is the formula for the RL strategy?
(See image)
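The image is not shown here; the standard statement is that the policy picks the action with the highest Q-value:

    \pi^*(s) = \arg\max_a Q(s, a)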
DL-10 - Reinforcement learning
What formula is this? (See image)
The RL strategy.
DL-10 - Reinforcement learning
What are the major classes of RL algorithms?
- Value learning
- Policy learning
DL-10 - Reinforcement learning
How does value learning work?
(See image)
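The image is not shown here; in outline, value learning first learns the Q-function and then derives the policy from it:

    \text{find } Q(s, a), \quad \text{then act via } a = \arg\max_a Q(s, a)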
DL-10 - Reinforcement learning
What type of RL algorithm is this?
Value learning
DL-10 - Reinforcement learning
How does policy learning work?
(See image)
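The image is not shown here; in outline, policy learning learns the policy directly and acts by sampling from it:

    \text{find } \pi(s), \quad \text{then sample } a \sim \pi(s)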
DL-10 - Reinforcement learning
What type of algorithm is this?
Policy learning.
DL-10 - Reinforcement learning
When would you use value learning?
When your input space is limited.
DL-10 - Reinforcement learning
When is value learning a better choice than policy learning?
Value learning is better when the environment is deterministic and the value function can be easily determined.
DL-10 - Reinforcement learning
When is value learning a bad choice?
Value learning is a bad choice when the state space is too large or continuous.
DL-10 - Reinforcement learning
When is policy learning a better choice than value learning?
Policy learning is better when the optimal policy is easier to find than the optimal value function.
DL-10 - Reinforcement learning
What class of RL algorithm is Q-learning?
Q-learning is a value-based learning algorithm.
DL-10 - Reinforcement learning
What does Q-learning try to do? (I.e. What choices will it make)
Perform the sequence of actions that will eventually lead to the maximum total reward, because it knows the expected rewards of each action at each step.
DL-10 - Reinforcement learning
What is this function? (See image)
The Q-function in Q-learning.
DL-10 - Reinforcement learning
What is the formula for the Q-function in Q-learning?
(See image)
DL-10 - Reinforcement learning
What starting values do we use for Q-values in Q-learning?
Arbitrary assumptions for Q-values, but they will be learned over time.
DL-10 - Reinforcement learning
What is the Bellman equation used for?
It's used to update Q-values in Q-learning.
DL-10 - Reinforcement learning
What is the Bellman equation (formula)?
(See image)
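The image is not shown here; the update rule usually given under this name, with learning rate \alpha and discount factor \gamma, is:

    Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]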
DL-10 - Reinforcement learning
In the Bellman equation, what is alpha?
Learning rate (or step size)
DL-10 - Reinforcement learning
In Q-learning, what is a Q-table?
A mapping between state-action pairs and Q-values.
DL-10 - Reinforcement learning
When is the Q-table updated?
After each step.
DL-10 - Reinforcement learning
When does updating the Q-table end?
When an episode is done.
DL-10 - Reinforcement learning
How is the Q-table initialized?
With zeroes.
DL-10 - Reinforcement learning
What is the Q-table used for?
The Q-table is used as a reference: for a given state, the agent views all possible actions and selects the one with the maximum Q-value.
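A minimal Python sketch of this lookup, assuming a small discrete environment (sizes and names here are illustrative, not from the source):

    import numpy as np

    n_states, n_actions = 16, 4          # hypothetical sizes for a small grid world
    Q = np.zeros((n_states, n_actions))  # Q-table initialized with zeroes

    def greedy_action(state):
        # View all action values for the given state and select the max
        return int(np.argmax(Q[state]))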
DL-10 - Reinforcement learning
What are the modes the agent uses when interacting with the environment? (2)
- Exploration
- Exploitation
DL-10 - Reinforcement learning
What is exploration in RL?
Trying something new. It improves knowledge about each action, hopefully leading to a long-term benefit.
DL-10 - Reinforcement learning
What is exploitation in RL?
Choosing the greedy action to get the most reward, by exploiting the agent's current Q-value estimates.
DL-10 - Reinforcement learning
What is epsilon-greedy action selection?
A way of balancing exploration and exploitation.
DL-10 - Reinforcement learning
What is the formula for epsilon-greedy action selection?
(See image)
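The image is not shown here; the standard rule is:

    a_t = \begin{cases} \text{random action} & \text{with probability } \varepsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \varepsilon \end{cases}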
DL-10 - Reinforcement learning
What are the challenges in Q-learning? (2)
- Large memory table, can exceed resources available.
- Unrealistically high time use for exploration, has to explore every state-action pair.
DL-10 - Reinforcement learning
What is DQN short for?
Deep Q-network / Deep Q-learning Network
DL-10 - Reinforcement learning
What is a solution to the problems with Q-learning?
Deep Q-learning using neural networks.
DL-10 - Reinforcement learning
What does a deep Q-network do?
Approximates Q-values with a neural network.
DL-10 - Reinforcement learning
Describe what the architecture of a DQN network looks like.
(See image)
DL-10 - Reinforcement learning
What is the formula for Q-loss?
(See image)
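The image is not shown here; a standard DQN loss, with prediction-network parameters \theta and target-network parameters \theta^-, is the squared error between the target and the predicted Q-value:

    L = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]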
DL-10 - Reinforcement learning
What are some problems with deep Q-learning? (2)
- non-stationary or unstable target
- updates are correlated
DL-10 - Reinforcement learning
What are some solutions to DQN problems? (2)
- Use two networks - prediction and target (see image)
- Experience replay
DL-10 - Reinforcement learning
How are the target/prediction networks trained in DQN?
Parameters are copied from the prediction network to the target network every C iterations.
DL-10 - Reinforcement learning
What is experience replay?
A buffer of past experiences is used to stabilize training, by decorrelating the training examples in each batch used to update the NN.
DL-10 - Reinforcement learning
How is the experience replay buffer created?
(See image)
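The image is not shown here; a minimal Python sketch of such a buffer (capacity and names are illustrative, not from the source):

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size store of (s, a, r, s', done) transitions."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random sampling decorrelates consecutive transitions
            return random.sample(list(self.buffer), batch_size)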
DL-10 - Reinforcement learning
Describe the full schematic of using DQNs.
(See image)
DL-10 - Reinforcement learning
List the DQN steps. (7)
1) At state s, select an action a using an epsilon-greedy policy.
2) Perform the action and move to a new state s'.
3) Store transition in the replay buffer.
4) Sample random batches from replay buffer, calculate the loss.
5) Optimization (e.g. gradient descent) for prediction network.
6) After C iterations, copy prediction network params to target network.
7) Repeat for M episodes. (A code sketch of these steps follows below.)
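A compact Python/PyTorch sketch of steps 1 and 4-6, purely illustrative (network sizes, hyperparameters, and variable names are assumptions, not from the source):

    import copy, random
    import torch
    import torch.nn as nn

    STATE_DIM, N_ACTIONS = 4, 2                 # hypothetical environment dimensions
    GAMMA, EPS, C = 0.99, 0.1, 100              # discount, exploration rate, sync interval

    prediction_net = nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    target_net = copy.deepcopy(prediction_net)  # frozen copy provides a stable target
    optimizer = torch.optim.Adam(prediction_net.parameters(), lr=1e-3)

    def select_action(state):                   # step 1: epsilon-greedy selection
        if random.random() < EPS:
            return random.randrange(N_ACTIONS)
        with torch.no_grad():
            return prediction_net(state).argmax().item()

    def train_step(batch, step):
        # step 4: batch of transitions sampled from the replay buffer
        states, actions, rewards, next_states, dones = batch
        q = prediction_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                   # target computed with the frozen network
            target = rewards + GAMMA * target_net(next_states).max(1).values * (1 - dones)
        loss = nn.functional.mse_loss(q, target)   # the Q-loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                        # step 5: optimize the prediction network
        if step % C == 0:                       # step 6: copy params every C iterations
            target_net.load_state_dict(prediction_net.state_dict())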
DL-10 - Reinforcement learning
What are the downsides of Q-learning? (2)
- Complexity - only tractable for small, discrete action spaces.
- Flexibility - Cannot learn stochastic policies.
DL-10 - Reinforcement learning
What is policy learning?
Directly optimizing the policy π(s).
DL-10 - Reinforcement learning
How do you interpret the output of the policy function pi(s)?
The policy outputs P(a|s): the probability that taking action a in state s will lead to the highest reward. The agent selects its next action by sampling from this distribution.
DL-10 - Reinforcement learning
What is the advantage of policy learning?
It's not constrained to a discrete action space. We can parameterize probability distributions however we like, either discrete or continuous.
DL-10 - Reinforcement learning
What is PG short for?
Policy gradient
DL-10 - Reinforcement learning
What are the outputs of PG (2)?
The mean and the variance of the action distribution (e.g. a Gaussian over a continuous action), as separate outputs.
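A minimal PyTorch sketch of a policy network with separate mean and variance heads; here the variance comes from a log-std head (a common trick to keep it positive), and all names and sizes are illustrative:

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, state_dim=4, action_dim=1):  # hypothetical dimensions
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
            self.mean_head = nn.Linear(64, action_dim)     # outputs the mean
            self.log_std_head = nn.Linear(64, action_dim)  # outputs log std (so variance > 0)

        def forward(self, state):
            h = self.body(state)
            dist = torch.distributions.Normal(self.mean_head(h), self.log_std_head(h).exp())
            action = dist.sample()                         # a ~ pi(a|s)
            return action, dist.log_prob(action)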
DL-10 - Reinforcement learning
What are some limitations of using RL in the real world?
You cannot run a lot of exploratory policies in real life.
E.g. a car colliding near people is unacceptable.
DL-10 - Reinforcement learning
How do we get around the limitations of using RL in the real world?
Simulate the environment virtually before deploying to the real world.
DL-10 - Reinforcement learning
What are some problems with RL simulators?
Many simulators are not realistic enough to facilitate transfer from the virtual world to the real world.