Reinforcement Learning Flashcards
In the context of reinforcement learning, what is the gamma in the value function equation?
It is known as the discount factor, and it determines how much the reinforcement learning agent cares about rewards in the distant future relative to those in the immediate future.
If γ=0, the agent is completely myopic and only learns from actions that produce an immediate reward.
What is the advantage of a discount factor?
It serves three purposes: first, it ensures that the sum of rewards converges rather than diverging;
second, it reflects that future rewards are less and less predictable, so we cannot simply sum them with equal weight;
and third, it keeps the return a finite number.
What values can the discount factor take and what does it tell us?
Gamma is usually between 0.9 and 0.99 depending on how stable the system is. A discount rate γ<1 ensures a converging geometric series of rewards.
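As an illustration of why γ<1 keeps the return finite, here is a minimal sketch (the reward values are hypothetical, chosen for illustration) that computes the discounted return and compares it with the geometric-series bound R_max / (1 − γ):

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k}; with gamma < 1 and bounded
# rewards this is a convergent geometric series, bounded by r_max / (1 - gamma).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.99
rewards = [1.0] * 1000                      # hypothetical: constant reward of 1 per step
print(discounted_return(rewards, gamma))    # approaches 1 / (1 - 0.99) = 100
print(1.0 / (1.0 - gamma))                  # geometric-series bound
```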
What is the principle of reinforcement learning?
Given the problem setup, find the optimal policy that maximizes the expected cumulative reward.
What are the three approaches to RL?
Policy-based approach: search directly for the policy achieving maximum reward
Value-based approach: estimate the optimal value function, i.e. the maximum value achievable under any policy
Model-based approach: build a transition model of the environment and plan by looking ahead with that model
In Atari Breakout, why do we need 4 consecutive frames?
Four consecutive frames are needed so that the input contains information about the ball's direction, speed, acceleration, etc., which a single static frame cannot convey.
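A minimal sketch of this frame stacking, assuming 84×84 grayscale frames as in the standard DQN preprocessing (the shapes are illustrative):

```python
from collections import deque
import numpy as np

# Keep the last 4 preprocessed frames; the stacked array is the network input.
frame_history = deque(maxlen=4)

def stack_frames(new_frame):
    frame_history.append(new_frame)
    while len(frame_history) < 4:            # pad at the start of an episode
        frame_history.append(new_frame)
    return np.stack(frame_history, axis=0)   # shape (4, 84, 84)

state = stack_frames(np.zeros((84, 84), dtype=np.float32))
print(state.shape)  # (4, 84, 84)
```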
In Atari breakout, what is the output and what do they correspond to?
The output consists of 18 nodes that correspond to all possible joystick positions (left, right, up, down, the 4 diagonals, and neutral: 9 positions, each with or without the red button pressed).
Give some examples of reinforcement learning
- Control physical systems: walk, fly, drive, swim, …
- Interact with users: retain customers, personalise channel, optimise user experience, …
- Solve logistical problems: scheduling, bandwidth allocation, elevator control, cognitive radio, power optimisation, …
- Play games: chess, checkers, Go, Atari games, …
- Learn sequential algorithms: attention, memory, conditional computation, activations, …
What’s the difference between policy and value function
Policy is a behaviour function selecting actions given states: a = π(s)
Value function Q^π(s, a) is the expected total reward from state s and action a under policy π
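A minimal tabular sketch of the distinction, with a hypothetical Q-table (the states, actions and values are made up for illustration):

```python
# Value function: Q[s][a] = expected total reward from state s, action a.
Q = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.9, "right": 0.1},
}

# Policy: a behaviour function mapping a state to an action;
# here, the greedy policy derived from Q.
def policy(state):
    return max(Q[state], key=Q[state].get)

print(policy("s0"))  # "right"
print(policy("s1"))  # "left"
```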
What is the Bellman equation?
It writes the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices.
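For the action-value function above, the Bellman expectation equation takes the standard form (written here for reference):

```latex
Q^{\pi}(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s,\ a_t = a \right]
```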
How does deep Q learning work? Describe in 4 steps.
Represent the value function by a deep Q-network with weights w: Q(s, a, w) ≈ Q*(s, a)
Define the objective function as the mean-squared error in Q-values: L(w) = E[(r + γ max_a' Q(s', a', w) − Q(s, a, w))²]
This leads to the Q-learning gradient, proportional to the TD error (r + γ max_a' Q(s', a', w) − Q(s, a, w)) multiplied by ∂Q(s, a, w)/∂w
Optimise the objective end-to-end by SGD, using ∂L(w)/∂w
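A minimal sketch of one such SGD step, using a linear Q-function over a tabular state/action encoding as a stand-in for the deep network (the sizes and learning rate are illustrative assumptions):

```python
import numpy as np

n_states, n_actions, lr, gamma = 5, 3, 0.01, 0.99
w = np.zeros((n_states, n_actions))   # stand-in for deep Q-network weights

def q(s, a):
    return w[s, a]                    # Q(s, a, w)

def q_learning_step(s, a, r, s_next, done):
    # TD target: r + gamma * max_a' Q(s', a', w); no bootstrap at terminal states.
    target = r if done else r + gamma * np.max(w[s_next])
    td_error = target - q(s, a)
    # SGD on the squared TD error moves Q(s, a, w) towards the target.
    w[s, a] += lr * td_error

q_learning_step(s=0, a=1, r=1.0, s_next=2, done=False)
print(w[0, 1])  # 0.01 after one update
```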
What are some Stability Issues with Deep RL?
Naive Q-learning oscillates or diverges with neural nets because:
- Data is sequential: successive samples are correlated, not i.i.d.
- Policy changes rapidly with slight changes to Q-values: the policy may oscillate and the distribution of data can swing from one extreme to another
- The scale of rewards and Q-values is unknown: naive Q-learning gradients can be large and unstable when backpropagated
How do you solve stability issues of Deep value-based reinforcement learning ?
The Deep Q-Network (DQN) provides a stable solution to deep value-based RL by:
- Using experience replay: breaks correlations in the data, bringing us back to the i.i.d. setting, and lets the agent learn from all past policies
- Freezing the target Q-network: avoids oscillations and breaks the correlation between the Q-network and its target
- Clipping rewards, or normalizing the network adaptively, to a sensible range: DQN clips rewards to [−1, +1], which prevents Q-values from becoming too large and ensures gradients are well-conditioned (though the agent can then no longer tell the difference between small and large rewards)
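A minimal sketch of the first two ideas, experience replay and a frozen target network (the buffer size, batch size and copy interval are illustrative assumptions):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience replay: store transitions

def store(s, a, r, s_next, done):
    replay_buffer.append((s, a, r, s_next, done))

def sample_batch(batch_size=32):
    # Uniform random sampling breaks the correlation between successive
    # samples and mixes experience from many past policies.
    return random.sample(replay_buffer, batch_size)

# Frozen target network: targets r + gamma * max_a' Q_target(s', a') are
# computed with a copy of the weights that is only refreshed every C steps.
target_weights = None

def maybe_update_target(online_weights, step, copy_every=10_000):
    global target_weights
    if step % copy_every == 0:
        target_weights = online_weights.copy()
```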
What is the reward in DQN in Atari? What is fixed throughout the game?
The reward is the change in score at that step. The network architecture and hyperparameters are fixed across all games.
What is the advantage of normalised DQN?
Normalized DQN uses the true (unclipped) reward signal and outputs a scalar value in a "stable" range. This output is scaled and translated into Q-values; the policy and value are adapted to ensure the output lies in that stable region, and the network parameters w are adjusted to keep the Q-values constant when the normalization changes.