Reinforcement Learning Flashcards
In the context of reinforcement learning, what is the gamma in the value function equation?
It is known as the discount factor, and it determines how much the reinforcement learning agent cares about rewards in the distant future relative to those in the immediate future.
If γ=0, the agent is completely myopic and only learns from actions that produce an immediate reward.
What is the advantage of a discount factor?
It serves three purposes: first, it ensures that the sum of rewards converges rather than diverging;
second, it reflects that future rewards are less and less predictable, so we cannot simply sum them with equal weight;
and third, it keeps the return a finite number.
What values can the discount factor take and what does it tell us?
Gamma is usually between 0.9 and 0.99 depending on how stable the system is. A discount rate γ<1 ensures a converging geometric series of rewards.
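As an illustration of why γ<1 keeps the return finite, here is a minimal sketch (the reward values are hypothetical, chosen for illustration) that computes the discounted return and compares it with the geometric-series bound R_max / (1 − γ):

```python
# Discounted return G_t = sum_k gamma^k * r_{t+k}; with gamma < 1 and bounded
# rewards this is a convergent geometric series, bounded by r_max / (1 - gamma).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.99
rewards = [1.0] * 1000                      # hypothetical: constant reward of 1 per step
print(discounted_return(rewards, gamma))    # approaches 1 / (1 - 0.99) = 100
print(1.0 / (1.0 - gamma))                  # geometric-series bound
```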
What is the principle of reinforcement learning?
Given the problem setup, find the optimal policy that maximizes the expected cumulative reward.
What are the three approaches to RL?
Policy-based approach: search directly for the policy achieving maximum reward
Value-based approach: estimate the optimal value function, i.e. the maximum value achievable under any policy
Model-based approach: build a transition model of the environment and plan by looking ahead with that model
In Atari Breakout, why do we need 4 consecutive frames?
Four consecutive frames are needed so that the input contains information about the ball's direction, speed, acceleration, etc., which a single static frame cannot convey.
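A minimal sketch of this frame stacking, assuming 84×84 grayscale frames as in the standard DQN preprocessing (the shapes are illustrative):

```python
from collections import deque
import numpy as np

# Keep the last 4 preprocessed frames; the stacked array is the network input.
frame_history = deque(maxlen=4)

def stack_frames(new_frame):
    frame_history.append(new_frame)
    while len(frame_history) < 4:            # pad at the start of an episode
        frame_history.append(new_frame)
    return np.stack(frame_history, axis=0)   # shape (4, 84, 84)

state = stack_frames(np.zeros((84, 84), dtype=np.float32))
print(state.shape)  # (4, 84, 84)
```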
In Atari breakout, what is the output and what do they correspond to?
The output consists of 18 nodes that correspond to all possible joystick positions (left, right, up, down, the 4 diagonals, and neutral: 9 positions, each with or without the red button pressed).
Give some examples of reinforcement learning
- Control physical systems: walk, fly, drive, swim, …
- Interact with users: retain customers, personalise channel, optimise user experience, …
- Solve logistical problems: scheduling, bandwidth allocation, elevator control, cognitive radio, power optimisation, …
- Play games: chess, checkers, Go, Atari games, …
- Learn sequential algorithms: attention, memory, conditional computation, activations, …
What’s the difference between policy and value function
Policy is a behaviour function selecting actions given states: a = π(s)
Value function Q^π(s, a) is the expected total reward from state s and action a under policy π
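A minimal tabular sketch of the distinction, with a hypothetical Q-table (the states, actions and values are made up for illustration):

```python
# Value function: Q[s][a] = expected total reward from state s, action a.
Q = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 0.9, "right": 0.1},
}

# Policy: a behaviour function mapping a state to an action;
# here, the greedy policy derived from Q.
def policy(state):
    return max(Q[state], key=Q[state].get)

print(policy("s0"))  # "right"
print(policy("s1"))  # "left"
```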
What is the Bellman equation?
It writes the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices.
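For the action-value function above, the Bellman expectation equation takes the standard form (written here for reference):

```latex
Q^{\pi}(s, a) = \mathbb{E}\left[\, r_{t+1} + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s,\ a_t = a \right]
```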
How does deep Q learning work? Describe in 4 steps.
Represent the value function by a deep Q-network with weights w: Q(s, a, w) ≈ Q*(s, a)
Define the objective function as the mean-squared error in Q-values: L(w) = E[(r + γ max_a' Q(s', a', w) − Q(s, a, w))²]
This leads to the Q-learning gradient, proportional to the TD error (r + γ max_a' Q(s', a', w) − Q(s, a, w)) multiplied by ∂Q(s, a, w)/∂w
Optimise the objective end-to-end by SGD, using ∂L(w)/∂w
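A minimal sketch of one such SGD step, using a linear Q-function over a tabular state/action encoding as a stand-in for the deep network (the sizes and learning rate are illustrative assumptions):

```python
import numpy as np

n_states, n_actions, lr, gamma = 5, 3, 0.01, 0.99
w = np.zeros((n_states, n_actions))   # stand-in for deep Q-network weights

def q(s, a):
    return w[s, a]                    # Q(s, a, w)

def q_learning_step(s, a, r, s_next, done):
    # TD target: r + gamma * max_a' Q(s', a', w); no bootstrap at terminal states.
    target = r if done else r + gamma * np.max(w[s_next])
    td_error = target - q(s, a)
    # SGD on the squared TD error moves Q(s, a, w) towards the target.
    w[s, a] += lr * td_error

q_learning_step(s=0, a=1, r=1.0, s_next=2, done=False)
print(w[0, 1])  # 0.01 after one update
```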
What are some Stability Issues with Deep RL?
Naive Q-learning oscillates or diverges with neural nets because:
- Data is sequential: successive samples are correlated, not i.i.d.
- Policy changes rapidly with slight changes to Q-values: the policy may oscillate and the distribution of data can swing from one extreme to another
- The scale of rewards and Q-values is unknown: naive Q-learning gradients can be large and unstable when backpropagated
How do you solve stability issues of Deep value-based reinforcement learning ?
The Deep Q-Network (DQN) provides a stable solution to deep value-based RL by:
- Using experience replay: breaks correlations in the data, bringing us back to the i.i.d. setting, and lets the agent learn from all past policies
- Freezing the target Q-network: avoids oscillations and breaks the correlation between the Q-network and its target
- Clipping rewards, or normalizing the network adaptively, to a sensible range: DQN clips rewards to [−1, +1], which prevents Q-values from becoming too large and ensures gradients are well-conditioned (though the agent can then no longer tell the difference between small and large rewards)
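A minimal sketch of the first two ideas, experience replay and a frozen target network (the buffer size, batch size and copy interval are illustrative assumptions):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience replay: store transitions

def store(s, a, r, s_next, done):
    replay_buffer.append((s, a, r, s_next, done))

def sample_batch(batch_size=32):
    # Uniform random sampling breaks the correlation between successive
    # samples and mixes experience from many past policies.
    return random.sample(replay_buffer, batch_size)

# Frozen target network: targets r + gamma * max_a' Q_target(s', a') are
# computed with a copy of the weights that is only refreshed every C steps.
target_weights = None

def maybe_update_target(online_weights, step, copy_every=10_000):
    global target_weights
    if step % copy_every == 0:
        target_weights = online_weights.copy()
```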
What is the reward in DQN in Atari? What is fixed throughout the game?
The reward is the change in score at that step. The network architecture and hyperparameters are fixed across all games.
What is the advantage of normalised DQN?
Normalized DQN uses the true (unclipped) reward signal and outputs a scalar value in a "stable" range. This output is scaled and translated into Q-values; the policy and value are adapted to ensure the output lies in that stable region, and the network parameters w are adjusted to keep the Q-values constant when the normalization changes.