Reinforcement Learning Flashcards

1
Q

In the context of reinforcement learning, what is the gamma in the value function equation?

A

It is known as the discount factor, and it determines how much the reinforcement learning agent cares about rewards in the distant future relative to those in the immediate future.

If γ=0, the agent will be completely myopic and only learn about actions that produce an immediate reward.
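For reference (a standard definition, not spelled out on the card), γ enters through the discounted return that the value function measures:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}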

2
Q

What is the advantage of a discount factor?

A

First, it ensures that the whole expression (the infinite sum of rewards) converges rather than growing without bound; second, it reflects that future rewards are less and less predictable, so we cannot simply sum them all with equal weight; and third, it keeps the return a finite number.
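A one-line justification of the convergence point (a standard argument, assuming rewards are bounded by some R_max):

\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\le\; \sum_{k=0}^{\infty} \gamma^k R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty \quad \text{for } 0 \le \gamma < 1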

3
Q

What values can the discount factor take and what does it tell us?

A

Gamma is usually between 0.9 and 0.99, depending on how stable the system is. A discount factor γ < 1 ensures that the geometric series of rewards converges.
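A minimal Python sketch (my own illustration, not from the card) of why γ between 0.9 and 0.99 roughly corresponds to caring about the next 10 to 100 steps, using the common 1/(1-γ) effective-horizon rule of thumb:

# Rule of thumb (assumption, not from the card): the weight gamma**k decays
# geometrically, and after roughly 1/(1-gamma) steps it has fallen to about 1/e.
for gamma in (0.9, 0.99):
    horizon = 1.0 / (1.0 - gamma)            # ~10 steps for 0.9, ~100 steps for 0.99
    remaining_weight = gamma ** horizon      # weight left at that horizon (~0.37)
    print(f"gamma={gamma}: effective horizon ~{horizon:.0f} steps, "
          f"weight remaining there ~{remaining_weight:.2f}")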

4
Q

What is the principle of reinforcement learning?

A

Given the assumptions about the environment (its states, actions and rewards), find the optimal policy that maximises the expected cumulative reward.
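In symbols (a standard formulation of the same statement):

\pi^* = \arg\max_{\pi}\; \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, \pi \right]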

5
Q

What are the three approaches to RL?

A

Policy-based approach: where you search directly for the policy achieving maximum reward

Value-based approach: where you estimate the optimal value function, i.e. the maximum value achievable under any policy

Model-based approach: where you build a transition model of the environment and plan by looking ahead with it

6
Q

In Atari Breakout, why do we need 4 consecutive frames?

A

A single frame is a static image, so 4 consecutive frames are stacked to give the network information about the ball's direction, speed, acceleration, etc.
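A minimal sketch of the frame-stacking idea in Python (my own illustration; the 84x84 grayscale shape follows the usual DQN preprocessing and is an assumption here):

import numpy as np
from collections import deque

frame_stack = deque(maxlen=4)        # keeps only the 4 most recent frames

def preprocess(frame):
    # Placeholder: real DQN preprocessing converts to grayscale and resizes to 84x84.
    return frame

def observe(new_frame):
    frame_stack.append(preprocess(new_frame))
    while len(frame_stack) < 4:                      # pad with copies at episode start
        frame_stack.append(frame_stack[-1])
    return np.stack(list(frame_stack), axis=-1)      # shape (84, 84, 4): the network input

state = observe(np.zeros((84, 84), dtype=np.float32))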

7
Q

In Atari Breakout, what are the outputs and what do they correspond to?

A

The output consists of 18 nodes that correspond to all possible joystick/button actions: 9 joystick positions (left, right, up, down, the 4 diagonals, and neutral), each with and without the "red button" pressed.

8
Q

Give some examples of reinforcement learning

A
  • -> Control physical systems: walk, fly, drive, swim, …
  • -> Interact with users: retain customers, personalise channel, optimise user experience, …
  • -> Solve logistical problems: scheduling, bandwidth allocation, elevator control, cognitive radio, power optimisation, …
  • -> Play games: chess, checkers, Go, Atari games, …
  • -> Learn sequential algorithms: attention, memory, conditional computation, activations, …
9
Q

What's the difference between a policy and a value function?

A

A policy is a behaviour function selecting actions given states: a = π(s).

The value function Qπ(s, a) is the expected total reward from state s and action a under policy π.
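In symbols (the standard definitions these two lines abbreviate):

a = \pi(s), \qquad Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a, \pi \right]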

10
Q

What is the Bellman equation?

A

It writes the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices.
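One standard way to write it for the action-value function used in Q-learning (the Bellman optimality equation; not spelled out on the card):

Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]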

11
Q

How does deep Q-learning work? Describe it in 4 steps.

A

  1. Represent the value function by a deep Q-network with weights w.
  2. Define the objective function as the mean-squared error in Q-values.
  3. Derive the corresponding Q-learning gradient dL(w)/dw.
  4. Optimise the objective end-to-end by SGD, using dL(w)/dw.
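Written out (the standard DQN formulation; the equations are not reproduced on the card):

L(w) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \right)^2 \right]

Minimising L(w) by SGD, with the target r + \gamma \max_{a'} Q(s', a'; w) treated as fixed (a semi-gradient), moves the weights along

\Delta w \propto \left( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \right) \frac{\partial Q(s, a; w)}{\partial w}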

12
Q

What are some stability issues with deep RL?

A

Naive Q-learning oscillates or diverges with neural nets:

  1. Data is sequential: successive samples are correlated, not i.i.d.
  2. The policy changes rapidly with slight changes to Q-values: the policy may oscillate, and the distribution of data can swing from one extreme to another.
  3. The scale of rewards and Q-values is unknown: naive Q-learning gradients can be large and unstable when backpropagated.
13
Q

How do you solve the stability issues of deep value-based reinforcement learning?

A

Deep Q-networks (DQN) provide a stable solution to deep value-based RL by:

  1. Using experience replay: break the correlations in the data, bringing us back to an i.i.d. setting, and learn from all past policies.
  2. Freezing the target Q-network: avoids oscillations and breaks the correlation between the Q-network and its target.
  3. Clipping rewards (or normalising the network adaptively) to a sensible range: DQN clips rewards to [-1, +1], which prevents Q-values from becoming too large and ensures well-conditioned gradients, at the cost of not being able to tell small and large rewards apart.
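A minimal Python sketch of the experience-replay idea from point 1 (my own illustration; the capacity and batch size are arbitrary choices):

import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample them uniformly at random to break temporal correlations."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling gives an approximately i.i.d. minibatch
        # drawn from all past policies stored in the buffer.
        return random.sample(self.buffer, batch_size)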
14
Q

What is the reward in DQN in Atari? What is fixed throughout the game?

A

The reward is the change in score for that step. The network architecture and hyperparameters are fixed across all games.

15
Q

What is the advantage of normalised DQN?

A

Normalised DQN uses the true (unclipped) reward signal and has the network output a scalar value in a "stable" range. This output is scaled and translated into Q-values; the scale and shift are adapted so that the raw output stays in the stable region, and the network parameters w are adjusted to keep the Q-values constant.

16
Q

What is unique about the Gorila (GOogle ReInforcement Learning Architecture)?

A
  • -> Parallel acting: generate new interactions
  • -> Distributed replay memory: save interactions
  • -> Parallel learning: compute gradients from replayed interactions
  • -> Distributed neural network: update network from gradients
17
Q

Vanilla DQN is unstable when applied in parallel. How do we tackle this?

A

We use:

  • -> Reject stale gradients
  • -> Reject outlier gradients with |g| > μ + kσ, i.e. more than k standard deviations above the mean gradient norm
  • -> AdaGrad optimisation
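A hypothetical Python sketch of the outlier-rejection rule above (my own illustration; the window size, k, and the idea of tracking gradient norms are assumptions):

import numpy as np

class GradientFilter:
    """Reject a gradient whose norm is more than k standard deviations
    above the mean of recently observed gradient norms."""

    def __init__(self, k=3.0, window=1000):
        self.k = k
        self.window = window
        self.norms = []

    def accept(self, grad):
        norm = float(np.linalg.norm(grad))
        if len(self.norms) >= 10:                       # wait for some statistics first
            mu, sigma = np.mean(self.norms), np.std(self.norms)
            if norm > mu + self.k * sigma:
                return False                            # reject outlier gradient
        self.norms.append(norm)
        self.norms = self.norms[-self.window:]          # keep a sliding window of norms
        return True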

18
Q

How does the Deterministic Actor-Critic work?

A

Critic estimates value of current policy by Q-learning

Actor updates policy in direction that improves Q
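In update form (the standard deterministic policy-gradient expressions, with critic weights w and actor weights u; these equations are not written out on the card):

Critic (Q-learning on the current policy):
\Delta w \propto \left( r + \gamma\, Q(s', \pi(s'; u); w) - Q(s, a; w) \right) \frac{\partial Q(s, a; w)}{\partial w}

Actor (move the policy in the direction that increases Q):
\Delta u \propto \frac{\partial Q(s, \pi(s; u); w)}{\partial a}\, \frac{\partial \pi(s; u)}{\partial u}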

19
Q

How do Gorila and Vanilla DQN compare?

A

Using 100 parallel actors and learners:

  • -> Gorila significantly outperformed Vanilla DQN on 41 out of 49 Atari games
  • -> Gorila achieved 2x the score of Vanilla DQN on 22 out of 49 Atari games
  • -> Gorila matched Vanilla DQN results 10x faster on 38 out of 49 Atari games

20
Q

Name some challenges of model-based RL

A

Compounding errors

  • -> Errors in the transition model compound over the trajectory
  • -> By the end of a long trajectory, rewards can be totally wrong
  • -> Model-based RL has failed (so far) in Atari

Deep value/policy networks can "plan" implicitly:

  • -> Each layer of the network performs an arbitrary computational step
  • -> An n-layer network can "look ahead" n steps
  • -> Are transition models required at all?
21
Q

How do Go programs use Monte-Carlo tree search?

A

Monte-Carlo tree search

  • -> Monte-Carlo tree search (MCTS) simulates future trajectories
  • -> It builds a large lookahead search tree with millions of positions
  • -> State-of-the-art 19×19 Go programs use MCTS
  • -> e.g. the first strong Go program, MoGo

22
Q

How do Go programs use convolutional networks?

A

Convolutional networks

  • -> A 12-layer convnet trained to predict expert moves
  • -> The raw convnet (looking at a single position, with no search at all)
  • -> Equals the performance of MoGo with a 10^5-position search tree