Reinforcement Learning Flashcards

1
Q

What is reinforcement learning?

A

Operant conditioning

An agent interacting with its environment and building up an internal model through trial and error

2
Q

Components of an RL agent?

A

State
Transition
Action
Reward

3
Q

What's the difference between deterministic and stochastic actions?

A

Deterministic - the chosen action will always occur

Stochastic - the chosen action occurs only with some probability, where the probability represents uncertainty in the outcome

4
Q

How do we represent solutions?

A

Using a policy: a mapping from states to actions

5
Q

How do we evaluate policies?

A

Deterministic: sum the total rewards for following the policy
Stochastic: sum the expected rewards for following the policy
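
In symbols (a rough plain-text sketch, using the discount factor gamma introduced in a later card): the value of following policy pi from state s is

V_pi(s) = E[ R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... | s0 = s, actions chosen by pi ]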

6
Q

What are the two RL algorithm types? Describe them.

A
  1. Model based - agent knows STAR (States, Transitions, Actions, Rewards)

  2. Model free - agent does not know T and R and must learn from trial and error

7
Q

Explain model based RL.

A
Agent is given:
All STATES in the environment
Set of all ACTIONS in each state
TRANSITION probabilities from s to s' given action a
REWARD for each action in each state
8
Q

Explain model free RL.

A

The agent tries different actions in different states to build up an estimate of the TRANSITION probabilities and REWARDS for performing actions.

9
Q

Exploration vs Exploitation

A

Exploration: gathers more information about the environment
Exploitation: uses known information to maximise reward

10
Q

Passive vs Active RL

A

Passive: agent executes a fixed policy then evaluates it
Active: agent updates a policy as it learns

11
Q

Fully vs Partially Observable Environments

A

Fully: the agent is initialised with state information and the reward and transition functions - it knows its current state, the actions that transition it to the next state, and the reward for doing so

Partially: the agent maintains an internal model of the environment that it refines through trial and error, gradually learning the states and transition functions

12
Q

What is a Markov Decision Process

A

A model of an environment that consists of:

  1. finite # of states
  2. probabilistic transitions between states
  3. possible actions at a state
  4. rewards for performing a specific action in a specific state
13
Q

What is the Markov Property?

A

A Markov process is a stochastic process whose future state depends only on the current state and current action – not on past states/actions

The future is independent of the past, given the present
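
In symbols (a plain-text sketch of the property):

P(s_t+1 | s_t, a_t, s_t-1, a_t-1, ..., s_0, a_0) = P(s_t+1 | s_t, a_t)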

14
Q

What are the 4 types of Markov Models

A
  1. MDP - control over state transitions, fully observable
  2. POMDP - control over state transitions, partially observable
  3. Markov Chain - no control over state transitions, fully observable
  4. HMM - no control over state transitions, partially observable

Note that MDP and POMDP involve an agent choosing actions; the Markov Chain and HMM are the corresponding models without actions (the fully and partially observable versions respectively)

15
Q

What is the purpose of Gamma?

A

Discounts future rewards
We can control how much the agent cares about the future
Gamma close to 1 => the agent cares a lot about the distant future
It is a hyperparameter
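
As a rough worked example (the reward stream here is hypothetical): the discounted return is G = r_0 + gamma*r_1 + gamma^2*r_2 + ... With gamma = 0.9, a reward of 1 at each of the next three steps is worth 1 + 0.9 + 0.81 = 2.71 now; with gamma = 0.1 the same stream is worth only 1 + 0.1 + 0.01 = 1.11, so the agent largely ignores the future.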

16
Q

What are some of the advantages of value iteration?

A

Guaranteed to converge

Good for a small number of states

17
Q

What are some disadvantages of value iteration?

A

Slow to converge when there are many states

Knowing T and R in advance is often unrealistic

18
Q

How do you perform value iteration?

A

Initialise the value of each state to zero and iteratively refine V(s) at each time step until it converges to V*(s)

Stop when the difference between the values of two consecutive iterations is below some threshold theta
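
A minimal Python sketch of this procedure. The dictionary layout for T, R and actions below is an assumption for illustration (not from the cards), and it presumes the model-based setting where T and R are known:

# Value iteration sketch (assumed data layout):
#   T[(s, a)]     -> list of (probability, next_state) pairs
#   R[(s, a, s2)] -> reward for that transition
#   actions[s]    -> list of available actions in s (assumed non-empty for every state)
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}            # 1. initialise every V(s) to zero
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected (reward + discounted future value)
            best = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                   # 2. stop when consecutive sweeps differ by < theta
            return V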

19
Q

What is Direct Evaluation?

A

Model free RL

Every time you visit a state, write down what the sum of rewards from that state turns out to be, then average those samples

Do this for every state to estimate each state's value

The agent can then start exploiting

20
Q

How do you perform direct evaluation?

A
  1. Choose a starting state
  2. Choose an action
  3. Set every state traversed to -1 to ensure no backtracking
  4. Sum the rewards collected from the starting state until a terminal state is reached, and average these sums over runs
  5. Repeat for all states

That average will be the estimated value of the starting state s
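
A rough Python sketch of the averaging idea, assuming episodes is a list of [(state, reward), ...] trajectories collected by running the fixed policy (a hypothetical structure chosen for illustration):

from collections import defaultdict

# Direct evaluation sketch: average the observed sum of rewards seen after
# each visit to a state, over many runs of the fixed policy.
def direct_evaluation(episodes, gamma=1.0):
    totals = defaultdict(float)   # sum of observed returns from each state
    counts = defaultdict(int)     # number of visits to each state
    for episode in episodes:      # episode = [(s0, r0), (s1, r1), ..., (terminal, r)]
        G = 0.0
        # walk backwards so G is the return from each state to the end of the run
        for s, r in reversed(episode):
            G = r + gamma * G
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}   # estimated value of each state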

21
Q

What are some advantages of direct evaluation?

A

Easy to understand
Do not need T and R
Eventually computes correct state values

22
Q

What are some of the disadvantages of direct evaluation?

A

Takes a long time

Each state is learned separately (information is not shared between states)

23
Q

What is policy iteration?

A

Model-based learning

Initialise, Evaluate, Improve (repeat until the policy stops changing)

24
Q

How to perform policy iteration?

A

Initialise an arbitrary policy - evaluate it using Bellman's equation (as in value iteration) - then check whether it can be improved, by checking that the best action for every state is the one the current policy already chooses; if not, update the policy and repeat
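
A minimal Python sketch of that loop, reusing the hypothetical T, R and actions layout from the value iteration sketch above:

# Policy iteration sketch: evaluate the current policy, then greedily improve it;
# stop when no state's best action differs from what the policy already chooses.
def policy_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    policy = {s: actions[s][0] for s in states}          # 1. arbitrary initial policy
    while True:
        # 2. evaluation: iterate the Bellman equation for the fixed policy
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[(s, a)])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 3. improvement: is the policy's action still the best one in every state?
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[(s, a)]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V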

25
Q

What are some advantages of policy iteration?

A

Converges on an optimal policy

26
Q

What are some disadvantages of policy iteration?

A

Slow.

27
Q

What is Q learning?

A
Model Free RL
POMDP
In every iteration:
  1. agent observes state s
  2. takes action a
  3. observes new state s'
  4. receives reward r

28
Q

What is the basis of Q learning? Explain this concept.

A

Temporal credit assignment. The agent takes a path that maximises total future reward, i.e. it will not follow a purely greedy policy but balances immediate and future rewards.

29
Q

What is temporal difference learning?

A

Basis of Q learning

Adjust the estimated value of a state based on immediate reward and the estimated value of the next state
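
In symbols (the standard TD update, with alpha as the learning rate from the later Q learning cards):

V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))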

30
Q

How do you Q learn?

A
  1. Initialise the Q table to 0
  2. Observe the current state s
  3. Loop:
    a. select an action
    b. receive the immediate reward and observe the new state s'
    c. update the Q table using the update formula
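
The update in step (c) is the standard Q-learning rule; a small Python sketch, assuming Q is a dict keyed by (state, action) and alpha is the learning rate discussed in the next cards:

# Q-learning update sketch:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions[s_next])   # value of the best next action
    td_error = r + gamma * best_next - Q[(s, a)]                 # temporal difference error
    Q[(s, a)] += alpha * td_error
    return Q
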
31
Q

What are some Q learning properties?

A

Converges if:
There is enough exploration
It shifts towards exploitation over time
The learning rate is between 0 and 1

Please read over methods of changing the learning rate

32
Q

What does learning rate do? Q learning

A

Controls how drastically the Q value of the current state changes on each update

33
Q

How do we choose actions in Q learning?

A

Epsilon-greedy approach
Epsilon determines the probability that we take a random action
Over time epsilon must decrease (shifting from exploration to exploitation)
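
A small Python sketch of epsilon-greedy selection, assuming the same Q dict and actions layout as the Q-learning sketch above:

import random

# Epsilon-greedy sketch: with probability epsilon take a random action (explore),
# otherwise take the action with the highest Q value (exploit).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions[s])
    return max(actions[s], key=lambda a: Q[(s, a)])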

34
Q

What are the three types of learning in multi-agent systems?

A

Cooperation
Competition
Individual