Reinforcement Learning Flashcards

1
Q

What is reinforcement learning?

A

Operant conditioning

An agent interacting with its environment by building up an internal model through trial and error

2
Q

Components of an RL agent?

A

State
Transition
Action
Reward

3
Q

What's the difference between deterministic and stochastic actions?

A

Deterministic - the chosen action always occurs

Stochastic - the chosen action occurs only with some probability, where the probability represents uncertainty

4
Q

How do we represent solutions?

A

Using a policy: a mapping from states to actions

5
Q

How do we evaluate policies?

A

Deterministic: sum the total rewards obtained by following the policy
Stochastic: sum the expected rewards obtained by following the policy

6
Q

What are the two RL algorithm types? Describe them.

A
  1. Model-based - agent knows STAR (States, Transitions, Actions, Rewards)

  2. Model-free - agent does not know T and R and must learn them through trial and error

7
Q

Explain model based RL.

A
Agent is given:
All STATES in the environment
Set of all ACTIONS in each state
TRANSITION probabilities between s and s' given a
REWARD for each action in each state
8
Q

Explain model free RL.

A

The agent tries different actions in different states to build an estimate of the TRANSITION probabilities and REWARDS for performing actions.

9
Q

Exploration vs Exploitation

A

Exploration: Finds more info about environment
Exploitation: Uses known info to maximise reward

10
Q

Passive vs Active RL

A

Passive: agent executes a fixed policy then evaluates it
Active: agent updates a policy as it learns

11
Q

Fully vs Partially Observable Environments

A

Fully: the agent is initialised with state information and the reward and transition functions - it knows the current state, the actions available to transition to the next state, and the reward for doing so

Partially: the agent has an internal model of the environment that it refines through trial and error, gradually learning the states and transition functions

12
Q

What is a Markov Decision Process?

A

A model of an environment that consists of:

  1. finite # of states
  2. probabilistic transitions between states
  3. possible actions at a state
  4. rewards for performing a specific action in a specific state
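
As an illustrative sketch (not part of the card), such an MDP can be written out explicitly in Python; the states, actions, probabilities and rewards below are made-up examples:

```python
# Hypothetical two-state MDP: states, actions, transition probabilities, rewards.
states = ["s0", "s1"]
actions = ["stay", "move"]

# T[(s, a)] maps each possible next state s' to P(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},   # probabilistic transition
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}

# R[(s, a)] is the reward for performing action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```
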
13
Q

What is the Markov Property?

A

A Markov process is a stochastic process whose future state depends only on the current state and current action, not on past states/actions

Future is independent of the past given the present

14
Q

What are the 4 types of Markov Models?

A
  1. MDP - control over state transitions, fully observable
  2. POMDP - control over state transitions, partially observable
  3. Markov Chain - no control over state transitions, fully observable
  4. HMM - no control over state transitions, partially observable

Note that MDP and POMDP describe processes where the agent chooses actions; the Markov chain and HMM are the corresponding models without actions.

15
Q

What is the purpose of Gamma?

A

Used to discount rewards
We can control how much the agent cares about the future
Gamma close to 1 => agent cares a lot about the distant future
It is a hyperparameter
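
As an illustration (a sketch with made-up numbers), the discounted return of a reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Gamma close to 1: distant rewards count almost as much as immediate ones.
print(discounted_return([1, 1, 1, 1], gamma=0.99))  # ~3.94
print(discounted_return([1, 1, 1, 1], gamma=0.10))  # ~1.11
```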

16
Q

What are some of the advantages of value iteration?

A

Will converge

Good for a small number of states

17
Q

What are some disadvantages of value iteration?

A

Slow to converge for many states

Knowing T and R is unrealistic

18
Q

How do you perform value iteration?

A

Initialise the value of each state to zero and iteratively refine V(s) at each time step so that it converges to V*(s)

Stop when the difference between the values of two consecutive iterations is below some threshold theta
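
A minimal value-iteration sketch, assuming T and R are given as dictionaries keyed by (state, action) as in the MDP card above:

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}   # initialise the value of each state to zero
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best immediate reward plus discounted expected next value
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in actions if (s, a) in T
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop when two consecutive sweeps barely differ
            return V
```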

19
Q

What is Direct Evaluation?

A

Model-free RL

Every time you visit a state, record what the sum of rewards from that point turns out to be, then average those samples

Do this for every state to estimate each state's value

The agent can then start exploiting

20
Q

How do you perform direct evaluation?

A
  1. Choose a starting state
  2. Choose an action
  3. Set every state traversed to -1 to ensure no backtracking
  4. Average the sums of rewards collected from that particular starting state to every terminal state
  5. Repeat for all states

That average will be the estimated return of the starting state s
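
A rough sketch of the averaging step, assuming each episode has been recorded as a list of (state, reward) pairs while following the fixed policy:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed return from each visited state over many episodes."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:                     # episode = [(state, reward), ...]
        G = 0.0
        for state, reward in reversed(episode):  # accumulate the return backwards
            G = reward + gamma * G
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```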

21
Q

What are some advantages of direct evaluation?

A

Easy to understand
Do not need T and R
Eventually computes correct state values

22
Q

What are some of the disadvantages of direct evaluation?

A

Takes a long time

Each state is learned separately

23
Q

What is policy iteration?

A

Model-based learning

Initialise, Evaluate, Improve

24
Q

How to perform policy iteration?

A

Initialise an arbitrary policy - evaluate it using the Bellman equation (as in value iteration) - check whether it can be improved by verifying that the best action for every state is the one currently in the policy
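
A compressed sketch of that loop, assuming the same dictionary representation of T and R and that every action is available in every state:

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
    policy = {s: actions[0] for s in states}          # initialise an arbitrary policy
    while True:
        # Evaluate: back up state values under the current policy
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
        # Improve: in each state, pick the best action given the evaluated values
        improved = {
            s: max(actions, key=lambda a: R[(s, a)] + gamma *
                   sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in states
        }
        if improved == policy:                        # no change => policy is stable
            return policy, V
        policy = improved
```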

25
Q

What are some advantages of policy iteration?

A

Converges on an optimal policy

26
Q

What are some disadvantages of policy iteration?

A

Slow.

27
Q

What is Q learning?

A

Model-free RL (POMDP)

In every iteration:
  1. agent observes state s
  2. takes action a
  3. observes new state s'
  4. receives reward r

28
Q

What is the basis of Q learning? Explain this concept.

A

Temporal credit assignment. The agent takes a path that maximises total future reward, i.e. it will not follow a purely greedy policy - it balances current and future rewards.

29
Q

What is temporal difference learning?

A

The basis of Q learning

Adjust the estimated value of a state based on the immediate reward and the estimated value of the next state

30
Q

How do you Q learn?

A

  1. Initialise the Q table to 0
  2. Observe the current state s
  3. Loop:
     a. select an action
     b. receive the immediate reward and observe the new state s'
     c. update the Q table using the formula

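A sketch of step 3c, the tabular update formula, assuming a learning rate alpha and a discount factor gamma:

```python
from collections import defaultdict

Q = defaultdict(float)   # step 1: every Q(s, a) starts at 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Step 3c: move Q(s, a) toward the immediate reward plus the best estimated future value."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```
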
31
Q

What are some Q learning properties?

A

Converges if:
there is enough exploration
the agent exploits over time
the learning rate is between 0 and 1

(Please read over methods of changing the learning rate.)

32
Q

What does the learning rate do in Q learning?

A

It controls how drastically the Q value of the current state changes on each update

33
Q

How do we choose actions in Q learning?

A

Epsilon-greedy approach
Epsilon determines the likelihood that we take a random action
Over time epsilon must decrease

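A minimal epsilon-greedy sketch, reusing the (hypothetical) Q table from the previous card:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)             # explore: random action
    return max(actions, key=lambda a: Q[(s, a)])  # exploit: highest Q value
```
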
34
Q

What are the three types of learning in multi-agent systems?

A

Cooperation
Competition
Individual