Reinforcement Learning Flashcards

1
Q

What is reinforcement learning?

A

Operant conditioning

An agent interacting with its environment and building up an internal model through trial and error

2
Q

Components of an RL agent?

A

State
Transition
Action
Reward

3
Q

What's the difference between deterministic and stochastic actions?

A

Deterministic - the chosen action will always occur

Stochastic - the chosen action occurs only with some probability, where the probability represents uncertainty in the outcome

4
Q

How do we represent solutions?

A

Using a policy: a mapping from states to actions

5
Q

How do we evaluate policies?

A

Deterministic: sum the total rewards for following the policy
Stochastic: sum the expected rewards for following the policy
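
In symbols (a rough plain-text sketch, using the discount factor gamma introduced in a later card): the value of following policy pi from state s is

V_pi(s) = E[ R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... | s0 = s, actions chosen by pi ]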

6
Q

What are the two RL algorithm types? Describe them.

A
  1. Model based - agent knows STAR (States, Transitions, Actions, Rewards)

  2. Model free - agent does not know T and R and must learn from trial and error

7
Q

Explain model based RL.

A
Agent is given:
All STATES in the environment
Set of all ACTIONS in each state
TRANSITION probabilities from s to s' given action a
REWARD for each action in each state
8
Q

Explain model free RL.

A

The agent tries different actions in different states to build up an estimate of the TRANSITION probabilities and REWARDS for performing actions.

9
Q

Exploration vs Exploitation

A

Exploration: gathers more information about the environment
Exploitation: uses known information to maximise reward

10
Q

Passive vs Active RL

A

Passive: agent executes a fixed policy then evaluates it
Active: agent updates a policy as it learns

11
Q

Fully vs Partially Observable Environments

A

Fully: the agent is initialised with state information and the reward and transition functions - it knows its current state, the actions that transition it to the next state, and the reward for doing so

Partially: the agent maintains an internal model of the environment that it refines through trial and error, gradually learning the states and transition functions

12
Q

What is a Markov Decision Process

A

A model of an environment that consists of:

  1. finite # of states
  2. probabilistic transitions between states
  3. possible actions at a state
  4. rewards for performing a specific action in a specific state
13
Q

What is the Markov Property?

A

A Markov process is a stochastic process whose future state depends only on the current state and current action – not on past states/actions

The future is independent of the past, given the present
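
In symbols (a plain-text sketch of the property):

P(s_t+1 | s_t, a_t, s_t-1, a_t-1, ..., s_0, a_0) = P(s_t+1 | s_t, a_t)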

14
Q

What are the 4 types of Markov Models

A
  1. MDP - control over state transitions, fully observable
  2. POMDP - control over state transitions, partially observable
  3. Markov Chain - no control over state transitions, fully observable
  4. HMM - no control over state transitions, partially observable

Note that MDP and POMDP involve an agent choosing actions; the Markov Chain and HMM are the corresponding models without actions (the fully and partially observable versions respectively)

15
Q

What is the purpose of Gamma?

A

Discounts future rewards
We can control how much the agent cares about the future
Gamma close to 1 => the agent cares a lot about the distant future
It is a hyperparameter
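
As a rough worked example (the reward stream here is hypothetical): the discounted return is G = r_0 + gamma*r_1 + gamma^2*r_2 + ... With gamma = 0.9, a reward of 1 at each of the next three steps is worth 1 + 0.9 + 0.81 = 2.71 now; with gamma = 0.1 the same stream is worth only 1 + 0.1 + 0.01 = 1.11, so the agent largely ignores the future.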

16
Q

What are some of the advantages of value iteration?

A

Guaranteed to converge

Good for a small number of states

17
Q

What are some disadvantages of value iteration?

A

Slow to converge when there are many states

Knowing T and R in advance is often unrealistic

18
Q

How do you perform value iteration?

A

Initialise the value of each state to zero and iteratively refine V(s) at each time step until it converges to V*(s)

Stop when the difference between the values of two consecutive iterations is below some threshold theta
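
A minimal Python sketch of this procedure. The dictionary layout for T, R and actions below is an assumption for illustration (not from the cards), and it presumes the model-based setting where T and R are known:

# Value iteration sketch (assumed data layout):
#   T[(s, a)]     -> list of (probability, next_state) pairs
#   R[(s, a, s2)] -> reward for that transition
#   actions[s]    -> list of available actions in s (assumed non-empty for every state)
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}            # 1. initialise every V(s) to zero
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best expected (reward + discounted future value)
            best = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                   # 2. stop when consecutive sweeps differ by < theta
            return V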

19
Q

What is Direct Evaluation?

A

Model free RL

Every time you visit a state, write down what the sum of rewards from that state turns out to be, then average those samples

Do this for every state to estimate each state's value

The agent can then start exploiting

20
Q

How do you perform direct evaluation?

A
  1. Choose a starting state
  2. Choose an action
  3. Set every state traversed to -1 to ensure no backtracking
  4. Sum the rewards collected from the starting state until a terminal state is reached, and average these sums over runs
  5. Repeat for all states

That average will be the estimated value of the starting state s
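
A rough Python sketch of the averaging idea, assuming episodes is a list of [(state, reward), ...] trajectories collected by running the fixed policy (a hypothetical structure chosen for illustration):

from collections import defaultdict

# Direct evaluation sketch: average the observed sum of rewards seen after
# each visit to a state, over many runs of the fixed policy.
def direct_evaluation(episodes, gamma=1.0):
    totals = defaultdict(float)   # sum of observed returns from each state
    counts = defaultdict(int)     # number of visits to each state
    for episode in episodes:      # episode = [(s0, r0), (s1, r1), ..., (terminal, r)]
        G = 0.0
        # walk backwards so G is the return from each state to the end of the run
        for s, r in reversed(episode):
            G = r + gamma * G
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}   # estimated value of each state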

21
Q

What are some advantages of direct evaluation?

A

Easy to understand
Do not need T and R
Eventually computes correct state values

22
Q

What are some of the disadvantages of direct evaluation?

A

Takes a long time

Each state is learned separately (information is not shared between states)

23
Q

What is policy iteration?

A

Model-based learning

Initialise, Evaluate, Improve (repeat until the policy stops changing)

24
Q

How to perform policy iteration?

A

Initialise an arbitrary policy - evaluate it using Bellman's equation (as in value iteration) - then check whether it can be improved, by checking that the best action for every state is the one the current policy already chooses; if not, update the policy and repeat
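
A minimal Python sketch of that loop, reusing the hypothetical T, R and actions layout from the value iteration sketch above:

# Policy iteration sketch: evaluate the current policy, then greedily improve it;
# stop when no state's best action differs from what the policy already chooses.
def policy_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    policy = {s: actions[s][0] for s in states}          # 1. arbitrary initial policy
    while True:
        # 2. evaluation: iterate the Bellman equation for the fixed policy
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[(s, a)])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 3. improvement: is the policy's action still the best one in every state?
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[(s, a)]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V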

25
Q

What are some advantages of policy iteration?

A

Converges on an optimal policy

26
Q

What are some disadvantages of policy iteration?

A

Slow.

27
Q

What is Q learning?

A
Model Free RL
POMDP
In every iteration:
  1. agent observes state s
  2. takes action a
  3. observes new state s'
  4. receives reward r

28
Q

What is the basis of Q learning? Explain this concept.

A

Temporal credit assignment. The agent takes a path that maximises total future reward, i.e. it will not follow a purely greedy policy but balances immediate and future rewards.

29
Q

What is temporal difference learning?

A

Basis of Q learning

Adjust the estimated value of a state based on immediate reward and the estimated value of the next state
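
In symbols (the standard TD update, with alpha as the learning rate from the later Q learning cards):

V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))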

30
Q

How do you Q learn?

A
  1. Initialise the Q table to 0
  2. Observe the current state s
  3. Loop:
    a. select an action
    b. receive the immediate reward and observe the new state s'
    c. update the Q table using the update formula
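
The update in step (c) is the standard Q-learning rule; a small Python sketch, assuming Q is a dict keyed by (state, action) and alpha is the learning rate discussed in the next cards:

# Q-learning update sketch:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions[s_next])   # value of the best next action
    td_error = r + gamma * best_next - Q[(s, a)]                 # temporal difference error
    Q[(s, a)] += alpha * td_error
    return Q
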
31
Q

What are some Q learning properties?

A

Converges if:
There is enough exploration
It shifts towards exploitation over time
The learning rate is between 0 and 1

Please read over methods of changing the learning rate

32
Q

What does learning rate do? Q learning

A

Controls how drastically the Q value of the current state changes on each update

33
Q

How do we choose actions in Q learning?

A

Epsilon-greedy approach
Epsilon determines the probability that we take a random action
Over time epsilon must decrease (shifting from exploration to exploitation)
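
A small Python sketch of epsilon-greedy selection, assuming the same Q dict and actions layout as the Q-learning sketch above:

import random

# Epsilon-greedy sketch: with probability epsilon take a random action (explore),
# otherwise take the action with the highest Q value (exploit).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions[s])
    return max(actions[s], key=lambda a: Q[(s, a)])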

34
Q

What are the three types of learning in multi-agent systems?

A

Cooperation
Competition
Individual