Module 8 Flashcards
What is reinforcement learning
- based on rewarding desired behaviors / punishing undesired ones
What is a reinforcement learning agent capable of
- it can perceive and interpret its environment, take actions, and learn through trial and error
Where can reinforcement learning operate?
- in any environment, as long as a clear reward signal can be applied
What is optimal policy
- the policy that yields the highest expected utility
What does a Markov decision process contain
- Possible world states S
- Set of models (transition models)
- Set of possible actions A
- reward function R(s,a)
- A policy π, the solution of the MDP
What is a state in MDP
- set of tokens that represent every state the agent can be in
What is a model / transition model in MDP
- Gives an action’s effect in a state
How is the transition model defined
- defined by T(S, a, S')
- in state S, taking action a ends in state S'
How does the model differ for stochastic actions?
- add a probability P(S' | S, a): the probability of ending in S' given state S and action a (see the sketch below)
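A minimal sketch of how such a stochastic transition model could be stored, using a made-up two-state fragment (the state names and probabilities are illustrative only):

```python
# T[s][a] maps each possible next state s' to P(s' | s, a).
T = {
    "s1": {
        "right": {"s2": 0.8, "s1": 0.1, "s3": 0.1},  # intended move succeeds 80% of the time
        "up":    {"s3": 0.9, "s1": 0.1},
    },
}

def transition_prob(T, s, a, s_next):
    """Return P(s' | s, a), or 0.0 if the transition was never defined."""
    return T.get(s, {}).get(a, {}).get(s_next, 0.0)

print(transition_prob(T, "s1", "right", "s2"))  # 0.8
```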
What is the key feature of the Markov property
- the effects of an action taken in a state depend only on that state, not on prior history
What is an action in MDP
- the set of all possible actions
- A(s) defines the set of actions that can be taken in state s
What is a reward in MDP
- a real-valued reward function
- R(s) indicates the reward for being in state s
- R(s, a) indicates the reward for being in state s after taking action a
- R(s, a, s') indicates the reward for ending in state s' from s after action a
What is policy in MDP
- solution to the MDP
- maps states to actions
- indicates the action a to be taken while in state s
What do MDP solutions usually involve?
dynamic programming
- recursively breaking a problem into pieces while remembering optimal solutions to each piece
How is the quality of a policy measured
- measured through expected utility
- the optimal policy (highest expected utility) is denoted π*
What is the goal of MDP and what role does RL play
- Goal: maximize cumulative reward in the long term
- RL: transitions and rewards are usually not known in advance
- how to change the policy given experience
- how to explore the environment
Describe Episodic vs continuing tasks in MDP (optimality/horizon)?
Episodic
- finite horizon: the game ends after N steps
- the optimal policy depends on N, which is harder to analyze
- the policy depends on time, i.e., it is nonstationary
Continuing tasks
- infinite horizon: no time limit
- the optimal action depends only on the current state and is stationary
What are additive rewards
- the utility of a state sequence is the plain (undiscounted) sum of its rewards
- this yields infinite utilities for continuing tasks
What are discounted rewards
- the utility is a sum of rewards weighted by a discount factor γ, with 0 < γ ≤ 1; γ describes an agent's preference for current rewards over future rewards
- when γ is close to 0, rewards in the distant future are insignificant
- when γ is close to 1, the agent is more willing to wait for long-term rewards
- when γ is exactly 1, discounted rewards reduce to the special case of purely additive rewards (see the sketch below)
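A small sketch of computing a discounted return over a finite reward sequence (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 0, 0, 10]                  # hypothetical rewards from one trial
print(discounted_return(rewards, 0.9))   # 1 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, 1.0))   # gamma = 1 reduces to additive rewards: 11
```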
What is the utility of the state
- the expected reward for the next transition plus the discounted utility of the next state, assuming the agent chooses optimally
- given by the Bellman equation (written out below)
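Written out, the Bellman equation for the state utility (assuming a reward of the form R(s) and discount factor γ) is:

```latex
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} T(s, a, s')\, U(s')
```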
What is the state value function
- denoted U^π(s)
- the expected return when starting in s and following π
What is the state-action value function
- denoted Q^π(s, a), a.k.a. the Q-function
- the expected return when starting in s, performing a, and then following π
What are value functions useful for
- useful for finding the optimal policy
- can be estimated from experience
- pick the best action using the Q-function (see the sketch below)
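A minimal sketch of picking the best action from an estimated Q-function stored as a dict of dicts (the states, actions, and values are hypothetical):

```python
# Q[s][a] holds the estimated return for taking action a in state s and following the policy after.
Q = {"s1": {"up": 0.4, "right": 0.7},
     "s2": {"up": 0.1, "right": 0.3}}

def greedy_action(Q, s):
    """Choose the action with the highest estimated Q-value; no model T or R is needed."""
    return max(Q[s], key=Q[s].get)

print(greedy_action(Q, "s1"))  # 'right'
```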
How does RL differ from MDP
- Don’t know transition model T or Reward R
- must try actions and states to learn
What are the basic ideas of RL
Exploration - you have to try unknown actions to get info
Exploitation - you have to use what you know
Sampling - you may need to repeat many times to get good estimates
Generalization - what you learn in one state may apply to others
How do you receive feedback in RL, and what is the agent's utility in RL
- receive feedback in form of rewards
- utility is reward function
Offline vs Online
- solving a known MDP is done offline; RL is learned online through interaction
What is the idea of model based learning
- the agent uses the transition model of the environment to make decisions
- Assumes learned model is correct
- learns an approximate model based on experiences
What is step 1 of MBL
Learn an empirical MDP model
- count outcomes s' for each (s, a)
- normalize to get an estimate of T(s, a, s')
- record each reward R(s, a, s') as it is experienced (see the sketch below)
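A sketch of step 1, assuming experience arrives as (s, a, r, s') tuples; the experience data below is made up:

```python
from collections import defaultdict

# Hypothetical experience gathered while acting: (s, a, reward, s') tuples.
experience = [("s1", "right", 0, "s2"), ("s1", "right", 0, "s2"),
              ("s1", "right", -1, "s1"), ("s2", "up", 10, "s3")]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
R_hat = {}                                       # R_hat[(s, a, s')] = observed reward

for s, a, r, s_next in experience:
    counts[(s, a)][s_next] += 1
    R_hat[(s, a, s_next)] = r

# Normalize the counts to get the empirical transition model T_hat(s, a, s').
T_hat = {(s, a): {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
         for (s, a), outcomes in counts.items()}

print(T_hat[("s1", "right")])   # {'s2': 0.666..., 's1': 0.333...}
```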
What is step 2 of MBL
- solve the learned MDP (see the value iteration sketch below)
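One common way to do step 2 is value iteration on the learned model; a minimal sketch assuming the T_hat and R_hat structures built in the step 1 sketch above:

```python
def value_iteration(states, actions, T_hat, R_hat, gamma=0.9, iters=100):
    """Repeatedly apply the Bellman update using the learned (approximate) model."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        # For each state, back up the best action's expected discounted value.
        U = {s: max(sum(p * (R_hat.get((s, a, s2), 0.0) + gamma * U.get(s2, 0.0))
                        for s2, p in T_hat.get((s, a), {}).items())
                    for a in actions)
             for s in states}
    return U
```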
Pros and Cons of MBL
Pro - makes efficient use of experiences
Con - may not scale to large state spaces
- learns model one state-action pair at a time
What is the simplified task of passive reinforcement learning
- Policy evaluation
- Input: a fixed policy π(s)
- the agent tries to learn the utility U^π(s)
- the transition model and rewards are not known; the goal is to learn the state values
Run through passive reinforcement learning
- the agent executes a set of trials in the environment using the policy
- the agent starts in the initial state and reaches one of the terminal states
- the agent's percepts supply the current state and the reward for the transition that reached that state
What is the utility in Passive reinforcement learning
- defined as the expected sum of (discounted) rewards obtained if the policy is followed
What is the goal of direct evaluation
- compute the value of each state under π
What is the idea of Direct Evaluation
Average observed sample values
- act according to π
- write down the sum of discounted rewards observed from each visit to a state
- average those samples (see the sketch below)
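A sketch of direct evaluation, assuming each trial has already been turned into (state, discounted return from that visit onward) samples; the numbers are made up:

```python
from collections import defaultdict

# (state, observed discounted return from that visit onward) samples from several trials.
samples = [("s1", 8.2), ("s1", 7.6), ("s2", 10.0), ("s2", 9.4), ("s2", 9.7)]

totals, visits = defaultdict(float), defaultdict(int)
for s, ret in samples:
    totals[s] += ret
    visits[s] += 1

# Each state's value estimate is just the average of its observed returns.
U_hat = {s: totals[s] / visits[s] for s in totals}
print(U_hat)   # {'s1': 7.9, 's2': 9.7}
```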
What does direct evaluation do to RL
- reduces it to standard supervised learning on (state, observed return) pairs
Pros and cons of direct evaluation
Pros
- Easy to understand
- No knowledge of T, R required
- eventually computes correct avg values using just sample transitions
Cons
- wastes information about state connections
- each state learned separately
- violates Bellman equations
- slow
Why not use policy evaluation
- although it exploits the connections between states, it still needs T and R
What is the idea of sample based policy evaluation
take samples of outcomes by doing the action and then average
What is the idea of Temporal difference learning TDL
- update U(s) each time we experience a transition
- likely outcomes will contribute to updates more often
How does TDL learn
- the policy is still fixed; we are still evaluating it
- keep a running average: move values toward the value of the successor state (see the TD update sketch below)
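A minimal sketch of the TD(0) update applied to a single observed transition; the learning rate alpha and the example values are assumptions:

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move U[s] toward the sample value r + gamma * U[s'] (an exponential running average)."""
    sample = r + gamma * U[s_next]
    U[s] += alpha * (sample - U[s])

U = {"s1": 0.0, "s2": 5.0}
td_update(U, "s1", r=1.0, s_next="s2")   # sample = 1 + 0.9 * 5 = 5.5
print(U["s1"])                           # 0.55 (moved 10% of the way toward 5.5)
```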
Problems of TD value learning and solution
- the values alone cannot be turned into a new policy (acting on them would require T and R)
- solution: learn Q-values instead, which makes action selection model-free (see the sketch below)
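A sketch of the Q-learning update this points to: it learns Q(s, a) directly from sampled transitions, so the greedy action can be chosen without T or R (the parameters and data layout are assumptions):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning: move Q[s][a] toward r + gamma * max over a' of Q[s'][a']."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```

Acting greedily with respect to the learned Q then yields a policy without ever needing the transition model.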
Known MDP vs Unknown MDP
Known MDP
- offline solution
- policy evaluation
Unknown MDP, model-based
- a fixed policy is evaluated on an approximate learned MDP
Unknown MDP, model-free
- evaluate a fixed policy via value learning
- Q-learning
- passive RL
- direct evaluation
- TD learning