Module 8 Flashcards

1
Q

What is reinforcement learning?

A
  • learning based on rewarding desired behaviors and punishing undesired ones
2
Q

What is a reinforcement learning agent capable of?

A
  • able to perceive and interpret its environment, take actions, and learn through trial and error
3
Q

Where can reinforcement learning operate?

A

In any environment, as long as a clear reward can be applied

4
Q

What is an optimal policy?

A
  • the policy that yields the highest expected utility
5
Q

What does a Markov decision process contain?

A
  • A set of possible world states S
  • A set of models (transition models)
  • A set of possible actions A
  • A reward function R(s, a)
  • A policy, the solution of the MDP
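
A minimal sketch of how these pieces might be represented in Python (the tiny two-state world, its actions, and the reward numbers are hypothetical, only to make the components concrete):

    # Hypothetical two-state MDP, purely for illustration.
    states = ["s0", "s1"]                                # possible world states S
    actions = {"s0": ["stay", "go"], "s1": ["stay"]}     # A(s): actions allowed in state s

    # (Deterministic) transition model: maps (s, a) to the resulting state s'
    transition = {("s0", "stay"): "s0", ("s0", "go"): "s1", ("s1", "stay"): "s1"}

    # Reward function R(s, a)
    reward = {("s0", "stay"): 0.0, ("s0", "go"): -1.0, ("s1", "stay"): 10.0}

    # A policy maps each state to an action; finding a good policy is the solution of the MDP.
    policy = {"s0": "go", "s1": "stay"}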
6
Q

What is a state in an MDP?

A
  • set of tokens that represent every state the agent can be in
7
Q

What is a model / transition model in an MDP?

A
  • Gives an action’s effect in a state
8
Q

How is the transition model defined?

A
  • defined by T(S, a, S’)

- taking action a in state S ends in state S’

9
Q

How does the model differ for stochastic actions?

A

Add a probability P(S’ | S, a) - the probability of reaching S’ given S and a
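
A small sketch of how a stochastic transition model could be stored and sampled in Python (the states, actions, and probabilities are made-up examples):

    import random

    # P(s' | s, a): for each (state, action) pair, a distribution over next states.
    P = {
        ("s0", "go"):   {"s1": 0.8, "s0": 0.2},   # the action sometimes fails and the agent stays put
        ("s0", "stay"): {"s0": 1.0},
    }

    def sample_next_state(s, a):
        """Draw s' according to P(s' | s, a)."""
        next_states, probs = zip(*P[(s, a)].items())
        return random.choices(next_states, weights=probs, k=1)[0]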

10
Q

What is the key feature of the Markov property?

A

The effects of an action taken in a state depend only on that state, not on prior history

11
Q

What is an action in an MDP?

A
  • the set of all possible actions

- A(s) defines the set of actions that can be taken in state s

12
Q

What is a reward in an MDP?

A
  • a real-valued reward function
  • R(s) indicates the reward for being in state s
  • R(s, a) indicates the reward for being in state s after taking action a
  • R(s, a, s’) indicates the reward for reaching state s’ from s after taking action a
13
Q

What is a policy in an MDP?

A
  • the solution to the MDP
  • a mapping from states S to actions a
  • indicates the action a to be taken while in state S
14
Q

What do MDP solutions usually involve?

A

dynamic programming

- recursively breaking a problem into pieces while remembering optimal solutions to each piece
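
As one concrete instance of this idea, here is a rough value-iteration sketch; it assumes dictionary-style states and actions and a stochastic transition table P like those sketched earlier, plus a reward function R(s, a, s'). All names are assumptions of the sketch, not the course's code:

    def value_iteration(states, actions, P, R, gamma=0.9, iterations=100):
        """Repeatedly apply the Bellman update, reusing the utilities computed so far."""
        # Assumes every state has at least one available action.
        U = {s: 0.0 for s in states}
        for _ in range(iterations):
            U = {s: max(sum(p * (R(s, a, s2) + gamma * U[s2])
                            for s2, p in P[(s, a)].items())
                        for a in actions[s])
                 for s in states}
        return U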

15
Q

How is the quality of a policy measured?

A
  • measured through its expected utility

- the optimal policy is denoted π*
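
In symbols (a standard formulation, where U^π(s) is the expected utility of executing π starting from s, defined formally on a later card):

    \pi^* = \operatorname*{arg\,max}_{\pi} U^{\pi}(s)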

16
Q

What is the goal of an MDP, and what role does RL play?

A

Goal - maximize cumulative reward in the long term
RL - the transition model and rewards are usually not available, so the key questions are:
- how to change the policy given experience
- how to explore the environment

17
Q

Describe episodic vs. continuing tasks in an MDP (optimality/horizon)?

A

Episodic

  • finite horizon = the game ends after N steps
  • the optimal policy depends on N - harder to analyze
  • the policy depends on time = nonstationary

Continuing tasks

  • infinite horizon = no time limit
  • optimal action depends on current state and is stationary
18
Q

What are additive rewards?

A
  • utility is the plain sum of rewards, which can give an infinite value for continuing tasks
19
Q

What are discounted rewards?

A

where γ (0 < γ ≤ 1) is the discount factor, describing the preference of an agent for current rewards over future rewards

when γ is close to 0 - rewards in the distant future are insignificant

when γ is close to 1 - the agent is more willing to wait for long-term rewards

when γ is exactly 1 - discounted rewards reduce to the special case of purely additive rewards
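
Using the R(s) reward convention, the discounted utility of a state sequence is (a standard formula; with γ = 1 it reduces to the purely additive case from the previous card):

    U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)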

20
Q

What is the utility of a state?

A
  • the expected reward for the next transition + the discounted utility of the next state, assuming the agent chooses the optimal action
  • given by the Bellman equation
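
One common form of the Bellman equation, written with the R(s, a, s') reward convention and transition probabilities P(s' | s, a):

    U(s) = \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, U(s')\bigr]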
21
Q

What is the state value function?

A

denoted U^π(s)

- the expected return when starting in s and following π

22
Q

What is the state-action value function?

A

denoted Q^π(s, a), also known as the Q-function

- the expected return when starting in s, performing a, and then following π
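
Written out (a standard formulation of the two definitions above):

    U^{\pi}(s) = E\Bigl[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\Big|\, s_0 = s,\ \pi\Bigr]
    \qquad
    Q^{\pi}(s, a) = E\Bigl[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Bigr]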

23
Q

What are value functions useful for?

A

useful for finding the optimal policy

  • can be estimated from experience
  • pick the best action using the Q-function
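
Concretely, once Q-values are known (or estimated), greedy action selection needs no transition model at all:

    \pi(s) = \operatorname*{arg\,max}_{a} Q(s, a)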
24
Q

How does RL differ from an MDP?

A
  • the transition model T and reward function R are not known

- must try actions and visit states to learn them

25
Q

What are the basic ideas of RL?

A

Exploration - you have to try unknown actions to get info
Exploitation - you have to use what you know
Sampling - you may need to repeat many times to get good estimates
Generalization - what you learn in one state may apply to others

26
Q

How does the agent receive feedback in RL, and what is the agent's utility?

A
  • feedback is received in the form of rewards

- the agent's utility is defined by the reward function

27
Q

Offline vs online?

A

Offline planning corresponds to solving a known MDP; online learning corresponds to RL

28
Q

What is the idea of model-based learning?

A
  • the agent uses the transition model of the environment to make decisions
  • assumes the learned model is correct
  • learn an approximate model based on experiences
29
Q

What is step 1 of MBL (model-based learning)?

A

Learn an empirical MDP model

  • count outcomes s’ for each (s, a)
  • normalize to give an estimate of T(s, a, s’)
  • discover each reward when we experience (s, a, s’)
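
A rough Python sketch of this counting-and-normalizing step (the data structures and function names are hypothetical):

    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times that outcome was seen
    rewards = {}                                     # rewards[(s, a, s')] = reward observed on that transition

    def record(s, a, s2, r):
        """Record one experienced transition (s, a, s', r)."""
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

    def estimated_T(s, a):
        """Normalize the counts to get an estimate of T(s, a, s')."""
        total = sum(counts[(s, a)].values())
        return {s2: n / total for s2, n in counts[(s, a)].items()}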
30
Q

What is step 2 of MBL?

A
  • solve the learned MDP
31
Q

Pros and Cons of MBL

A

Pro - makes efficient use of experiences
Con - may not scale to large state spaces
- learns model one state-action pair at a time

32
Q

What is the simplified task of passive reinforcement learning?

A
Policy evaluation
- input - a fixed policy π(s)
- the agent tries to learn the utility U^π(s)
- the transition model and reward are not known
- the goal is to learn the state values
33
Q

Run through passive reinforcement learning

A
  1. The agent executes a set of trials in the environment using its policy
  2. In each trial, the agent starts in the initial state and continues until it reaches one of the terminal states
  3. The agent's percepts supply the current state and the reward for the transition that reached that state
34
Q

What is the utility in passive reinforcement learning?

A
  • defined as the expected sum of (discounted) rewards obtained if the policy is followed
35
Q

What is the goal of direct evaluation?

A

Compute the value of each state under π

36
Q

What is the idea of direct evaluation?

A

Average observed sample values

  • act according to π
  • write down the sum of discounted rewards observed after each visit to a state
  • average those samples
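
A minimal sketch of direct evaluation over a batch of recorded episodes, where each episode is assumed to be a list of (state, reward) pairs (the names and data layout are assumptions of the sketch):

    from collections import defaultdict

    def direct_evaluation(episodes, gamma=0.9):
        """Average the observed discounted returns that follow each visit to a state."""
        returns = defaultdict(list)
        for episode in episodes:               # episode = [(s0, r0), (s1, r1), ...]
            G = 0.0
            for s, r in reversed(episode):     # accumulate the discounted return backwards
                G = r + gamma * G
                returns[s].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}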
37
Q

What does direct evaluation do to RL?

A

reduces it to standard supervised learning on pairs of states and their observed discounted reward sums

38
Q

Pros and cons of direct evaluation

A

Pros

  • Easy to understand
  • No knowledge of T, R required
  • eventually computes correct avg values using just sample transitions

Cons

  • wastes information about state connections
  • each state learned separately
  • violates Bellman equations
  • slow
39
Q

Why not use policy evaluation?

A

It still needs T and R, even though it exploits the connections between states

40
Q

What is the idea of sample-based policy evaluation?

A

Take samples of outcomes s’ by doing the action, then average them

41
Q

What is the idea of temporal difference learning (TDL)?

A
  • update U(s) each time we experience a transition

- likely outcomes will contribute to updates more often

42
Q

How does TDL learn?

A
  • the policy is still fixed; we are still evaluating it

- take a running average - move values toward the value of the successor state
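
A sketch of this running-average (TD) update, where alpha is a hypothetical learning-rate parameter:

    def td_update(U, s, r, s2, alpha=0.1, gamma=0.9):
        """Move U(s) a little toward the one-step sample r + gamma * U(s')."""
        sample = r + gamma * U.get(s2, 0.0)
        U[s] = (1 - alpha) * U.get(s, 0.0) + alpha * sample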

43
Q

Problems of TD value learning and solution?

A
  • cannot turn the learned values into a new policy (a one-step lookahead would still need T and R)

- solution - learn Q-values instead, which makes action selection model-free
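
The Q-value fix, sketched in the same style (the Q dictionary keyed by (state, action) and the learning rate alpha are assumptions of this sketch):

    def q_update(Q, s, a, r, s2, actions_in_s2, alpha=0.1, gamma=0.9):
        """Q-learning: move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
        best_next = max((Q.get((s2, a2), 0.0) for a2 in actions_in_s2), default=0.0)
        sample = r + gamma * best_next
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample

    # Action selection is then model-free: in state s, pick the a with the largest Q(s, a).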

44
Q

Known MDP vs Unknown MDP

A

Known MDP

  • offline solution
  • policy evaluation

Unknown MDP, model-based
- a fixed policy evaluated on the approximate (learned) MDP

Unknown MDP, model-free

  • evaluate a fixed policy via value learning
  • Q-learning
  • passive RL (PRL)
  • direct evaluation (DE)
  • temporal difference learning (TDL)