Intro - Lectures 1 and 2 Flashcards
What are the three types of machine learning and how are they structured/represented?
Supervised Learning - Function approximation; given example pairs (x, y), learn f so that y = f(x) and use it to predict y for new x
Unsupervised Learning - Clustering or description; given only x, find a concise description f(x) of the data's structure
Reinforcement Learning - Superficially resembles supervised learning, but is a method for decision making. Instead of being given x and y pairs, you are given x and a reinforcement signal z, and must learn both f and the y values for y = f(x)
What are the components of a Markov Decision Process and some forms they can take?
States: S
Model: T(s,a,s') ~ Pr(s' | s,a)
Actions: A(s), A
Reward: R(s), R(s,a), R(s,a,s')
Policy: Pi(s) -> a
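A minimal sketch of how these components might be written down in Python for a hypothetical 3-state, 2-action MDP; all names and numbers below are illustrative, not from the lectures:

```python
# Hypothetical 3-state, 2-action MDP; every value here is illustrative.
states = [0, 1, 2]
actions = [0, 1]                      # A
gamma = 0.9                           # discount rate

# Model T(s, a, s') ~ Pr(s' | s, a), stored as {(s, a): {s': probability}}
T = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {1: 1.0},
    (1, 0): {2: 1.0},
    (1, 1): {0: 0.5, 2: 0.5},
    (2, 0): {2: 1.0},
    (2, 1): {2: 1.0},
}

# Reward R(s): reward for being in a state
R = {0: 0.0, 1: 0.0, 2: 1.0}

# A deterministic policy Pi(s) -> a
pi = {0: 1, 1: 0, 2: 0}
```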
What is the Markovian property?
Only the present matters: the next state depends only on the current state (and action), not on how you got there
How can you get around the Markovian property when past actions/states do matter?
Include all necessary/relevant past information in the current state, e.g. if the last two positions matter, define the state as the pair of the last two positions
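A tiny sketch of that idea, assuming a hypothetical problem where the previous two observations matter; the helper name markov_state is made up for illustration:

```python
from collections import deque

# Hypothetical: the last 2 observations are relevant, so fold them into the state.
history = deque(maxlen=2)

def markov_state(observation):
    """Return an augmented state containing the relevant recent history,
    so the next state depends only on this (current) augmented state."""
    history.append(observation)
    return tuple(history)
```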
What is the solution to an MDP?
The policy, a function that maps states to actions: Pi(s) -> a
What is the MDP policy Pi*?
Pi* is the optimal policy to maximize long-term rewards
What is the difference between planning and RL policy?
Planning aims to develop a concrete (multi-action) plan to achieve an objective. An RL policy instead asks "in each state, what action should I take now?"
What is one issue with delayed rewards in MDPs? Hint this problem has a name.
Minor changes matter and we must determine which states and actions resulted in the outcomes we saw. This is referred to as the (temporal) credit assignment problem.
What assumptions are made in the sequence of rewards for MDPs?
Infinite horizon (stationarity)
Utility of sequences (stationary preferences): if U(s0, s1, s2, …) > U(s0, s1', s2', …), then U(s1, s2, …) > U(s1', s2', …)
What is the purpose of gamma in an MDP?
Gamma is the discount rate, gamma in [0.0, 1.0), used to guarantee that the infinite sum of rewards converges
What is the bounded sum of rewards for an MDP given the max reward R_max and the discount rate gamma?
R_max/(1-gamma)
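A short derivation of this bound, assuming every reward is at most R_max and 0 <= gamma < 1, using the geometric series:

```latex
\sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = R_{\max} \sum_{t=0}^{\infty} \gamma^{t}
  = \frac{R_{\max}}{1 - \gamma}
```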
What is the difference between utility and reward?
Utility is the long-term expected value of an action or state (it accounts for all delayed future rewards)
Reward is the immediate payoff received from taking an action
What is the Bellman equation?
The Bellman equation describes the utility of a state in a discounted MDP
U(s) = R(s) + gamma * max_a Sum_s' [ T(s,a,s') * U(s') ]
How can we solve Bellman's equation?
Value Iteration or Policy Iteration (the max operator makes the n equations non-linear, so they cannot be solved directly as a linear system)
How do you perform value iteration?
Start with arbitrary utilities
Update utilities based on neighbors using Bellman’s equation
Repeat until convergence
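A minimal Python sketch of value iteration for an MDP stored like the example above (transition dictionary T, state rewards R, discount gamma); the tolerance and variable names are assumptions for illustration:

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Sketch of value iteration; T and R follow the illustrative
    dictionary layout used earlier, and tol is an arbitrary stopping threshold."""
    U = {s: 0.0 for s in states}          # start with arbitrary utilities
    while True:
        U_new = {}
        for s in states:
            # Bellman update: U(s) = R(s) + gamma * max_a Sum_s' T(s,a,s') * U(s')
            best = max(
                sum(p * U[s2] for s2, p in T[(s, a)].items())
                for a in actions
            )
            U_new[s] = R[s] + gamma * best
        # repeat until the utilities stop changing (convergence)
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

# Example use with the illustrative MDP defined earlier:
# U = value_iteration(states, actions, T, R, gamma)
```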