Intro - Lectures 1 and 2 Flashcards
What are the three types of machine learning and how are they structured/represented?
Supervised Learning - Function approximation; given example pairs (x, y), learn f so that y = f(x) and use it to predict y for new x
Unsupervised Learning - Clustering or description; given only x, find a concise description f(x) of the data's structure
Reinforcement Learning - Superficially resembles supervised learning, but is a method for decision making. Instead of being given x and y pairs, you are given x and a reinforcement signal z, and must learn both f and the y values for y = f(x)
What are the components of a Markov Decision Process and some forms they can take?
States: S
Model: T(s,a,s') ~ Pr(s' | s,a)
Actions: A(s), A
Reward: R(s), R(s,a), R(s,a,s')
Policy: Pi(s) -> a
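A minimal sketch of how these components might be written down in Python for a hypothetical 3-state, 2-action MDP; all names and numbers below are illustrative, not from the lectures:

```python
# Hypothetical 3-state, 2-action MDP; every value here is illustrative.
states = [0, 1, 2]
actions = [0, 1]                      # A
gamma = 0.9                           # discount rate

# Model T(s, a, s') ~ Pr(s' | s, a), stored as {(s, a): {s': probability}}
T = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {1: 1.0},
    (1, 0): {2: 1.0},
    (1, 1): {0: 0.5, 2: 0.5},
    (2, 0): {2: 1.0},
    (2, 1): {2: 1.0},
}

# Reward R(s): reward for being in a state
R = {0: 0.0, 1: 0.0, 2: 1.0}

# A deterministic policy Pi(s) -> a
pi = {0: 1, 1: 0, 2: 0}
```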
What is the Markovian property?
Only the present matters: the next state depends only on the current state (and action), not on how you got there
How can you get around the Markovian property when past actions/states do matter?
Include all necessary/relevant past information in the current state, e.g. if the last two positions matter, define the state as the pair of the last two positions
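A tiny sketch of that idea, assuming a hypothetical problem where the previous two observations matter; the helper name markov_state is made up for illustration:

```python
from collections import deque

# Hypothetical: the last 2 observations are relevant, so fold them into the state.
history = deque(maxlen=2)

def markov_state(observation):
    """Return an augmented state containing the relevant recent history,
    so the next state depends only on this (current) augmented state."""
    history.append(observation)
    return tuple(history)
```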
What is the solution to an MDP?
The policy, a function that maps states to actions: Pi(s) -> a
What is the MDP policy Pi*?
Pi* is the optimal policy to maximize long-term rewards
What is the difference between planning and RL policy?
Planning aims to develop a concrete (multi-action) plan to achieve an objective. An RL policy instead asks "in each state, what action should I take now?"
What is one issue with delayed rewards in MDPs? Hint this problem has a name.
Minor changes matter and we must determine which states and actions resulted in the outcomes we saw. This is referred to as the (temporal) credit assignment problem.
What assumptions are made in the sequence of rewards for MDPs?
Infinite horizon (stationarity)
Utility of sequences (stationary preferences): if U(s0, s1, s2, …) > U(s0, s1', s2', …), then U(s1, s2, …) > U(s1', s2', …)
What is the purpose of gamma in an MDP?
Gamma is the discount rate, gamma in [0.0, 1.0), used to guarantee that the infinite sum of rewards converges
What is the bounded sum of rewards for an MDP given the max reward R_max and the discount rate gamma?
R_max/(1-gamma)
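A short derivation of this bound, assuming every reward is at most R_max and 0 <= gamma < 1, using the geometric series:

```latex
\sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = R_{\max} \sum_{t=0}^{\infty} \gamma^{t}
  = \frac{R_{\max}}{1 - \gamma}
```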
What is the difference between utility and reward?
Utility is the long-term expected value of an action or state (it accounts for all delayed future rewards)
Reward is the immediate payoff received from taking an action
What is the Bellman equation?
The Bellman equation describes the utility of a state in a discounted MDP
U(s) = R(s) + gamma * max_a Sum_s' [ T(s,a,s') * U(s') ]
How can we solve Bellman's equation?
Value Iteration or Policy Iteration (the max operator makes the n equations non-linear, so they cannot be solved directly as a linear system)
How do you perform value iteration?
Start with arbitrary utilities
Update utilities based on neighbors using Bellman’s equation
Repeat until convergence
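A minimal Python sketch of value iteration for an MDP stored like the example above (transition dictionary T, state rewards R, discount gamma); the tolerance and variable names are assumptions for illustration:

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Sketch of value iteration; T and R follow the illustrative
    dictionary layout used earlier, and tol is an arbitrary stopping threshold."""
    U = {s: 0.0 for s in states}          # start with arbitrary utilities
    while True:
        U_new = {}
        for s in states:
            # Bellman update: U(s) = R(s) + gamma * max_a Sum_s' T(s,a,s') * U(s')
            best = max(
                sum(p * U[s2] for s2, p in T[(s, a)].items())
                for a in actions
            )
            U_new[s] = R[s] + gamma * best
        # repeat until the utilities stop changing (convergence)
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

# Example use with the illustrative MDP defined earlier:
# U = value_iteration(states, actions, T, R, gamma)
```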