2 - MDP Planning Flashcards
What is policy iteration?
In policy iteration algorithms, you start with a random (or arbitrary) policy, find the value function of that policy (the policy evaluation step), then derive a new, improved policy from that value function (the policy improvement step), and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained by repeatedly applying the *Bellman operator*.
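Below is a minimal Python sketch of policy iteration, assuming a tiny made-up two-state, two-action MDP (the transition probabilities, rewards, and discount factor are illustrative, not taken from these flashcards):

```python
import numpy as np

# Illustrative two-state, two-action MDP; all numbers are made up for this sketch.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],                1: [(0.7, 0, 0.0), (0.3, 1, 2.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2

def evaluate_policy(policy, tol=1e-8):
    """Policy evaluation: repeatedly apply the Bellman operator for the fixed policy."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def improve_policy(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return [
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ]

policy = [0, 0]  # start from an arbitrary policy
while True:
    V = evaluate_policy(policy)
    new_policy = improve_policy(V)
    if new_policy == policy:  # no change: the policy is already optimal
        break
    policy = new_policy

print("optimal policy:", policy, "value function:", V)
```

Here evaluate_policy plays the role of the policy evaluation step and improve_policy the improvement step; the loop stops once the greedy policy no longer changes.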
What is value iteration?
In value iteration, you start with a random (or arbitrary) value function and repeatedly compute a new, improved value function until you reach the optimal value function. The optimal policy can then be derived easily from the optimal value function. This process is based on the *optimality Bellman operator*. Unlike the Bellman operator used in policy evaluation, the optimality Bellman operator contains a max operator, which makes it nonlinear and gives it different properties.
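A matching sketch of value iteration on the same made-up MDP (again, all numbers are illustrative); note the max over actions, which is where the optimality Bellman operator differs from the Bellman operator used in policy evaluation:

```python
import numpy as np

# Same illustrative two-state MDP as in the policy-iteration sketch (made-up numbers).
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],                1: [(0.7, 0, 0.0), (0.3, 1, 2.0)]},
}
gamma, n_states, n_actions = 0.9, 2, 2

V = np.zeros(n_states)  # start from an arbitrary value function
while True:
    # Optimality Bellman operator: max over actions of the expected one-step return.
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(n_actions))
        for s in range(n_states)
    ])
    converged = np.max(np.abs(V_new - V)) < 1e-8
    V = V_new
    if converged:
        break

# Derive the optimal (greedy) policy from the optimal value function.
policy = [
    max(range(n_actions),
        key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in range(n_states)
]
print("optimal value function:", V, "greedy policy:", policy)
```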
What does P(s’ | s,a) stand for?
This is the Transition Probability. It indicates the probability of reaching state s’ from state s given that you take the action a.
What does P(r | s,a) stand for?
This is the Reward Probability. It indicates the probability of receiving reward r, given that you take action a in state s.
What is pi(s) and what does it stand for?
pi(s): states (s) –> actions (a). This is the Policy, which can be either deterministic or stochastic.
How do we mathematically represent choosing the optimal pi(s)?
pi* = argmax over pi of V^pi. In other words, to find the optimal policy pi*, we choose the policy whose value function is maximal.
How do we represent a *sequence* of states and actions that *leads to* a desired end state?
We can represent our trajectory as Tau = (S0, A0, S1, A1, S2, A2, ...).
How do we calculate the Value V of a policy?
We sum our rewards from t=0 to infinity (or to our horizon), each weighted by the discount factor gamma (a value between 0 and 1) raised to the power t, for the given policy. Because these rewards are stochastic, we take the expected value of this sum of rewards. Thus, V^pi = E[ Summation from t=0 to infinity of (gamma^t * Reward_t) ].
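A small sketch of this computation, assuming some hypothetical reward sequences sampled under a policy (the numbers are invented): the value is estimated by averaging the discounted returns of the episodes.

```python
# Reward sequences from three illustrative episodes gathered under some policy pi
# (the numbers are made up purely to show the computation).
episodes = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, 0.0, 0.0, 3.0],
]
gamma = 0.9

def discounted_return(rewards, gamma):
    """Sum over t of gamma^t * Reward_t (t starts at 0)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Because the rewards are stochastic, V^pi is the *expected* return; here we
# estimate that expectation by averaging the returns of the sampled episodes.
V_estimate = sum(discounted_return(rw, gamma) for rw in episodes) / len(episodes)
print(V_estimate)
```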
What is a Discount Factor and how do we represent it?
The discount factor is gamma, usually a value between 0 and 1. It gives exponentially decreasing weight to rewards farther and farther into the future: a reward received t time steps from now is weighted by gamma raised to the power t.
What does the Q stand for in Q-Learning and what does it mean?
It stands for “Quality”. The optimal Q function, Q*(s,a), is the expected total reward received by an agent starting in state s, picking action a, and then behaving optimally afterwards. It indicates how good it is for an agent to pick action a while being in state s.
How do we calculate the Q(s,a) of a given policy pi?
Q^pi(s,a) is the expected total discounted reward obtained by starting in state s, taking action a, and then following pi: Q^pi(s,a) = E[ Summation (gamma^t * Reward_t) | S0 = s, A0 = a ]. Equivalently, it is the expected immediate reward for (s,a) plus gamma times the expected value V^pi of the next state, weighted by the transition probabilities P(s'|s,a).
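A short sketch of that one-step lookahead, assuming a made-up MDP and placeholder V^pi values (in practice V^pi would come from policy evaluation):

```python
# One-step lookahead: Q^pi(s,a) from an already-computed V^pi, on an illustrative MDP.
# P[s][a] is a list of (probability, next_state, reward) tuples; all numbers are made up.
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)], 1: [(0.2, 0, 0.0), (0.8, 1, 1.0)]},
    1: {0: [(1.0, 1, 2.0)],                1: [(0.7, 0, 0.0), (0.3, 1, 2.0)]},
}
gamma = 0.9
V_pi = {0: 4.7, 1: 9.1}  # placeholder values; in practice these come from policy evaluation

def q_value(s, a):
    """Q^pi(s,a) = expected immediate reward + gamma * expected V^pi of the next state."""
    return sum(p * (r + gamma * V_pi[s2]) for p, s2, r in P[s][a])

print({(s, a): round(q_value(s, a), 2) for s in P for a in P[s]})
```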
Is Q-learning a model-based or model-free algorithm?
Model-free: it does not require knowledge of the transition or reward probabilities; the agent learns directly from its interactions with the environment.
In reinforcement learning the goal of the agent is to discover the optimal ____? And what does this term mean?
Policy = what action to take in each state, so as to maximize the total reward received from the environment in response to the agent's actions.
How do we calculate Total Discounted Reward?
We multiply each reward by gamma raised to the power of its time step t and sum these terms from t=0 to infinity (or to our horizon): Total Discounted Reward = Summation (gamma^t * Reward_t).
What is the difference between *Offline* vs *Online* learning?
These are two fundamental approaches to solving MDPs. Both value iteration and policy iteration assume that the agent knows the MDP model of the world (i.e., the state-transition and reward probability functions). They can therefore be used by the agent to plan its actions offline, using knowledge of the environment before interacting with it. In Q-learning, by contrast, the agent improves its behavior online, learning from its history of interactions with the environment.
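A minimal sketch of online, model-free tabular Q-learning on a made-up two-state environment (the simulator and all constants below are illustrative assumptions, not from these flashcards):

```python
import random

# Minimal tabular Q-learning sketch on an illustrative two-state, two-action environment.
# The simulator below is made up; the point is the online update rule.
def step(s, a):
    """Environment simulator: returns (next_state, reward). Its dynamics are unknown to the agent."""
    if s == 0:
        return (1, 1.0) if (a == 1 and random.random() < 0.8) else (0, 0.0)
    return (1, 2.0) if a == 0 else (0, 0.0)

gamma, alpha, epsilon = 0.9, 0.1, 0.1  # discount factor, learning rate, exploration rate
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

s = 0
for _ in range(10000):
    # epsilon-greedy action selection from the current Q estimates
    if random.random() < epsilon:
        a = random.choice((0, 1))
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)])
    s2, r = step(s, a)
    # Online update: move Q(s,a) toward r + gamma * max over a' of Q(s',a')
    target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2

print(Q)
```

Unlike the policy-iteration and value-iteration sketches above, the agent here never reads the transition or reward probabilities; it only observes samples returned by the environment.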