13 Markov Decision Processes Flashcards
RL: what makes it different from other ML paradigms, and what are the challenges?
No direct supervision, only rewards from the environment; feedback may be delayed; time matters, as we are dealing with sequential data; no i.i.d. data, as the agent's actions inform the next data it receives.
Challenges: Exploration vs Exploitation tradeoff, Credit assignment
Model-free vs. Model-based
Model-Based RL algorithms are applied when the agent either
* has access to the true environment, i.e., the state transition function 𝑃 and the reward function 𝑅. This is usually not the case!
* uses a model of the environment: 𝑃, 𝑅.
Acting in the model environment is itself a Markov Decision Process.
* Now, the agent can plan by thinking ahead to obtain (or learn) an optimal policy before actually acting out the plan in the "real" environment.
* Upon interacting with the real environment, the model can be updated from observed data and will become more accurate over time.
In model-free RL, agents do not maintain an explicit model of the environment.
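A minimal sketch of the contrast, assuming a made-up tabular toy MDP (the sizes, transition array P, reward array R, and the sampled transition below are illustrative, not from the flashcards): the model-based agent plans by value iteration inside its model, while the model-free agent only updates Q-values from observed transitions.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (sizes chosen only for illustration).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # model: P[s, a, s']
R = rng.random((n_states, n_actions))                             # model: R[s, a]

# Model-based: plan inside the model (value iteration) before acting for real.
V = np.zeros(n_states)
for _ in range(100):
    V = np.max(R + gamma * P @ V, axis=1)        # Bellman optimality backup on the model

# Model-free: no P or R, only a single observed transition (s, a, r, s').
Q = np.zeros((n_states, n_actions))
s, a, r, s_next, alpha = 0, 1, 0.5, 2, 0.1       # made-up observed transition
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # Q-learning update
```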
Q-Learning Background 1: Monte Carlo Learning
Problems: If one bad action leads to a loss in an otherwise good episode, all the good moves are penalized. If only complex policies lead to a reward, the agent never learns. Episodes may be infinitely long, so the total reward cannot be sampled. MC learning is very sample-inefficient.
Temporal Credit Assignment Problem: given a sequence of actions and rewards, how do we assign credit or blame to each action?
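A minimal sketch of first-visit Monte Carlo value estimation on a single made-up episode (the state names, rewards, and gamma below are illustrative), showing why complete episodes are required: the return G can only be computed once the episode has terminated, and many such episodes are needed.

```python
import numpy as np

# One made-up, terminated episode as (state, reward) pairs.
gamma = 0.99
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]

returns = {}   # state -> list of sampled returns
G = 0.0
# Walk backwards through the episode, accumulating the discounted return G_t.
for t in reversed(range(len(episode))):
    s, r = episode[t]
    G = r + gamma * G
    if s not in {s2 for s2, _ in episode[:t]}:               # first-visit check
        returns.setdefault(s, []).append(G)

V = {s: float(np.mean(gs)) for s, gs in returns.items()}     # MC estimate of V(s)
print(V)   # averaging over many episodes is needed -> sample-inefficient
```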
Deep Q-Learning Networks (DQNs) with Discrete Actions
The optimal Q-function is approximated using a neural network. We learn the Q-values of the set of |A| actions and pick the maximum. The NN generalizes and can better handle large state spaces (e.g., a large grid game) as compared to tabular methods.
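A minimal PyTorch-style sketch of the idea, with made-up dimensions (4-dim state, |A| = 3) and without the replay buffer or target network of a full DQN: one forward pass yields Q-values for all actions, acting is an argmax, and learning regresses the chosen Q-value toward the TD target.

```python
import torch
import torch.nn as nn

# Q-network: state in, one Q-value per action out (dimensions are illustrative).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))

state = torch.randn(1, 4)
q_values = q_net(state)             # shape (1, |A|)
action = q_values.argmax(dim=1)     # greedy action = index of the maximum Q-value

# One TD update toward r + gamma * max_a' Q(s', a')  (made-up transition).
gamma, reward, done = 0.99, 1.0, False
next_state = torch.randn(1, 4)
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values * (1.0 - float(done))
chosen_q = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(chosen_q, target)
loss.backward()                     # an optimizer step would follow
```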
Policy Optimization
Methods in this family represent a policy explicitly as 𝜋𝜃(𝑎|𝑠).
They optimize the parameters 𝜃 directly by gradient ascent on the performance objective 𝐽(𝜋𝜃).
In a tabular setting with discrete actions, 𝜋𝜃 is an explicit vector of probabilities.
In deep RL, the policy is a deep neural network, either outputting probabilities or a specific action. This allows us to handle large state spaces with an NN.
Policy optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy. Each sample is used only once.
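A minimal sketch of one on-policy, policy-gradient (REINFORCE-style) update, with made-up dimensions and a made-up return G: 𝜋𝜃(𝑎|𝑠) is a small NN, and ascending 𝐽(𝜋𝜃) is done by descending −log 𝜋𝜃(𝑎|𝑠) · G.

```python
import torch
import torch.nn as nn

# Policy network pi_theta(a|s): 4-dim state -> probabilities over 2 actions (toy sizes).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 4)
dist = torch.distributions.Categorical(policy(state))
action = dist.sample()                     # stochastic action from the current policy
G = 1.0                                    # return observed for this sample (made up)

# Gradient ascent on J(pi_theta) == gradient descent on -log pi_theta(a|s) * G.
loss = -(dist.log_prob(action) * G).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
# On-policy: this sample is now stale; fresh data must be collected with the updated policy.
```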
Policy Optimization Methods
Characteristics
* We don't need the distribution of states or the environment dynamics. It is model-free.
* We learn a parametrized policy and select stochastic actions without consulting a value function. The stochastic policy class smooths out the problem.
* Parameters in deep RL are the weights of a function-approximating NN, learned with SGD.
* Policy gradient methods work well with continuous action spaces.
Use a softmax output for discrete actions or a Gaussian for continuous (but stochastic) actions.
Stochastic policies allow us to explore new actions, not just exploit known ones.
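A minimal sketch of the two stochastic policy heads mentioned above, with made-up dimensions (4-dim state, 3 discrete actions, 2-dim continuous action): a softmax/Categorical head for discrete actions and a Gaussian head for continuous actions; sampling from either distribution is what provides exploration.

```python
import torch
import torch.nn as nn

state = torch.randn(1, 4)   # toy 4-dim state

# Discrete actions: logits -> softmax -> Categorical distribution over |A| = 3 actions.
discrete_head = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
pi_discrete = torch.distributions.Categorical(logits=discrete_head(state))
a_discrete = pi_discrete.sample()           # sampling, not argmax -> exploration

# Continuous actions: NN outputs the Gaussian mean; the (log) std is a learned parameter.
mean_head = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
log_std = nn.Parameter(torch.zeros(2))
pi_continuous = torch.distributions.Normal(mean_head(state), log_std.exp())
a_continuous = pi_continuous.sample()       # stochastic 2-dim continuous action
```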