lecture 7 - reinforcement learning Flashcards
reinforcement learning questions
- what if we want to influence the user
- what intervention/action should we select
two actors
- the user (environment): the quantified selves
- the agent: software entity we aim to create. can observe the state of the user at time point t
- based on the state, the agent derives an action to perform
- we obtain a reward (at the next time point)
RL loop
- initial state S_t
- action selection A_t, based on S_t (by the agent)
- environment/user response: The environment responds to A_t by transitioning to new state S_{t+1} and providing a reward R_{t+1}
- feedback to agent
- prediction and learning: The agent makes predictions about future states and rewards, updates its policy based on the reward received, and learns to improve future actions.
- iteration
value function
we don't strive for immediate reward only, but also for the rewards we accumulate in the future
value function: γ
- the discount factor, with γ ∈ [0, 1]
- γ = 0 we only care about the immediate reward
- γ = 1 future rewards are equally important as the current reward
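a minimal sketch of the discounted return that the value function is the expectation of (standard notation, not spelled out on the card):

```latex
% discounted return: accumulated future rewards, weighted by the discount factor
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```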
policy
- maps a user state to an action (when to do what)
- this is what we try to find
RL balance
we should balance exploration and exploitation to learn the best policy
markov property
the probability of ending up in a state with a given reward at t+1 can be based on either
- the entire history
- previous state and action only
- the state has the Markov Property when both probabilities are equal for all rewards r and states s over all time points.
- The future state of a process depends only on the present state, not on the sequence of events that preceded it. (such as in chess)
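sketched in standard notation, the two probabilities that must coincide are:

```latex
% Markov property: conditioning on the full history equals conditioning on the last state and action
\Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t, A_t, R_t, \dots, R_1, S_0, A_0\}
  = \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t, A_t\}
```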
markov decision process (MDP)
- if the markov property is satisfied, we can model our problem as an MDP with a finite number of states
- transition probability from one state s to a state s’ when taking action a
- expected reward at t+1, given s, a, s’
- policy pi(a|s): assigns a probability to selecting action a in state s
- state-value function: expected value of state s, given that we follow policy pi thereafter
- action-value function: expected return if we select action a in state s and follow policy pi thereafter
MDP: transition probability
probability of transitioning to state s' at t+1, given state s and action a at t
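in standard MDP notation this reads:

```latex
p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}
```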
MDP: expected reward
the expected reward for the transition from s to s' (i.e., an expected value)
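a standard way to write this expected value:

```latex
r(s, a, s') = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right]
```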
MDP: policy pi(a|s)
assigns a probability to selecting action a in state s
- we want to find policies with the highest state-value function and action-value function over all states
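in symbols (standard notation):

```latex
\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}
```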
MDP: state-value function
- v
- the expected value of state s = the expected total return, given that we follow policy pi thereafter
- measures how “good” it is to be in a specific state, assuming the agent acts according to the policy π from that point onward.
- It evaluates the long-term potential of being in state s, without considering any specific action to be taken immediately.
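written out in the usual notation (assuming the discounted return G_t from above):

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]
```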
MDP: action-value function
- q
- the expected return of a policy if we select action a in state s
- measures how “good” it is to take a particular action a in a specific state s, considering not just the immediate reward from taking that action, but also the long-term rewards obtained by following the policy π afterward.
- It helps in evaluating the potential of an action in a given state, making it useful for decision-making.
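in the same notation:

```latex
q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right]
            = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]
```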
MDP: which policies do we want to find
- We are interested in finding the policy (or policies) π∗ that provides the highest state-value function in all states. It tells us the best expected return achievable from state s when following the optimal policy.
- Similarly, we define the optimal action-value function: q∗(s, a) given our optimal policies π∗. It helps in determining the best action to take in any given state.
- these maximize the expected return
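as a sketch in standard notation:

```latex
v_*(s) = \max_{\pi} v_\pi(s) \qquad q_*(s, a) = \max_{\pi} q_\pi(s, a)
```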
one step SARSA
- State-Action-Reward-State-Action
- on-policy reinforcement learning algorithm used to learn the value of state-action pairs
- The goal is to update the action value function q(s,a) based on the observed rewards and transitions.
SARSA update rule
- update the value of the state-action pair Q(S_t, A_t)
- actions are selected based on the Q(S_t, A_t) values
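the standard one-step SARSA update (α is the learning rate, not named on the card):

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
```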
Q(S, A)
the Q-value Q(S,A) represents the expected return (reward) for taking action A in state S and following a particular policy thereafter. The Q-values guide the agent in making decisions to maximize its cumulative rewards.
SARSA: ‘on-policy’
means that the actions used in the Q-value update are selected by the same policy the agent uses to behave (e.g. ε-greedy); we pick our actions in the same way at each step
SARSA algorithm
- initialize Q values
- initialize state
- select initial action
MAIN LOOP:
1. perform action A
2. increment time
3. observe new state S’
4. observe reward R
5. select new action A’
6. perform new action A’
7. update Q value
8. transition to new state and action
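a minimal Python sketch of this loop, assuming a tabular Q stored in a dict and a hypothetical environment with reset()/step() returning (next_state, reward, done); the names and interface are illustrative, not from the lecture:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # explore with probability epsilon, otherwise exploit the best known action
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                                 # initialize Q values
    for _ in range(episodes):
        S = env.reset()                                    # initialize state
        A = epsilon_greedy(Q, S, actions, epsilon)         # select initial action
        done = False
        while not done:
            S_next, R, done = env.step(A)                  # perform A, observe S' and R
            A_next = epsilon_greedy(Q, S_next, actions, epsilon)  # select new action A'
            # SARSA update: uses the action actually selected in S' (on-policy)
            Q[(S, A)] += alpha * (R + gamma * Q[(S_next, A_next)] - Q[(S, A)])
            S, A = S_next, A_next                          # transition to new state and action
    return Q
```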
Q-learning
- when evolving a policy with Q-learning, we do not perform the next action A’ before updating our Q-values (unlike SARSA)
- we just assume that we select the action with the highest Q-value in the next state
- off-policy approach
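the corresponding Q-learning update, written in standard notation:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```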
SARSA vs Q-learning
while in SARSA we updated our values for Q(S, A) based on the value of the selected action in the next state S′ using the same policy (e.g. ε-greedy), for Q-learning we directly select the action that has the highest value for the next state. This simplifies our algorithm
eligibility traces
since actions taken in the past might contribute to rewards received several steps later, we distribute credit to these actions over multiple steps.
Z_t(s, a) is included as a multiplicative factor in the update equations for SARSA and Q-learning
- If a state-action pair is more eligible (i.e., it has been applied more frequently and more recently in our history), the magnitude of its update is increased.
Z_t(s, a)
- if s = S_t and a = A_t: Z_t(s, a) = γλZ_{t−1}(s, a) + 1
- otherwise: Z_t(s, a) = γλZ_{t−1}(s, a)
- i.e., we add 1 to the eligibility trace of a state-action pair whenever that pair occurs, and only decay it otherwise. This ensures that the learning algorithm gives the strongest weight to the most recent decisions.
- λ and γ determine how quickly the eligibility trace decays. This decay allows older state-action pairs to gradually lose their influence on the updates, but not immediately.
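one common way to write the trace and how it scales the update (δ_t denotes the usual temporal-difference error; this notation is assumed, not from the card):

```latex
Z_t(s, a) =
\begin{cases}
  \gamma \lambda Z_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\
  \gamma \lambda Z_{t-1}(s, a)     & \text{otherwise}
\end{cases}
\qquad
Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, Z_t(s, a)
```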