lecture 7 - reinforcement learning Flashcards
reinforcement learning questions
- what if we want to influence the user
- what intervention/action should we select
two actors
- the user (environment): the quantified selves
- the agent: the software entity we aim to create; it can observe the state of the user at time point t
- based on the state, the agent derives an action to perform
- we obtain a reward (at the next time point)
RL loop
- initial state S_t
- action selection A_t, based on S_t (by the agent)
- environment/user response: The environment responds to A_t by transitioning to new state S_{t+1} and providing a reward R_{t+1}
- feedback to agent
- prediction and learning: The agent makes predictions about future states and rewards, updates its policy based on the reward received, and learns to improve future actions.
- iteration
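The loop above can be sketched in a few lines of Python. This is only a minimal sketch: the env.reset()/env.step() and agent methods are hypothetical names (loosely following the Gym convention), not interfaces defined in the lecture.

```python
# Minimal sketch of the agent-environment loop.
# The env and agent interfaces are assumptions, not from the lecture.
def run_episode(env, agent, max_steps=100):
    state = env.reset()                                  # initial state S_t
    for t in range(max_steps):
        action = agent.select_action(state)              # A_t, based on S_t
        next_state, reward, done = env.step(action)      # S_{t+1}, R_{t+1}
        agent.update(state, action, reward, next_state)  # learn from the feedback
        state = next_state                               # iterate
        if done:
            break
```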
value function
we don't strive for the immediate reward only, but also for the rewards we accumulate in the future
value function: γ
- discount factor, γ ∈ [0, 1]
- γ = 0: we only care about the immediate reward
- γ = 1: future rewards are as important as the current reward
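Putting the rewards and the discount factor together, and writing G_t for the accumulated (discounted) return from time point t, the standard definition is:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}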
policy
- maps a user state to an action (when to do what)
- this is what we try to find
RL balance
we should balance exploration and exploitation to learn the best policy
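A common way to strike this balance is ε-greedy action selection. A minimal sketch in Python, where the q_values table (a dict keyed by (state, action)) and the list of actions are hypothetical, not defined in the lecture:

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    # Explore with probability epsilon: pick a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Exploit otherwise: pick the action with the highest estimated value.
    return max(actions, key=lambda a: q_values[(state, a)])
```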
markov property
probability of ending up in a state with a reward at t+1, which can be based on either
- the entire history of states, actions, and rewards
- the previous state and action only
- the process has the Markov property when both probabilities are equal for all rewards r and states s over all time points
- The future state of a process depends only on the present state, not on the sequence of events that preceded it. (such as in chess)
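In symbols (the standard formulation): the process has the Markov property if, for all s’ and r,

P(S_{t+1} = s’, R_{t+1} = r | S_t, A_t) = P(S_{t+1} = s’, R_{t+1} = r | S_0, A_0, R_1, ..., S_{t-1}, A_{t-1}, R_t, S_t, A_t)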
markov decision process (MDP)
- if the markov property is satisfied, we can model our problem as an MDP with a finite number of states
- transition probability from one state s to a state s’ when taking action a
- expected reward at t+1, given s, a, s’
- policy pi(a|s): assigns a probability to each action a in state s
- state-value function: expected value of state s, given that we follow policy pi thereafter
- action-value function: expected return if we select action a in state s and follow policy pi thereafter
MDP: transition probability
probability of transitioning to state s’ at t+1, given state s and action a at t
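In the standard MDP notation:

p(s’ | s, a) = P(S_{t+1} = s’ | S_t = s, A_t = a)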
MDP: expected reward
the expected reward for the transition from s to s’ under action a (i.e., an expected value)
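In the standard notation:

r(s, a, s’) = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s’]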
MDP: policy pi(a|s)
assigns a probability to each action a in state s
- we want to find policies with the highest state-value function and action-value function over all states
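Since pi(a|s) is a probability distribution over the actions, Σ_a pi(a|s) = 1 for every state s; a deterministic policy is the special case that puts probability 1 on a single action per state.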
MDP: state-value function
- v
- the expected value of state s = the expected total return, given that we follow policy pi thereafter
- measures how “good” it is to be in a specific state, assuming the agent acts according to the policy π from that point onward.
- It evaluates the long-term potential of being in state s, without considering any specific action to be taken immediately.
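In symbols (the standard definition), using the return G_t from before:

v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]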
MDP: action-value function
- q
- the expected return of a policy if we select action a in state s
- measures how “good” it is to take a particular action a in a specific state s, considering not just the immediate reward from taking that action, but also the long-term rewards obtained by following the policy π afterward.
- It helps in evaluating the potential of an action in a given state, making it useful for decision-making.
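In symbols (the standard definition):

q_π(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]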
MDP: which policies do we want to find
- We are interested in finding the policy (or policies) π∗ that provide the highest state-value function in all states. This optimal state-value function tells us the best expected return achievable from state s when following an optimal policy.
- Similarly, we define the optimal action-value function: q∗(s, a) given our optimal policies π∗. It helps in determining the best action to take in any given state.
- these maximize the expected return
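In symbols (the standard definitions): v∗(s) = max_π v_π(s) and q∗(s, a) = max_π q_π(s, a), for all states s and actions a.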