lecture 7 - reinforcement learning Flashcards
reinforcement learning questions
- what if we want to influence the user
- what intervention/action should we select
two actors
- the user (environment): the quantified selves
- the agent: software entity we aim to create. can observe the state of the user at time point t
- based on the state, the agent derives an action to perform
- we obtain a reward (at the next time point)
RL loop
- initial state S_t
- action selection A_t, based on S_t (by the agent)
- environment/user response: The environment responds to A_t by transitioning to new state S_{t+1} and providing a reward R_{t+1}
- feedback to agent
- prediction and learning: The agent makes predictions about future states and rewards, updates its policy based on the reward received, and learns to improve future actions.
- iteration
value function
we don't strive for immediate reward only, but also for the rewards we accumulate in the future
value function: γ
- the discount factor, with γ ∈ [0, 1]
- γ = 0 we only care about the immediate reward
- γ = 1 future rewards are equally important as the current reward
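a minimal sketch of the discounted return that the value function is the expectation of (standard notation, not spelled out on the card):

```latex
% discounted return: accumulated future rewards, weighted by the discount factor
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```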
policy
- maps a user state to an action (when to do what)
- this is what we try to find
RL balance
we should balance exploration and exploitation to learn the best policy
markov property
the probability of ending up in a state with a given reward at t+1 can be based on either
- the entire history
- previous state and action only
- the state has the Markov Property when both probabilities are equal for all rewards r and states s over all time points.
- The future state of a process depends only on the present state, not on the sequence of events that preceded it. (such as in chess)
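sketched in standard notation, the two probabilities that must coincide are:

```latex
% Markov property: conditioning on the full history equals conditioning on the last state and action
\Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t, A_t, R_t, \dots, R_1, S_0, A_0\}
  = \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t, A_t\}
```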
markov decision process (MDP)
- if the markov property is satisfied, we can model our problem as an MDP with a finite number of states
- transition probability from one state s to a state s’ when taking action a
- expected reward at t+1, given s, a, s’
- policy pi(a|s): assigns a probability to selecting action a in state s
- state-value function: expected value of state s, given that we follow policy pi thereafter
- action-value function: expected return if we select action a in state s and follow policy pi thereafter
MDP: transition probability
probability of transitioning to state s' at t+1, given state s and action a at t
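in standard MDP notation this reads:

```latex
p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}
```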
MDP: expected reward
the expected reward for the transition from s to s' (i.e., an expected value)
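a standard way to write this expected value:

```latex
r(s, a, s') = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right]
```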
MDP: policy pi(a|s)
assigns a probability to selecting action a in state s
- we want to find policies with the highest state-value function and action-value function over all states
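in symbols (standard notation):

```latex
\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}
```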
MDP: state-value function
- v
- the expected value of state s = the expected total return, given that we follow policy pi thereafter
- measures how “good” it is to be in a specific state, assuming the agent acts according to the policy π from that point onward.
- It evaluates the long-term potential of being in state s, without considering any specific action to be taken immediately.
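written out in the usual notation (assuming the discounted return G_t from above):

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]
```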
MDP: action-value function
- q
- the expected return of a policy if we select action a in state s
- measures how “good” it is to take a particular action a in a specific state s, considering not just the immediate reward from taking that action, but also the long-term rewards obtained by following the policy π afterward.
- It helps in evaluating the potential of an action in a given state, making it useful for decision-making.
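in the same notation:

```latex
q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right]
            = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]
```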
MDP: which policies do we want to find
- We are interested in finding the policy (or policies) π∗ that provides the highest state-value function in all states. It tells us the best expected return achievable from state s when following the optimal policy.
- Similarly, we define the optimal action-value function: q∗(s, a) given our optimal policies π∗. It helps in determining the best action to take in any given state.
- these maximize the expected return
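as a sketch in standard notation:

```latex
v_*(s) = \max_{\pi} v_\pi(s) \qquad q_*(s, a) = \max_{\pi} q_\pi(s, a)
```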
one step SARSA
- State-Action-Reward-State-Action
- on-policy reinforcement learning algorithm used to learn the value of state-action pairs
- The goal is to update the action value function q(s,a) based on the observed rewards and transitions.
SARSA update rule
- update the value of the state-action pair Q(S_t, A_t)
- actions are selected based on the Q(S_t, A_t) values
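the standard one-step SARSA update (α is the learning rate, not named on the card):

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
```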
Q(S, A)
the Q-value Q(S,A) represents the expected return (reward) for taking action A in state S and following a particular policy thereafter. The Q-values guide the agent in making decisions to maximize its cumulative rewards.
SARSA: ‘on-policy’
means that the actions used in the Q-value update are selected by the same policy the agent uses to behave (e.g. ε-greedy); we pick our actions in the same way at each step
SARSA algorithm
- initialize Q values
- initialize state
- select initial action
MAIN LOOP:
1. perform action A
2. increment time
3. observe new state S’
4. observe reward R
5. select new action A’
6. perform new action A’
7. update Q value
8. transition to new state and action
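a minimal Python sketch of this loop, assuming a tabular Q stored in a dict and a hypothetical environment with reset()/step() returning (next_state, reward, done); the names and interface are illustrative, not from the lecture:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # explore with probability epsilon, otherwise exploit the best known action
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                                 # initialize Q values
    for _ in range(episodes):
        S = env.reset()                                    # initialize state
        A = epsilon_greedy(Q, S, actions, epsilon)         # select initial action
        done = False
        while not done:
            S_next, R, done = env.step(A)                  # perform A, observe S' and R
            A_next = epsilon_greedy(Q, S_next, actions, epsilon)  # select new action A'
            # SARSA update: uses the action actually selected in S' (on-policy)
            Q[(S, A)] += alpha * (R + gamma * Q[(S_next, A_next)] - Q[(S, A)])
            S, A = S_next, A_next                          # transition to new state and action
    return Q
```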
Q-learning
- when evolving a policy with Q-learning, we do not perform the next action A’ before updating our Q-values (unlike SARSA)
- we just assume that we select the action with the highest Q-value in the next state
- off-policy approach
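the corresponding Q-learning update, written in standard notation:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```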
SARSA vs Q-learning
while in SARSA we updated our values for Q(S, A) based on the value of the selected action in the next state S′ using the same policy (e.g. ε-greedy), for Q-learning we directly select the action that has the highest value for the next state. This simplifies our algorithm
eligibility traces
since actions taken in the past might contribute to rewards received several steps later, we distribute credit to these actions over multiple steps.
Z_t(s, a) is included as a multiplicative factor in the update equations for SARSA and Q-learning
- If a state-action pair is more eligible (i.e., it has been applied more frequently and more recently in our history), the magnitude of its update is increased.
Z_t(s, a)
- if s = S_t and a = A_t: Z_t(s, a) = γλZ_{t−1}(s, a) + 1
- otherwise: Z_t(s, a) = γλZ_{t−1}(s, a)
- i.e., we add 1 to the eligibility trace of a state-action pair whenever that pair occurs, and only decay it otherwise. This ensures that the learning algorithm gives the strongest weight to the most recent decisions.
- λ and γ determine how quickly the eligibility trace decays. This decay allows older state-action pairs to gradually lose their influence on the updates, but not immediately.
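one common way to write the trace and how it scales the update (δ_t denotes the usual temporal-difference error; this notation is assumed, not from the card):

```latex
Z_t(s, a) =
\begin{cases}
  \gamma \lambda Z_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\
  \gamma \lambda Z_{t-1}(s, a)     & \text{otherwise}
\end{cases}
\qquad
Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, Z_t(s, a)
```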