lecture 7 - reinforcement learning Flashcards

1
Q

reinforcement learning questions

A
  1. what if we want to influence the user?
  2. which intervention/action should we select?
2
Q

two actors

A
  1. the user (environment): the quantified self
  2. the agent: the software entity we aim to create; it can observe the state of the user at time point t
  • based on the observed state, the agent derives an action to perform
  • at the next time point, we obtain a reward
3
Q

RL loop

A
  1. initial state S_t
  2. action selection A_t, based on S_t (by the agent)
  3. environment/user response: The environment responds to A_t by transitioning to new state S_{t+1} and providing a reward R_{t+1}
  4. feedback to agent
  5. prediction and learning: The agent makes predictions about future states and rewards, updates its policy based on the reward received, and learns to improve future actions.
  6. iteration
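A minimal Python sketch of this loop (the `env` and `agent` objects and their method names are illustrative assumptions, not part of the lecture):

```python
# Minimal agent-environment interaction loop (illustrative interface).
def run_episode(env, agent, max_steps=100):
    state = env.reset()                                  # initial state S_t
    for t in range(max_steps):
        action = agent.select_action(state)              # A_t selected based on S_t
        next_state, reward, done = env.step(action)      # environment responds with S_{t+1}, R_{t+1}
        agent.update(state, action, reward, next_state)  # agent learns from the feedback
        state = next_state                               # iterate with the new state
        if done:
            break
```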
4
Q

value function

A

we don't strive for the immediate reward only, but also for the rewards we accumulate in the future
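In standard notation, this accumulated future reward is the (discounted) return:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```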

5
Q

value function: γ

A
  • discount factor, γ ∈ [0, 1]
  • γ = 0: we only care about the immediate reward
  • γ = 1: future rewards are equally important as the current reward
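A small worked example of the effect of γ, assuming a constant reward of 1 at every future step:

```latex
\begin{aligned}
\gamma &= 0:   & G_t &= 1 \\
\gamma &= 0.9: & G_t &= 1 + 0.9 + 0.9^2 + \dots = \tfrac{1}{1 - 0.9} = 10
\end{aligned}
```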
6
Q

policy

A
  • maps a user state to an action (when to do what)
  • this is what we try to find
7
Q

RL balance

A

we should balance exploration (trying new actions to learn their effect) and exploitation (choosing the best-known action) to learn the best policy
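A common way to strike this balance is ε-greedy action selection: with probability ε we explore a random action, otherwise we exploit the action with the highest Q-value. A minimal sketch (the dictionary of Q-values keyed by (state, action) is an illustrative assumption):

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action,
    otherwise exploit the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.choice(actions)                                 # exploration
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploitation
```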

8
Q

markov property

A

the probability of ending up in a state with a reward at time t+1 can be based on either:

  1. the entire history of states, actions, and rewards
  2. the previous state and action only
  • the state has the Markov property when both probabilities are equal for all rewards r and states s over all time points
  • the future of the process then depends only on the present state, not on the sequence of events that preceded it (such as in chess, where the current board position contains all the relevant information)
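Written out, the Markov property requires that both probabilities are equal:

```latex
\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, R_1, \dots, S_t, A_t\}
  = \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t, A_t\}
\quad \text{for all } s', r \text{ and } t
```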
9
Q

markov decision process (MDP)

A
  • if the Markov property is satisfied, we can model our problem as an MDP with a finite number of states
  • transition probability: from one state s to a state s’ when taking action a
  • expected reward at t+1, given s, a, s’
  • policy π(a|s): gives the probability of selecting action a in state s
  • state-value function: the expected value of state s, given that we follow policy π thereafter
  • action-value function: the expected return if we select action a in state s and follow policy π thereafter
10
Q

MDP: transition probability

A

probability of transitioning to state s’ at time t+1, given state s and action a at time t
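In symbols:

```latex
p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s,\, A_t = a\}
```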

11
Q

MDP: expected reward

A

the expected reward for the transition from s to s’ under action a (i.e., an expected value)
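In symbols:

```latex
r(s, a, s') = \mathbb{E}\left[R_{t+1} \mid S_t = s,\, A_t = a,\, S_{t+1} = s'\right]
```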

12
Q

MDP: policy pi(a|s)

A

gives the probability of selecting action a in state s

  • we want to find policies with the highest state-value function and action-value function over all states
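In symbols, the policy defines a probability for each action in each state:

```latex
\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}
```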
13
Q

MDP: state-value function

A
  • v_π(s)
  • the expected value of state s = the expected total return, given that we follow policy π thereafter
  • measures how “good” it is to be in a specific state, assuming the agent acts according to the policy π from that point onward.
  • It evaluates the long-term potential of being in state s, without considering any specific action to be taken immediately.
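In symbols, using the return G_t defined earlier:

```latex
v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
         = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]
```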
14
Q

MDP: action-value function

A
  • q_π(s, a)
  • the expected return if we select action a in state s and follow policy π thereafter
  • measures how “good” it is to take a particular action a in a specific state s, considering not just the immediate reward from taking that action, but also the long-term rewards obtained by following the policy π afterward
  • It helps in evaluating the potential of an action in a given state, making it useful for decision-making.
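In symbols:

```latex
q_\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s,\, A_t = a\right]
            = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s,\, A_t = a\right]
```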
15
Q

MDP: which policies do we want to find

A
  • We are interested in finding the policy (or policies) π* that provides the highest state-value function in all states; the corresponding optimal state-value function v*(s) tells us the best expected return achievable from state s when following an optimal policy.
  • Similarly, we define the optimal action-value function q*(s, a), given our optimal policies π*. It helps in determining the best action to take in any given state.
  • these maximize the expected return
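In symbols:

```latex
v_*(s) = \max_\pi v_\pi(s) \qquad\qquad q_*(s, a) = \max_\pi q_\pi(s, a)
```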
16
Q

one step SARSA

A
  • State-Action-Reward-State-Action
  • on-policy reinforcement learning algorithm used to learn the value of state-action pairs
  • The goal is to update the action-value function q(s, a) based on the observed rewards and transitions.
17
Q

SARSA update rule

A
  • update the value for the state-action pair Q(S_t, A_t)
  • actions are selected based on the Q(S_t, A_t) values (e.g., ε-greedy)
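The standard one-step SARSA update, with learning rate α:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]
```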
18
Q

Q(S, A)

A

the Q-value Q(S,A) represents the expected return (reward) for taking action A in state S and following a particular policy thereafter. The Q-values guide the agent in making decisions to maximize its cumulative rewards.

19
Q

SARSA: ‘on-policy’

A

means that we pick our actions in the same way at each step: the policy used to select actions is the same policy whose values we are learning and updating

20
Q

SARSA algorithm

A
  1. initialize Q values
  2. initialize state
  3. select initial action

MAIN LOOP:
1. perform action A
2. increment time
3. observe new state S’
4. observe reward R
5. select new action A’
6. perform new action A’
7. update Q value
8. transition to new state and action
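A minimal Python sketch of these steps, assuming a generic `env` with `reset`/`step` methods and ε-greedy action selection (all names are illustrative):

```python
import random
from collections import defaultdict

def choose(Q, s, actions, epsilon):
    """Epsilon-greedy action selection based on the Q(S, A) values."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                            # 1. initialize Q values (to 0)
    for _ in range(episodes):
        S = env.reset()                               # 2. initialize state
        A = choose(Q, S, actions, epsilon)            # 3. select initial action
        done = False
        while not done:                               # MAIN LOOP
            S_next, R, done = env.step(A)             # perform A, observe new state S' and reward R
            A_next = choose(Q, S_next, actions, epsilon)  # select new action A'
            # update the Q value towards the one-step SARSA target
            Q[(S, A)] += alpha * (R + gamma * Q[(S_next, A_next)] - Q[(S, A)])
            S, A = S_next, A_next                     # transition to new state and action
    return Q
```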

21
Q

Q-learning

A
  • when learning a policy with Q-learning, we do not perform the next action A’ before updating our Q-values (unlike SARSA)
  • instead, we assume that we select the action with the highest value in the next state
  • off-policy approach
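The Q-learning update therefore uses the maximum Q-value over actions in the next state instead of the value of the action actually selected:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]
```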
22
Q

SARSA vs Q-learning

A

while in SARSA we updated our values for Q(S, A) based on the value of the selected action in the next state S′ using the same policy (e.g. ε-greedy), for Q-learning we directly select the action that has the highest value for the next state. This simplifies our algorithm

23
Q

eligibility traces

A

since actions taken in the past might contribute to rewards received several steps later, we distribute credit for these rewards over the actions taken in multiple previous steps.

Z_t(s, a) is included as a multiplicative factor in the update equations for SARSA and Q-learning

  • If a state-action pair is more eligible (i.e., it has been applied more frequently and more recently in our history), the magnitude of its update is increased.
24
Q

Z_t(s, a)

A
  • if s = S_t and a = A_t: Z_t(s, a) = γλZ_{t−1}(s, a) + 1
  • otherwise: Z_t(s, a) = γλZ_{t−1}(s, a)
  • i.e., we add 1 to the eligibility trace of a state-action pair when it occurs at the current time step; otherwise the trace only decays. This ensures that the learning algorithm prioritizes the most recent decisions.
  • λ and γ determine how quickly the eligibility trace decays. This decay allows older state-action pairs to gradually lose their influence on the updates, but not immediately.
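Written as a case distinction:

```latex
Z_t(s, a) =
\begin{cases}
\gamma \lambda\, Z_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\
\gamma \lambda\, Z_{t-1}(s, a)     & \text{otherwise}
\end{cases}
```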
25
Q

approximate solutions

A
  • we approximate the Q-value for state-action pairs
  • this model f(S, A, w) is parameterized by a set of weights w that are adjusted during training
  • the error is the sum of squared differences between Q(S, A) and f(S, A, w)
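Under these assumptions, the objective and a typical gradient-descent weight update look as follows (α is the learning rate; the exact target for Q(S, A) depends on whether SARSA or Q-learning is used):

```latex
E(\mathbf{w}) = \sum_{(S, A)} \bigl(Q(S, A) - f(S, A, \mathbf{w})\bigr)^2, \qquad
\mathbf{w} \leftarrow \mathbf{w} + \alpha \bigl(Q(S, A) - f(S, A, \mathbf{w})\bigr)\, \nabla_{\mathbf{w}} f(S, A, \mathbf{w})
```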
26
Q

continuous values in the state space

A

we use the U-Tree algorithm to discretize: we build a state tree that maps our continuous values to a state

  • We start with a single leaf
  • We collect data for a while
  • For all attributes X_i we try different splits based on the (sorted) values we have collected
  • We test whether the splits result in a significant difference in Q-values using the Kolmogorov-Smirnov test
  • We select the attribute with the lowest p-value and split on it (if below 0.05)
  • We continue collecting data and repeat the procedure per leaf
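A Python sketch of the split test for a single leaf, assuming the collected samples are (feature dict, Q-value) pairs and that scipy is available (the data layout and names are illustrative):

```python
from scipy.stats import ks_2samp

def best_split(samples, attributes, alpha=0.05):
    """Find the attribute/threshold whose split yields the most significant
    difference in Q-value distributions (two-sample Kolmogorov-Smirnov test).
    `samples` is a list of (features, q_value) pairs collected in this leaf."""
    best = None                                   # (p_value, attribute, threshold)
    for attr in attributes:
        values = sorted({f[attr] for f, _ in samples})
        for threshold in values[:-1]:             # candidate split points on the sorted values
            left = [q for f, q in samples if f[attr] <= threshold]
            right = [q for f, q in samples if f[attr] > threshold]
            if len(left) < 2 or len(right) < 2:
                continue
            p = ks_2samp(left, right).pvalue      # difference in Q-value distributions
            if best is None or p < best[0]:
                best = (p, attr, threshold)
    # only split if the best difference is significant (p-value below alpha)
    return best if best is not None and best[0] < alpha else None
```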
27
Q

challenges in the field

A
  1. learning full circle: learning quickly, safely, and using future predictions
  2. heterogeneity: learn across devices and people, and coordinate behavior
  3. effective data collection and reuse: collection of data (active learning) and transfer between use cases (transfer learning)
  4. data processing and storage: storing data (where, what), processing data (when, where), battery management
  5. better predictive modeling and clustering: better features with less effort, domain knowledge, temporal learning, explainability of models
  6. validation: perform validation, definition of success, setup of validation (slow approval process)