Multi-Agent Reinforcement Learning Flashcards
Why multi-agent systems?
MAS (multi-agent systems) eliminate the single point of failure and distribute the computational load.
Why Reinforcement Learning for Multi-Agent Systems?
MAS are:
- Very complex
- Highly dynamic
- Non-deterministic
- The actions of one agent influence the world models of the others
- Many tasks require cooperation
- There is a need for individual-level adaptation
=> Designing behaviour by hand is difficult; RL can learn it for you.
Reinforcement Learning
Learns from interactions and experience, not from labelled examples.
Delayed rewards, no direct feedback, needs to explore.
Maps situations to actions: the policy.
Policy
A policy is the strategy an agent uses in pursuit of its goals. It dictates the actions the agent takes as a function of the agent's state and the environment. Denoted pi; pi(s) chooses the action in state s.
Either a function that maps a state-action pair to a probability (the probability of executing that action in that state), or a function that maps a state directly to an action. It tells the agent how to behave.
Single Agent Problem
Markov decision process (MDP): only information about the current state is needed.
- a set of states S
- a set of actions A
- unknown transition function: S x A -> S
- unknown reward function: S x A -> r
Find a policy such that the sum of rewards over time (V) is maximized. A discount factor gamma (between 0 and 1) weights the rewards in the sum to modulate the weight of future rewards (and thus their importance).
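As a concrete illustration, here is a minimal sketch (the reward sequence is invented) of the discounted return that the policy should maximize:

```python
# Minimal sketch of the discounted return V = sum_t gamma^t * r_t
# that the policy maximizes; the reward list is made up for illustration.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([0, 0, 1, 5]))  # 0 + 0 + 0.81*1 + 0.729*5 = 4.455
```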
Nondeterministic Environments and Policies
- Probabilistic transitions (non-deterministic: the same action in a state can lead to different next states): S x A x S -> [0, 1]
- Probabilistic rewards (non-deterministic: the same action in a state can yield different rewards)
- Probabilistic policy (non-deterministic: the agent might not take the same action in the same state): S x A -> [0, 1]
The probability distributions should be stationary.
Bellman Equation
Can be used on a Markov decision process (probabilities depend only on the current state and action; the past does not matter).
A function that adds the current reward to the discounted future reward. Choose the action leading to the state that gives the highest combined current and future reward.
Vpi(s) = r(s, pi(s)) + gamma * Vpi(delta(s, pi(s)))
gamma weights the future reward -> the discount factor.
Requires knowledge of the transition and reward functions.
Can then be applied with policy iteration (sketched in code below): initialize the goal state's value with the final reward and all other states with 0, and start from a policy with specific actions.
Repeat until the state values converge, then improve the policy (by changing the actions chosen in the states).
This evaluation/improvement split resembles an actor-critic structure (the values are the critic, the policy is the actor).
Policy improvement takes the arg max:
pi(s) = argmax_a [ r(s, a) + gamma * Vpi(delta(s, a)) ]  (the same greedy step is used in value iteration)
Guaranteed to converge to the optimal policy.
Loop through all states: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ], where gamma is the discount factor and P(s, a, s') is the probability of ending up in s', summing over all next states the action can lead to.
If there are no probabilistic transitions, simply drop P.
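A hedged sketch of policy iteration; the tiny deterministic MDP below (states, actions, delta, r) is invented purely for illustration:

```python
# Policy iteration on a made-up 3-state deterministic MDP.
states = ["A", "B", "GOAL"]
actions = ["left", "right"]
delta = {("A", "left"): "A", ("A", "right"): "B",        # transition function
         ("B", "left"): "A", ("B", "right"): "GOAL",
         ("GOAL", "left"): "GOAL", ("GOAL", "right"): "GOAL"}
r = {(s, a): 0.0 for s in states for a in actions}       # reward function
r[("B", "right")] = 1.0                                  # reaching the goal pays
gamma = 0.9

policy = {s: "left" for s in states}                     # arbitrary initial policy
V = {s: 0.0 for s in states}

stable = False
while not stable:
    # Policy evaluation: V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s)))
    for _ in range(100):
        for s in states:
            V[s] = r[(s, policy[s])] + gamma * V[delta[(s, policy[s])]]
    # Policy improvement: pi(s) = argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]
    stable = True
    for s in states:
        best = max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
        if best != policy[s]:
            policy[s], stable = best, False

print(policy)  # expected: "right" in A and B
```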
value at a state
V(s)
Value iteration
Does policy improvement after updating only one state, unlike policy iteration (although the two are often used in combination), which computes all state values until convergence and only then adapts the policy.
It combines the Bellman equation (not evaluated according to the current policy) with the search for the action that maximizes it. It is a critic-style algorithm that no longer does an explicit policy-iteration loop.
For a given state, select the action that maximizes the whole expression; delta(s, a) = s' is the resulting next state. Look for the action that maximizes the sum of rewards.
Guaranteed to converge to the optimal values.
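A hedged sketch of value iteration with probabilistic transitions; the toy MDP (states, P, R) is invented for illustration:

```python
# Value iteration: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ].
states = ["s0", "s1", "goal"]
actions = ["a0", "a1"]
gamma = 0.9

P = {("s0", "a0"): [("s0", 0.5), ("s1", 0.5)],            # (next state, probability)
     ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("goal", 0.8), ("s0", 0.2)],
     ("s1", "a1"): [("s1", 1.0)],
     ("goal", "a0"): [("goal", 1.0)],
     ("goal", "a1"): [("goal", 1.0)]}
R = {key: 0.0 for key in P}
R[("s1", "a0")] = 1.0                                     # reward for heading to the goal

V = {s: 0.0 for s in states}
for _ in range(200):                                      # sweep until (roughly) converged
    for s in states:
        V[s] = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                   for a in actions)

# Read the greedy policy off the converged values
policy = {s: max(actions, key=lambda a: R[(s, a)] +
                 gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
          for s in states}
print(V, policy)
```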
State-action values
Does not require knowledge of the transition and reward functions (unlike policy/value iteration).
The values are Q-values: Q-value = immediate reward + discount factor x the Q-value of the best action available in the next state.
Q-Learning (deterministic and non-deterministic)
Initialize the Q-values (e.g. randomly), initialize the state, then repeat: choose an action, observe the reward r and the next state s', update Q(s, a) = r + gamma * max_a' Q(s', a') (gamma is the discount factor), and set s = s'.
r is the immediate reward, not a delayed one.
If the environment is non-deterministic these values won't converge -> average the new and the historic Q-values (with a higher weight on history).
To choose an action: epsilon-greedy or Boltzmann exploration.
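A hedged sketch of tabular Q-learning on a tiny, invented chain environment (four states in a row, goal at the right end); the learning rate alpha implements the "average new and historic Q-values" idea:

```python
import random

n_states, actions = 4, [-1, +1]
gamma, alpha, epsilon = 0.9, 0.5, 0.2
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for episode in range(500):
    s = 0
    while s != n_states - 1:                          # the last state is the goal
        # epsilon-greedy choice (see the exploration card below)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s2 = min(max(s + a, 0), n_states - 1)         # deterministic transition
        r = 1.0 if s2 == n_states - 1 else 0.0        # reward only at the goal
        target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s2

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})
```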
Epsilon Greedy & Boltzmann Exploration
Choose an action for Q learning
Draw a random number: if it is below the exploration rate, pick a random action; if it is above, select the action with the highest Q-value.
Let epsilon decay over time.
Boltzmann selection picks each action with a probability proportional to exp(Q(s, a) / T); the temperature T decays over time.
=> exploration vs exploitation
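A hedged sketch of both action-selection rules; the Q-values below are invented:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore (random action), otherwise exploit.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature):
    # P(a) proportional to exp(Q(a) / T): high T -> near-uniform (explore),
    # low T -> near-greedy (exploit), so decaying T shifts toward exploitation.
    weights = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q, epsilon=0.1), boltzmann(q, temperature=0.5))
```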
Multi-Agent Systems
- Exponential state space S growth with the number of agents
- All actions jointly influence state transitions and rewards: the space of possible actions and joint-action outcomes grows exponentially with the number of agents, and rewards can be shared or individual.
- Credit assignment problem: in traditional RL the question is which past actions led to this reward (temporal credit assignment; temporal-difference methods compare what your experience tells you with what you expect from the environment); in multi-agent RL there is additionally the question of which agents' actions led to this reward.
Rewards for cooperating agents
Ordered from less to more learnability:
- Full system reward: if the whole system is successful, give a reward
- Local reward: direct local feedback, so agents learn more quickly
- Difference reward: each agent gets information about its own impact (based on the system reward with and without the agent's action)
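A hedged sketch of the three reward schemes on a made-up cooperative coverage task (the task, targets, and helper names are invented for illustration):

```python
# Each agent picks a target; the system reward G counts the covered targets.
def global_reward(joint_action, targets):
    return len(targets & set(joint_action))            # full system reward G(z)

def local_reward(agent_action, targets):
    return 1.0 if agent_action in targets else 0.0     # direct local feedback

def difference_reward(joint_action, i, targets):
    # D_i = G(with agent i's action) - G(with agent i's action removed):
    # how much the system gained because agent i acted the way it did.
    without_i = list(joint_action)
    without_i[i] = None
    return global_reward(joint_action, targets) - global_reward(without_i, targets)

targets = {"t1", "t2", "t3"}
joint = ["t1", "t1", "t2"]                             # agents 0 and 1 duplicate work
print([difference_reward(joint, i, targets) for i in range(3)])
# -> [0, 0, 1]: the duplicated choices add nothing, only agent 2 contributes
```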