Multi-Agent Reinforcement Learning Flashcards
Why multi-agent systems?
MAS
Because they remove the single point of failure and distribute the computational load.
Why Reinforcement Learning for Multi Agents
MAS are:
- Very complex
- Highly dynamic
- Non-deterministic
- Actions of one agent influence the world model of others
- Many tasks require cooperation
- Need for individual-level adaptation
=> Designing behaviour by hand is difficult; RL can learn it for you
Reinforcement Learning
Learns from interactions and experience, not from examples
Delayed rewards, no direct feedback, needs to explore
Maps situations to actions: the policy
Policy
A policy is a strategy that an agent uses in pursuit of its goals. The policy dictates the actions that the agent takes as a function of the agent’s state and the environment. It is denoted pi; pi(s) gives the action chosen in state s.
Formally, either a function from a state-action pair to a probability (the probability of executing that action in that state), or a function from state to action. It tells the agent how to behave.
Single Agent Problem
Markov decision process: only needs information about the current state
set of states
set of actions
unknown transition function: S x A -> S
unknown reward function: S x A -> r
Need to find a policy such that the sum of rewards over time (V) is maximized. A discount factor gamma (between 0 and 1) is applied to the rewards in the sum to modulate the weight of future rewards (and thus their importance).
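A minimal sketch of the discounted sum of rewards that V measures; the reward sequence and gamma value are made up for illustration:

```python
# Discounted return: rewards further in the future are weighted by gamma**t,
# so a smaller gamma makes the agent care less about distant rewards.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 1, 0, 5]))  # later rewards contribute less
```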
Nondeterministic Environments and Policies
- Probabilistic transitions (non-deterministic: the same action in a state can lead to different outcomes): S x A x S -> between 0 and 1
- Probabilistic rewards (non-deterministic: the same action in a state can lead to different rewards)
- Probabilistic policy (non-deterministic: might not take the same action in the same state): S x A -> between 0 and 1
The probability distributions should be stationary.
Bellman Equation
Can be used on a Markov decision process (probability of taking certain actions in a state; the past doesn’t matter, only the current state).
A function that adds the current reward and the (discounted) future reward. Choose the action that gives you the most combined current and future reward:
V^pi(s) = r(s, pi(s)) + gamma * V^pi(delta(s, pi(s)))
gamma weights future reward -> discount factor
requires the knowledge of transition and reward
can then be applied in policy iteration: initialize the state values (e.g. the final reward at the goal state and 0 for all other states) and start from a policy with specific actions fixed.
Repeat until the state values converge, then improve the policy (by changing the actions taken in states).
Evaluation plus improvement is the same split as in actor-critic methods: the critic evaluates the values, the actor improves the policy.
Policy improvement takes the arg max over actions:
pi(s) = argmax_a [ r(s,a) + gamma * V^pi(delta(s,a)) ]
guarantee convergence to optimal
Loop through all states: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ], where gamma is the discount factor and P(s,a,s') is the probability of ending up in s', summing over all states we can reach.
If there are no probabilities (deterministic case), just drop P and use the single successor state.
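A minimal sketch of policy iteration, assuming a tabular MDP with known transition probabilities P[s, a, s'] and expected rewards R[s, a]; the array shapes, discount, and tolerance values are assumptions for illustration:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    # P[s, a, s2]: transition probability, R[s, a]: expected reward.
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    V = np.zeros(n_states)                        # initial state values
    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy
        while True:
            V_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                              for s in range(n_states)])
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily with respect to the evaluated values
        Q = R + gamma * P @ V                     # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):    # stop when the policy is stable
            return policy, V
        policy = new_policy
```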
value at a state
V(s)
Value iteration
Does the improvement step after updating each state, unlike policy iteration (although the two are often used combined), which computes all state values until convergence and only then adapts the policy.
This is the Bellman equation applied not according to the current policy, but while looking for the action that maximizes it. It is a critic-style algorithm that no longer does an explicit iteration of the policy.
For a given state, select the action that maximizes the whole expression; 𝛿(s,a) is s’, the new state. Look for the action that maximizes the sum of rewards.
Guaranteed to converge to the optimal values.
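A minimal sketch of value iteration under the same assumed P[s, a, s'] / R[s, a] arrays as in the policy-iteration sketch above; the max over actions replaces the separate evaluation-to-convergence step:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new        # greedy policy + converged values
        V = V_new
```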
State-action values
Does not require knowledge of the transition and reward functions (unlike policy/value iteration).
The values are Q-values: Q-value = immediate reward + discount factor × the Q-value of the best action you can take in the next state.
Q Learning deterministic and non deterministic
Initialize the Q-values randomly and initialize the state, then repeat: choose an action, observe r and the next state s', update Q(s,a) = r + gamma * max_a' Q(s',a'), then set s = s'. gamma is the discount factor.
immediate reward, not delayed
If non-deterministic, the values won’t converge with this update -> average the new and the historic Q-values (with a higher weight on history).
To choose action: Epsilon-greedy or Boltzmann Exploration
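A minimal sketch of tabular Q-learning with epsilon-greedy action selection; the environment interface (reset()/step() returning state, reward, done) and all hyperparameter values are assumptions for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)                         # Q[(state, action)], starts at 0
    actions = list(range(n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False               # assumed env interface
        while not done:
            if random.random() < epsilon:          # explore
                a = random.choice(actions)
            else:                                  # exploit
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)              # assumed env interface
            # Blend the new estimate with history via alpha (non-deterministic case)
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```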
Epsilon Greedy & Boltzmann Exploration
Choose an action for Q learning
Draw a random number: if it is below the exploration rate, pick a random action; if it is above, select the action with the highest Q-value.
make epsilon decay over time
Boltzmann exploration selects each action with a probability proportional to exp(Q/T); the temperature T decays over time.
=> exploration vs exploitation
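A minimal sketch of both strategies over a list of Q-values q (one entry per action); epsilon and the temperature T are illustrative values:

```python
import math
import random

def epsilon_greedy(q, epsilon=0.1):
    if random.random() < epsilon:                  # below epsilon: explore
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])  # otherwise: exploit

def boltzmann(q, T=1.0):
    # Each action gets probability proportional to exp(Q/T);
    # high T ~ near-random, low T ~ near-greedy, so decay T over time.
    weights = [math.exp(v / T) for v in q]
    return random.choices(range(len(q)), weights=weights)[0]
```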
Multi-Agent Systems
- Exponential state space S growth with the number of agents
- All actions jointly influence state transitions and rewards: space of possible actions and joint-action
outcomes grow exponentially with number of agents, and rewards can be shared or individual.
Credit assignment problem: in traditional RL the question is which actions led to this reward (temporal difference: what your experience tells you versus what you expect from the environment); in multi-agent RL there is additionally the question of which agents' actions led to the reward.
Rewards cooperating agents
Less learnability
- Full system reward: if the system as a whole is successful, give every agent the reward
- Local reward: direct local feedback, so the agent learns more quickly
- Difference reward: each agent gets information about its own impact (based on the system reward with and without the agent’s action; see the sketch after this card)
More Learnability
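A minimal sketch contrasting the full system reward with the difference reward; the global performance function G and the "default action" counterfactual are illustrative placeholders:

```python
def full_system_reward(G, joint_action):
    g = G(joint_action)
    return [g for _ in joint_action]          # every agent receives the same signal

def difference_reward(G, joint_action, default_action=None):
    g = G(joint_action)
    rewards = []
    for i in range(len(joint_action)):
        # Counterfactual: what would the system have earned without agent i's action?
        counterfactual = list(joint_action)
        counterfactual[i] = default_action
        rewards.append(g - G(counterfactual))  # agent i's own contribution
    return rewards
```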
Competing agents
- Each agent has its own reward function
- Deterministic policies can be exploited: agents need to learn mixed policies. If an agent always acts according to its single best policy, it gets stuck in that policy → it becomes predictable.
- The optimal policy depends on the other agents (best response policy)
Best response policy
Competing agent strategy
policy that given the policy of other agents is optimal → given the strategies of my
opponents, I know that this is my best strategy.
Can result in Nash Equilibrium: collection of strategies such that no player can improve by changing its own policy. Each player’s strategy is a best response.
OR
Pareto optimality: a strategy profile for which you cannot improve the situation of one player while keeping the situation of all others equal or better.
There can be multiple of each
Because the other agents' policies change, the best response policy moves, and thus the problem is no longer Markovian.
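A minimal sketch of checking best responses and a pure Nash equilibrium in a two-player normal-form game; the prisoner's-dilemma payoff numbers are illustrative:

```python
import numpy as np

def is_nash(A, B, row, col):
    # A[row, col]: row player's payoff, B[row, col]: column player's payoff.
    row_is_best = A[row, col] == A[:, col].max()   # row player cannot improve
    col_is_best = B[row, col] == B[row, :].max()   # column player cannot improve
    return row_is_best and col_is_best

# Prisoner's dilemma (action 0 = cooperate, 1 = defect): defect/defect is the
# only Nash equilibrium, although cooperate/cooperate is Pareto-better.
A = np.array([[3, 0],
              [5, 1]])
B = A.T
print(is_nash(A, B, 1, 1))   # True
print(is_nash(A, B, 0, 0))   # False
```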
Non-stationary environments
Action outcomes are jointly determined by all agents. Agents learn simultaneously, so an agent's action no longer influences the state according to a fixed probability distribution. This means the environment is no longer Markovian.
Symmetric game
The reward structure looks exactly the same whether you are the column player or the row player; there is no difference (the action pair leads to the same reward for all players). E.g. the prisoner’s dilemma (not stable).
Asymmetric is when the values differ across the cells.
Zero-sum game
Add the rewards of the two players in any of the outcomes and the sum is 0 (what one player earns, the other loses). Best strategy: a random (mixed) one.
Q Learning normal form games
independent learner → the other agents are ignored; the agent behaves as in single-agent learning and the interactions with other agents are treated as noise in a stochastic environment.
This might still work, but methods have been developed with the aim of converging and behaving rationally, e.g. Learning Automata.
Learning Automata
Policy iteration in stateless environments; originally assumes a finite action set.
Like Q-learning, uses immediate reward to update its action selection probabilities.
Each agent keeps a vector with as many dimensions as there are actions that stores its action preferences, i.e. how likely it is to perform each action.
The values are updated every time step through an update function: the probability of the chosen action is incremented and the others are decremented. Changes to actions that are already likely are small, changes to unlikely actions are bigger, and everything is scaled by the reward -> if an action that is very likely yields a poor reward, it gets updated a lot.
alpha and beta are the reward and penalty parameters.
alpha = 1 and beta = 0 gives Cross learning.
alpha = beta gives linear reward-penalty learning.
Rewards are assumed to be between 0 and 1.
Does not use state information (stateless approach); a Network of Learning Automata can.
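A minimal sketch of a stateless learning automaton with a linear reward-penalty update; the exact update form and the alpha/beta values are assumptions for illustration (alpha = 1, beta = 0 corresponds to Cross learning as noted above):

```python
import random

class LearningAutomaton:
    def __init__(self, n_actions, alpha=0.1, beta=0.01):
        self.p = [1.0 / n_actions] * n_actions     # action-selection probabilities
        self.alpha, self.beta = alpha, beta        # reward / penalty parameters

    def choose(self):
        return random.choices(range(len(self.p)), weights=self.p)[0]

    def update(self, action, reward):              # reward assumed to be in [0, 1]
        k = len(self.p)
        for j in range(k):
            if j == action:
                # Chosen action: small change if already likely, scaled by reward
                self.p[j] += self.alpha * reward * (1 - self.p[j]) \
                             - self.beta * (1 - reward) * self.p[j]
            else:
                # Other actions: probability mass is redistributed away / back
                self.p[j] += -self.alpha * reward * self.p[j] \
                             + self.beta * (1 - reward) * (1.0 / (k - 1) - self.p[j])
```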
Network of Learning Automata
To handle different states, one can use a separate learning automaton for each state.
Then use the average reward obtained since the last visit in the update
r = (sum_{t=l}^{T} r_t) / (T - l)
where T is the current time step and l is the last time this state was visited.
Cross Learning
– Originally intended to describe human learning behaviour. Notation: x_i for the probability of action i (instead of 𝜋(ai)), capital R for the reward (instead of r). To make someone unlearn a bad habit, there needs to be a very bad, immediate ‘punishment’. The second (subtracted) part is always smaller than the first part, because probabilities lie in [0, 1]. ‘Doing something makes you do it again’: performing one action lowers the probability of every other action, no matter the reward.
Regret Minimization
Stateless approach
expandable through a network of learners
Assumes hindsight knowledge of the best possible payoff and calculates a loss (regret) for taking action aj instead of the best action: lj = R - Rj, where R is the best possible payoff and Rj the payoff of aj.
Polynomial Weights
Maintains a set of weights wj, one for each action.
Updates the weights according to the perceived loss: wj ← wj * (1 - alpha * lj).
The weights are then normalized so that the resulting policy is valid.
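A minimal sketch of the polynomial-weights update; the losses come from the hindsight regret of the previous card and alpha is an illustrative learning rate:

```python
def polynomial_weights_update(weights, losses, alpha=0.1):
    # Shrink every action's weight by its loss, then renormalize into a policy.
    new_weights = [w * (1 - alpha * l) for w, l in zip(weights, losses)]
    total = sum(new_weights)
    policy = [w / total for w in new_weights]
    return new_weights, policy

# Example: three actions; action 2 was best in hindsight (loss 0) and gains mass.
weights, policy = polynomial_weights_update([1.0, 1.0, 1.0], [0.5, 0.2, 0.0])
print(policy)
```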
Joint Action Learning
Opponent modelling. Remember action-selection-count Cja for all other agents
Learn Q-values for joint actions
Coordination becomes possible in cooperative MAS
Much more statistics need to be kept
Assumes that observing the actions of all agents is possible.
Might need to know about policy update mechanisms to make correct
predictions.
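A minimal sketch of a joint-action learner for a stateless two-agent game; the class name, the empirical opponent model, and the learning rate are assumptions for illustration:

```python
from collections import defaultdict

class JointActionLearner:
    def __init__(self, my_actions, other_actions, alpha=0.1):
        self.Q = defaultdict(float)        # Q-values over joint actions
        self.counts = defaultdict(int)     # action-selection counts of the other agent
        self.my_actions, self.other_actions = my_actions, other_actions
        self.alpha = alpha

    def update(self, my_action, other_action, reward):
        self.counts[other_action] += 1     # update the opponent model
        q = self.Q[(my_action, other_action)]
        self.Q[(my_action, other_action)] = q + self.alpha * (reward - q)

    def expected_value(self, my_action):
        # Expected payoff under the empirical distribution of the other agent's actions
        total = sum(self.counts.values()) or 1
        return sum(self.Q[(my_action, o)] * self.counts[o] / total
                   for o in self.other_actions)

    def best_action(self):
        return max(self.my_actions, key=self.expected_value)
```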