Multi-Agent Reinforcement Learning Flashcards
Why multi-agent systems?
MAS (multi-agent systems) eliminate the single point of failure and distribute the computational load.
Why Reinforcement Learning for Multi-Agent Systems?
MAS are:
- Very complex
- Highly dynamic
- Non-deterministic
- The actions of one agent influence the world models of the others
- Many tasks require cooperation
- There is a need for individual-level adaptation
=> Designing behaviour by hand is difficult; RL can learn it for you.
Reinforcement Learning
Learns from interactions and experience, not from labelled examples.
Delayed rewards, no direct feedback, needs to explore.
Maps situations to actions: the policy.
Policy
A policy is the strategy an agent uses in pursuit of its goals. It dictates the actions the agent takes as a function of the agent's state and the environment. Denoted pi; pi(s) chooses the action in state s.
Either a function that maps a state-action pair to a probability (the probability of executing that action in that state), or a function that maps a state directly to an action. It tells the agent how to behave.
Single Agent Problem
Markov decision process (MDP): only information about the current state is needed.
- a set of states S
- a set of actions A
- unknown transition function: S x A -> S
- unknown reward function: S x A -> r
Find a policy such that the sum of rewards over time (V) is maximized. A discount factor gamma (between 0 and 1) weights the rewards in the sum to modulate the weight of future rewards (and thus their importance).
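As a concrete illustration, here is a minimal sketch (the reward sequence is invented) of the discounted return that the policy should maximize:

```python
# Minimal sketch of the discounted return V = sum_t gamma^t * r_t
# that the policy maximizes; the reward list is made up for illustration.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([0, 0, 1, 5]))  # 0 + 0 + 0.81*1 + 0.729*5 = 4.455
```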
Nondeterministic Environments and Policies
- Probabilistic transitions (non-deterministic: the same action in a state can lead to different next states): S x A x S -> [0, 1]
- Probabilistic rewards (non-deterministic: the same action in a state can yield different rewards)
- Probabilistic policy (non-deterministic: the agent might not take the same action in the same state): S x A -> [0, 1]
The probability distributions should be stationary.
Bellman Equation
Can be used on a Markov decision process (probabilities depend only on the current state and action; the past does not matter).
A function that adds the current reward to the discounted future reward. Choose the action leading to the state that gives the highest combined current and future reward.
Vpi(s) = r(s, pi(s)) + gamma * Vpi(delta(s, pi(s)))
gamma weights the future reward -> the discount factor.
Requires knowledge of the transition and reward functions.
Can then be applied with policy iteration (sketched in code below): initialize the goal state's value with the final reward and all other states with 0, and start from a policy with specific actions.
Repeat until the state values converge, then improve the policy (by changing the actions chosen in the states).
This evaluation/improvement split resembles an actor-critic structure (the values are the critic, the policy is the actor).
Policy improvement takes the arg max:
pi(s) = argmax_a [ r(s, a) + gamma * Vpi(delta(s, a)) ]  (the same greedy step is used in value iteration)
Guaranteed to converge to the optimal policy.
Loop through all states: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ], where gamma is the discount factor and P(s, a, s') is the probability of ending up in s', summing over all next states the action can lead to.
If there are no probabilistic transitions, simply drop P.
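A hedged sketch of policy iteration; the tiny deterministic MDP below (states, actions, delta, r) is invented purely for illustration:

```python
# Policy iteration on a made-up 3-state deterministic MDP.
states = ["A", "B", "GOAL"]
actions = ["left", "right"]
delta = {("A", "left"): "A", ("A", "right"): "B",        # transition function
         ("B", "left"): "A", ("B", "right"): "GOAL",
         ("GOAL", "left"): "GOAL", ("GOAL", "right"): "GOAL"}
r = {(s, a): 0.0 for s in states for a in actions}       # reward function
r[("B", "right")] = 1.0                                  # reaching the goal pays
gamma = 0.9

policy = {s: "left" for s in states}                     # arbitrary initial policy
V = {s: 0.0 for s in states}

stable = False
while not stable:
    # Policy evaluation: V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s)))
    for _ in range(100):
        for s in states:
            V[s] = r[(s, policy[s])] + gamma * V[delta[(s, policy[s])]]
    # Policy improvement: pi(s) = argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]
    stable = True
    for s in states:
        best = max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
        if best != policy[s]:
            policy[s], stable = best, False

print(policy)  # expected: "right" in A and B
```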
value at a state
V(s)
Value iteration
Does policy improvement after updating only one state, unlike policy iteration (although the two are often used in combination), which computes all state values until convergence and only then adapts the policy.
It combines the Bellman equation (not evaluated according to the current policy) with the search for the action that maximizes it. It is a critic-style algorithm that no longer does an explicit policy-iteration loop.
For a given state, select the action that maximizes the whole expression; delta(s, a) = s' is the resulting next state. Look for the action that maximizes the sum of rewards.
Guaranteed to converge to the optimal values.
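A hedged sketch of value iteration with probabilistic transitions; the toy MDP (states, P, R) is invented for illustration:

```python
# Value iteration: V(s) = max_a [ R(s, a) + gamma * sum_s' P(s, a, s') * V(s') ].
states = ["s0", "s1", "goal"]
actions = ["a0", "a1"]
gamma = 0.9

P = {("s0", "a0"): [("s0", 0.5), ("s1", 0.5)],            # (next state, probability)
     ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("goal", 0.8), ("s0", 0.2)],
     ("s1", "a1"): [("s1", 1.0)],
     ("goal", "a0"): [("goal", 1.0)],
     ("goal", "a1"): [("goal", 1.0)]}
R = {key: 0.0 for key in P}
R[("s1", "a0")] = 1.0                                     # reward for heading to the goal

V = {s: 0.0 for s in states}
for _ in range(200):                                      # sweep until (roughly) converged
    for s in states:
        V[s] = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                   for a in actions)

# Read the greedy policy off the converged values
policy = {s: max(actions, key=lambda a: R[(s, a)] +
                 gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
          for s in states}
print(V, policy)
```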
State-action values
Does not require knowledge of the transition and reward functions (unlike policy/value iteration).
The values are Q-values: Q-value = immediate reward + discount factor x the Q-value of the best action available in the next state.
Q-Learning (deterministic and non-deterministic)
Initialize the Q-values (e.g. randomly), initialize the state, then repeat: choose an action, observe the reward r and the next state s', update Q(s, a) = r + gamma * max_a' Q(s', a') (gamma is the discount factor), and set s = s'.
r is the immediate reward, not a delayed one.
If the environment is non-deterministic these values won't converge -> average the new and the historic Q-values (with a higher weight on history).
To choose an action: epsilon-greedy or Boltzmann exploration.
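A hedged sketch of tabular Q-learning on a tiny, invented chain environment (four states in a row, goal at the right end); the learning rate alpha implements the "average new and historic Q-values" idea:

```python
import random

n_states, actions = 4, [-1, +1]
gamma, alpha, epsilon = 0.9, 0.5, 0.2
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for episode in range(500):
    s = 0
    while s != n_states - 1:                          # the last state is the goal
        # epsilon-greedy choice (see the exploration card below)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s2 = min(max(s + a, 0), n_states - 1)         # deterministic transition
        r = 1.0 if s2 == n_states - 1 else 0.0        # reward only at the goal
        target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s2

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})
```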
Epsilon Greedy & Boltzmann Exploration
Choose an action for Q learning
Draw a random number: if it is below the exploration rate, pick a random action; if it is above, select the action with the highest Q-value.
Let epsilon decay over time.
Boltzmann selection picks each action with a probability proportional to exp(Q(s, a) / T); the temperature T decays over time.
=> exploration vs exploitation
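A hedged sketch of both action-selection rules; the Q-values below are invented:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore (random action), otherwise exploit.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature):
    # P(a) proportional to exp(Q(a) / T): high T -> near-uniform (explore),
    # low T -> near-greedy (exploit), so decaying T shifts toward exploitation.
    weights = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q, epsilon=0.1), boltzmann(q, temperature=0.5))
```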
Multi-Agent Systems
- Exponential state space S growth with the number of agents
- All actions jointly influence state transitions and rewards: the space of possible actions and joint-action outcomes grows exponentially with the number of agents, and rewards can be shared or individual.
- Credit assignment problem: in traditional RL the question is which past actions led to this reward (temporal credit assignment; temporal-difference methods compare what your experience tells you with what you expect from the environment); in multi-agent RL there is additionally the question of which agents' actions led to this reward.
Rewards for cooperating agents
Ordered from less to more learnability:
- Full system reward: if the whole system is successful, give a reward
- Local reward: direct local feedback, so agents learn more quickly
- Difference reward: each agent gets information about its own impact (based on the system reward with and without the agent's action)
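A hedged sketch of the three reward schemes on a made-up cooperative coverage task (the task, targets, and helper names are invented for illustration):

```python
# Each agent picks a target; the system reward G counts the covered targets.
def global_reward(joint_action, targets):
    return len(targets & set(joint_action))            # full system reward G(z)

def local_reward(agent_action, targets):
    return 1.0 if agent_action in targets else 0.0     # direct local feedback

def difference_reward(joint_action, i, targets):
    # D_i = G(with agent i's action) - G(with agent i's action removed):
    # how much the system gained because agent i acted the way it did.
    without_i = list(joint_action)
    without_i[i] = None
    return global_reward(joint_action, targets) - global_reward(without_i, targets)

targets = {"t1", "t2", "t3"}
joint = ["t1", "t1", "t2"]                             # agents 0 and 1 duplicate work
print([difference_reward(joint, i, targets) for i in range(3)])
# -> [0, 0, 1]: the duplicated choices add nothing, only agent 2 contributes
```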