Multi-Agent Reinforcement Learning Flashcards

1
Q

Why multi-agent systems?

A

MAS eliminate the single point of failure and distribute the
computational load across agents.

2
Q

Why Reinforcement Learning for Multi-Agent Systems?

A
MAS are:
- Very complex
- Highly dynamic
- Non-deterministic
- The actions of one agent influence the world models of the others
- Many tasks require cooperation
- There is a need for individual-level adaptation
=> Design by hand is difficult; RL can help learn the behaviour for you.
3
Q

Reinforcement Learning

A

Learns from interaction and experience, not from labelled examples
Rewards are delayed and there is no direct feedback, so the agent needs to explore
Learns a mapping from situations to actions: a policy

4
Q

Policy

A

A policy is the strategy an agent uses in pursuit of its goals: it dictates the actions the agent takes as a function of the agent's state and the environment. It is denoted pi; pi(s) gives the action chosen in state s.

Formally, either a function that maps a state-action pair to a probability (the probability of executing that action in
that state), or a function that maps a state directly to an action. It tells the agent how to behave.

5
Q

Single Agent Problem

A

A Markov decision process: only information about the current state is needed
- a set of states S
- a set of actions A
- unknown transition function: S x A -> S
- unknown reward function: S x A -> r
Need to find a policy such that the sum of rewards over time, V, is maximized. A discount factor gamma (between 0 and 1) weights the rewards in the sum, modulating the weight of future rewards (and thus their importance).
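A minimal sketch of the quantity being maximized (Python; the reward sequence is a made-up example, not from the card):

    # Discounted return: the sum of rewards weighted by gamma^t.
    def discounted_return(rewards, gamma=0.9):
        # gamma in [0, 1) controls how much weight future rewards get.
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Example: a reward of 1.0 arriving two steps in the future counts for ~0.81.
    print(discounted_return([0.0, 0.0, 1.0]))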

6
Q

Nondeterministic Environments and Policies

A
Probabilistic transitions (non-deterministic: the same action in a state can lead to different outcomes): S x A x S -> [0, 1]
Probabilistic rewards (non-deterministic: the same action in a state can lead to different rewards)
Probabilistic policy (non-deterministic: the agent might not take the same action in the same state): S x A -> [0, 1]
The probability distributions should be stationary.
7
Q

Bellman Equation

A

Applies to a Markov decision process (the probability of reaching a state depends only on the current state and action; the past does not matter).
It writes the value of a state as the current reward plus the discounted future value; choose the action that gives the most current-plus-future reward:
V_pi(s) = r(s, pi(s)) + gamma * V_pi(delta(s, pi(s)))
gamma weights future reward -> discount factor
Requires knowledge of the transition and reward functions.

It can then be applied with policy iteration: initialize the values (e.g. the goal state with its final reward, all other states with 0) and fix a policy with specific actions.
Repeat the value updates until the state values converge, then improve the policy (by changing the actions chosen in states) - an actor-critic scheme.
Policy improvement takes the maximizing action:
pi(s) = argmax_a ( r(s,a) + gamma * V_pi(delta(s,a)) )
Guaranteed to converge to the optimal policy.

In the probabilistic case, loop through all states and take
max_a ( R(s,a) + gamma * sum_s' P(s,a,s') V(s') )
where gamma is the discount factor, P(s,a,s') the transition probability, and s' ranges over all states that can be reached.

If there are no transition probabilities (deterministic case), the P term simply drops out.
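A minimal policy-iteration sketch in Python, assuming tabular data structures that are not part of the card (P[s][a] is a list of (probability, next_state) pairs, R[s][a] is the immediate reward):

    def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        pi = {s: actions[0] for s in states}          # arbitrary initial policy
        while True:
            # Policy evaluation: apply the Bellman equation for the current policy
            # until the state values converge.
            while True:
                delta = 0.0
                for s in states:
                    v_new = R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                    delta = max(delta, abs(v_new - V[s]))
                    V[s] = v_new
                if delta < tol:
                    break
            # Policy improvement: switch every state to its argmax action.
            stable = True
            for s in states:
                best = max(actions, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                if best != pi[s]:
                    pi[s] = best
                    stable = False
            if stable:
                return pi, V                          # no state changed its action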

8
Q

value at a state

A

V(s): the expected discounted sum of future rewards when starting in state s and following the policy.

9
Q

Value iteration

A

Unlike policy iteration (which computes all state values until convergence and only then adapts the policy), value iteration does the policy improvement inside every single value update; in practice the two are often combined.
It combines the Bellman equation with a maximization over actions (so the backup is not taken according to the current policy); it is a critic-only algorithm that no longer does explicit iteration of the policy.
For a given state, select the action that maximizes the whole expression
V(s) <- max_a ( r(s,a) + gamma * V(delta(s,a)) )
where delta(s,a) = s' is the new state; look for the action that maximizes the sum of rewards.
Guaranteed to converge to the optimal values.
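A matching value-iteration sketch under the same assumed tabular structures as the policy-iteration sketch above (P[s][a] as (probability, next_state) pairs, R[s][a] as rewards):

    def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Bellman optimality backup: max over actions instead of the current policy.
                v_new = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Read off a greedy policy once the values have converged.
        pi = {s: max(actions, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in states}
        return pi, V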

10
Q

State-action values

A

Does not require knowledge of the transition and reward functions (unlike policy/value iteration).
The values are Q-values: Q-value = immediate reward + discount factor x the Q-value of the best action available in the next state:
Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')

11
Q

Q-Learning (deterministic and non-deterministic)

A
Initialize the Q-values randomly
Initialize the state
Loop: choose an action, observe the reward r and the next state s', update Q(s,a) = r + gamma * max_a' Q(s',a'), then set s = s'
gamma is the discount factor
The update uses the immediate reward, not a delayed one

If the environment is non-deterministic the values won't converge this way -> average the new estimate with the historic Q-value (with increasing weight on history over time, i.e. a decaying learning rate)

To choose the action: epsilon-greedy or Boltzmann exploration
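A tabular Q-learning sketch in Python; the environment interface (reset()/step() returning next state, reward, done) and all parameter values are assumptions for illustration:

    import random

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = {}                                   # Q[(state, action)], missing entries count as 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection (next card).
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda a_: Q.get((s, a_), 0.0))
                s2, r, done = env.step(a)
                # Non-deterministic case: average the new estimate into the old value
                # with learning rate alpha instead of overwriting it.
                target = r + gamma * max(Q.get((s2, a_), 0.0) for a_ in actions)
                Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
                s = s2
        return Q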

12
Q

Epsilon Greedy & Boltzmann Exploration

A

Ways to choose the action in Q-learning.
Epsilon-greedy: draw a random number; if it is below the exploration rate epsilon, pick a random action, otherwise select the action with the highest Q-value. Let epsilon decay over time.
Boltzmann exploration: select each action with a probability proportional to exp(Q(s,a)/T); the temperature T decays over time so exploration decreases.

=> exploration vs exploitation
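Both selection rules in a short Python sketch; q is assumed to be a dict mapping each action to its Q-value in the current state:

    import math, random

    def epsilon_greedy(q, epsilon):
        if random.random() < epsilon:            # explore with probability epsilon
            return random.choice(list(q))
        return max(q, key=q.get)                 # otherwise exploit the highest Q-value

    def boltzmann(q, temperature):
        # High temperature -> almost uniform; low temperature -> almost greedy.
        weights = [math.exp(q[a] / temperature) for a in q]
        return random.choices(list(q), weights=weights, k=1)[0]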

13
Q

Multi-Agent Systems

A
  • The state space S grows exponentially with the number of agents.
  • All actions jointly influence state transitions and rewards: the space of possible actions and joint-action
    outcomes grows exponentially with the number of agents, and rewards can be shared or individual.

Credit assignment problem: in traditional RL the question is which actions led to this reward (temporal credit
assignment, handled e.g. by temporal-difference methods that compare what you expect from the environment with what
you actually experience); in multi-agent RL there is additionally the question of which agents' actions led to
this reward.

14
Q

Rewards for cooperating agents

A

Ordered from less to more learnability:
- Full system reward: everyone gets a reward if the whole system is successful
- Local reward: direct local feedback, so agents learn quicker
- Difference reward: each agent gets information about its own impact, based on the system reward with and
without that agent's action (see the sketch below)
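A small sketch of the difference-reward idea; global_reward, the joint action as a dict, and the default action are all illustrative assumptions:

    def difference_reward(global_reward, joint_action, agent, default_action=None):
        # Compare the system reward for the observed joint action against the reward
        # obtained when this agent's action is replaced by a default/null action.
        counterfactual = dict(joint_action)
        counterfactual[agent] = default_action
        return global_reward(joint_action) - global_reward(counterfactual)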

15
Q

Competing agents

A
  • Each agent has its own reward function.
  • Deterministic policies can be exploited, so agents need to learn mixed policies: an agent that always acts
    according to its single best policy gets stuck in it and becomes predictable.
  • The optimal policy depends on the other agents (best-response policy).
16
Q

Best response policy

A

Strategy concept for competing agents: the policy that is optimal given the policies of the other agents → given the
strategies of my opponents, this is my best strategy.

Can result in a Nash equilibrium: a collection of strategies such that no player can improve by changing only its own policy; each player's strategy is a best response to the others.
OR
Pareto optimality: a joint strategy that cannot be changed to improve the situation of one player while keeping the situation of all others equal or better.

There can be multiple of each.

Because the other agents' policies change, the best-response policy moves as well, and the problem is no longer Markovian.

17
Q

Non-stationary environments

A

Action outcomes are jointly determined by all agents. Because the agents learn
simultaneously, an agent's action no longer influences the state according to a
fixed probability distribution. This means the environment is no longer Markovian.

18
Q

Symmetric game

A

The reward structure looks exactly the same whether you are the column player or the row
player; there is no difference (a given action pair leads to the same rewards regardless of which role a player has), e.g. the prisoner's dilemma (whose mutually best outcome is not stable).
A game is asymmetric when the values in the cells differ between the players.

19
Q

Zero-sum game

A

Adding the rewards of the two players in any of the outcomes gives 0 (what one player earns, the other loses). The best strategy is a mixed (random) one, since any deterministic policy can be exploited.

20
Q

Q-learning in normal-form games

A

Independent learner → the other agents are ignored; each agent behaves as in single-agent learning and the
influence of the other agents is treated as noise in a stochastic environment.
This might still work, but methods have been developed with the explicit aim of converging and behaving rationally, e.g. Learning Automata.

21
Q

Learning Automata

A

Policy iteration in stateless environments; originally assumes a finite action set.
Like Q-learning, it uses the immediate reward to update its action-selection probabilities.
Each agent keeps a probability vector with one entry per action that stores its action preferences (how likely it is to perform each action).
These values are updated every time step: the probability of the chosen action is increased and the others are decreased, scaled by the reward. Actions that are already likely change little, unlikely actions change more, so an action that is very likely but yields poor reward is corrected strongly.
alpha and beta are the reward and penalty parameters:
alpha = 1 and beta = 0 gives Cross learning
alpha = beta gives linear (reward-penalty) learning

Rewards are assumed to lie between 0 and 1.
Does not take state information into account (stateless approach); a Network of Learning Automata can.
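A sketch of one linear learning-automaton update in Python (reward r assumed in [0, 1]; probs is the action-probability vector, chosen the index of the executed action; parameter values are illustrative):

    def la_update(probs, chosen, r, alpha=0.1, beta=0.0):
        k = len(probs)
        new = []
        for j, p in enumerate(probs):
            if j == chosen:
                # Reward pushes the chosen action up, penalty pushes it down.
                new.append(p + alpha * r * (1 - p) - beta * (1 - r) * p)
            else:
                new.append(p - alpha * r * p + beta * (1 - r) * (1.0 / (k - 1) - p))
        return new

With alpha = 1 and beta = 0 this reduces to Cross learning; with alpha = beta it is the linear reward-penalty scheme.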

22
Q

Network of Learning Automata

A

To handle different states, one can use a different learning automaton for each
state.
Then use the average reward obtained since the last visit in the update:
r = ( sum_{t=l}^{T} r_t ) / (T - l)
where T is the current time step and l is the time this state was last
visited.
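A tiny sketch of that update signal (the names are assumptions):

    def average_reward_since_last_visit(rewards, last_visit, now):
        # rewards[t] is the reward at time step t; last_visit = l, now = T.
        return sum(rewards[last_visit:now]) / (now - last_visit)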

23
Q

Cross Learning

A

Originally intended to describe human learning behaviour. Notation: x_i for the probability of action i (instead of pi(a_i)), capital R
for the reward (instead of r). The chosen action i is updated as x_i <- x_i + R(1 - x_i); every other action j is updated as x_j <- x_j - R x_j.
To make someone unlearn a bad habit, there needs to be a very bad, immediate 'punishment'. Because R and the probabilities lie in [0, 1], the subtracted part is always smaller than the current value, so the updates stay legal probabilities.
'Doing something makes you do it again': choosing one action lowers the probability of every other
action, no matter the reward.
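As a sketch, the Cross-learning update is the learning-automaton update above with a full reward step and no penalty term (alpha = 1, beta = 0):

    def cross_learning_update(probs, chosen, R):
        # Chosen action moves toward 1 by R*(1 - x); every other action shrinks by R*x.
        return [p + R * (1 - p) if j == chosen else p - R * p
                for j, p in enumerate(probs)]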

24
Q

Regret Minimization

A

Stateless approach, expandable through a network of learners.
Assumes hindsight knowledge of the best possible payoff and calculates a loss for taking action a_j instead of the best action: l_j = R - R_j (with R the best payoff).
Polynomial Weights:
Maintains a set of weights w_j, one for each action
Updates each weight according to the perceived loss: w_j <- w_j * (1 - alpha * l_j)
The weights are then normalized so that the policy is a legal probability distribution.
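A polynomial-weights sketch; the per-round payoffs for every action (the hindsight knowledge) and the value of alpha are assumptions:

    def polynomial_weights_update(weights, payoffs, alpha=0.1):
        best = max(payoffs)
        losses = [best - p for p in payoffs]             # l_j = R - R_j
        # alpha should be small enough that 1 - alpha * l_j stays positive.
        weights = [w * (1 - alpha * l) for w, l in zip(weights, losses)]
        total = sum(weights)
        return [w / total for w in weights]              # normalize into a legal policy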

25
Q

Joint Action Learning

A

Opponent modelling: remember an action-selection count C_j^a for every other agent j and action a.
Learn Q-values over joint actions.
Coordination becomes possible in cooperative MAS.
Much more bookkeeping (statistics) is required.
Assumes that observing the actions of all agents is possible.
Might also need to know about the other agents' policy-update mechanisms to make correct
predictions.
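A partial sketch of the joint-action-learning bookkeeping for one agent (class and method names are illustrative; the Q-value update itself is omitted):

    from collections import defaultdict
    from itertools import product

    class JointActionLearner:
        def __init__(self, my_actions, other_actions):
            self.my_actions = my_actions
            self.other_actions = other_actions          # one action list per other agent
            self.counts = [defaultdict(int) for _ in other_actions]
            self.Q = defaultdict(float)                 # Q[(my_action, others_joint_action)]

        def observe(self, others_joint_action):
            # Count how often each other agent selects each of its actions.
            for i, a in enumerate(others_joint_action):
                self.counts[i][a] += 1

        def expected_value(self, my_action):
            # Value of my_action against the empirical frequencies of the other agents.
            value = 0.0
            for joint in product(*self.other_actions):
                prob = 1.0
                for i, a in enumerate(joint):
                    total = sum(self.counts[i].values())
                    prob *= self.counts[i][a] / total if total else 1.0 / len(self.other_actions[i])
                value += prob * self.Q[(my_action, joint)]
            return value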