Multi Agent Reinforcement Learning 2 Flashcards

1
Q

Markov Games

A

A game that depends on state variables that summarize the history of the game. Value iteration
in Markov Games (multidimensional): every agent gets its own reward, and the value of a state equals the
maximum Q-value over all actions in that state. T = transition probability, o = the other players' action, V(s') =
value of the follow-up state.

Playing optimally with direct feedback uses probability vectors x and y to represent the policies of the column
and row players, where policies are probability distributions. For example, x_2 is the probability of playing
column 2. The expected payoff is the sum over all cells of the probability of choosing that cell (probability of
choosing row i times probability of choosing column j) times the reward of that cell, i.e. yᵀAx. Use Q-values to
estimate yᵀAx. If the column player chooses a fixed x, the row player should play the y that minimizes yᵀAx.
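A minimal sketch (Python/NumPy) of the expected payoff yᵀAx and the row player's minimizing response against a fixed column policy x; the payoff matrix and variable names are illustrative:

    import numpy as np

    # Payoff matrix A for the row player (rows = row player's actions,
    # columns = column player's actions). Matching pennies as an example.
    A = np.array([[ 1.0, -1.0],
                  [-1.0,  1.0]])

    def expected_payoff(y, A, x):
        """y and x are probability distributions over rows/columns; returns yᵀAx."""
        return y @ A @ x

    # The column player fixes a mixed policy x.
    x = np.array([0.7, 0.3])

    # Against a fixed x, the minimizing row player can use a pure best response:
    # put all probability on the row with the smallest expected payoff (A @ x).
    row_values = A @ x                # expected payoff of each pure row
    best_row = np.argmin(row_values)  # minimizing response
    y = np.eye(2)[best_row]

    print(expected_payoff(y, A, x))   # value achieved against this x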

2
Q

MiniMax Q learning

A
Initialize:
Initialize V(s) = 1 and Q(s,a,o) = 1
Initialize pi (the probability of doing an action) to a uniform action probability
alpha = 1 -> the learning rate will decay

Choose action:
with probability explore, choose a random action
otherwise choose an action according to pi(s,a)
(epsilon-greedy)

Learn: (observe reward & state, then adapt the value)
after receiving reward r for doing action a against opponent action o in s and moving to s':
Q(s,a,o) = (1-alpha)*Q(s,a,o) + alpha * (r + gamma * V(s'))
Use linear programming to update the policy and value:
pi(s,·) = argmax_pi min_o sum_a pi(s,a) * Q(s,a,o), and V(s) is the resulting min value

gamma is the discount factor applied to the follow-up value V(s')

We deal with a 3D Q-table because it is indexed by the state, our own action, and the opponent's action
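A minimal sketch (Python/NumPy/SciPy) of the linear program minimax-Q solves per state; the function and variable names are illustrative:

    import numpy as np
    from scipy.optimize import linprog

    def minimax_value(Q_s):
        """Solve max_pi min_o sum_a pi[a] * Q_s[a, o] by linear programming.

        Q_s is the Q-table slice for one state: shape (n_actions, n_opponent_actions).
        Returns (pi, v): the maximin policy pi(s, .) and the state value V(s).
        """
        n_a, n_o = Q_s.shape
        # Variables: [pi_1, ..., pi_na, v]; linprog minimizes, so minimize -v.
        c = np.zeros(n_a + 1)
        c[-1] = -1.0
        # For every opponent action o:  v - sum_a pi[a] * Q_s[a, o] <= 0
        A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
        b_ub = np.zeros(n_o)
        # Probabilities must sum to 1.
        A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
        b_eq = np.array([1.0])
        bounds = [(0.0, 1.0)] * n_a + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n_a], res.x[-1]

    # Example: matching-pennies-like Q-values in one state.
    Q_s = np.array([[ 1.0, -1.0],
                    [-1.0,  1.0]])
    pi, v = minimax_value(Q_s)   # pi ~ [0.5, 0.5], v ~ 0.0

    # One minimax-Q update after observing (s, a, o, r, s'):
    # Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s'])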

3
Q

MiniMax Properties

A

It's naive: it follows its own action proportions.

Designed for zero-sum games.

Can be slow because of the LP at each step.
Converges to a safe strategy that still does well if the opponent plays badly.

4
Q

Nash Q learning

A

Less naive than minimax because the Q-values are Nash Q-values.

For general-sum games, where coordination or competition is possible.

Need to maintain our own Q-values and the Q-values of the opponents.
Updates are based on the other agents acting rationally.

  • Nash Q-values: the optimal Q-values received in a Nash equilibrium. They're defined on state-joint-action
    tuples and are the expected sum of discounted rewards when all agents follow the specified Nash
    equilibrium strategies from the next time step onwards (pi(s)·Q(s): the probability of choosing each action
    times its Q-value).
  • A Nash equilibrium is a set of policies where no policy can be changed individually to improve the position
    of the agent that changes it. NashQ^t_j(s'): t = time, j = a specific agent, s' = a certain state. Encodes what
    is supposed to come: the estimated value from state s' onward when all agents play the NE.
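A minimal sketch (Python/NumPy) of the Nash-Q update for agent j in a two-agent game; the equilibrium strategies pi1, pi2 for the stage game at s' are assumed to come from an external equilibrium solver, and all names are illustrative:

    import numpy as np

    def nash_q_value(pi1, pi2, Q_j_next):
        """NashQ_j(s'): expected value for agent j when both agents play the
        equilibrium strategies (pi1, pi2) in the stage game at s'."""
        return pi1 @ Q_j_next @ pi2

    def nash_q_update(Q_j, s, a1, a2, r_j, s_next, pi1, pi2, alpha=0.1, gamma=0.9):
        """One Nash-Q update of agent j's table entry Q_j[s, a1, a2].

        pi1, pi2 are Nash equilibrium strategies for the stage game at s_next,
        computed elsewhere from the estimated Q-tables (assumed here)."""
        nash_val = nash_q_value(pi1, pi2, Q_j[s_next])
        Q_j[s, a1, a2] = (1 - alpha) * Q_j[s, a1, a2] + alpha * (r_j + gamma * nash_val)
        return Q_j

    # Example usage with 2 states and 2 actions per agent:
    Q_j = np.zeros((2, 2, 2))
    pi1 = pi2 = np.array([0.5, 0.5])   # e.g. output of an equilibrium solver (assumed)
    Q_j = nash_q_update(Q_j, s=0, a1=1, a2=0, r_j=1.0, s_next=1, pi1=pi1, pi2=pi2)

Each agent keeps such a table for itself and for every other agent, so the equilibrium of the stage game at s' can be computed from the estimated Q-values.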
5
Q

Nash Properties

A

· Agents must observe all rewards, not just their own
· Computing a NE for each agent at each step seems like a lot of work
· Convergence guarantees are limited (they require strong restrictions)
· Performs quite well, but is computationally intensive

6
Q

Friend-or-foe Q-learning

A

Stronger convergence guarantees than Nash Q-learning. Distinguishes between competitive and cooperative
games: other agents are labelled as friends (cooperative: maximize over joint actions) or foes (competitive:
use the minimax value). Works for cooperative or zero-sum games only.
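A minimal sketch (Python/NumPy) of the two value operators for one state; names are illustrative, and the foe value is approximated on a grid of mixed strategies here instead of the exact maximin LP used in minimax-Q:

    import numpy as np

    def friend_value(Q_s):
        """Friend: assume the other agent helps us, so take the maximum
        Q-value over all joint actions in this state."""
        return Q_s.max()

    def foe_value(Q_s):
        """Foe: assume the other agent works against us, so take the maximin
        value over mixed strategies. Approximated on a coarse grid for this
        2x2 sketch; in practice it is solved exactly with a linear program."""
        ps = np.linspace(0.0, 1.0, 101)            # our probability of action 0
        policies = np.stack([ps, 1.0 - ps], axis=1)
        worst_case = (policies @ Q_s).min(axis=1)  # opponent picks the worst column
        return worst_case.max()

    Q_s = np.array([[ 1.0, -1.0],
                    [-1.0,  1.0]])
    print(friend_value(Q_s))  # 1.0
    print(foe_value(Q_s))     # ~0.0 (maximin value of matching pennies)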

7
Q

Cross Learning, Regret Minimization, Boltzmann Dynamics

A

E[Δx_i] = x_i (f_i − f_i x_i) + Σ_{j≠i} x_j (−f_j x_i) = x_i (f_i − Σ_j x_j f_j)
(update the chosen action and update the other actions; f_i is the expected reward for action i)

· We're going to idealize it.
· Normally you pick one action, and the choice of that action influences how the probability of that
action changes and how you should update the probabilities of the other actions. Ideally we would like to be
able to do the updates for all of the actions at each step, weighted by the correct probability.
· If we were to execute action i, this is how that execution would influence that action (how it would
update) and how the other actions would update (at the same time). We want to act as if we try all actions
at every time step, so we calculate the expected change.

Regret Minimization Dynamics: the gradient is basically the cross-learning gradient divided by a formula that
modulates the learning rate according to the best action's update.

  • Boltzmann Exploration Dynamics: uses the Bellman equation, but there is no state involved. Time is the
    most important parameter, since we want to study how the action probability changes over time.
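A minimal sketch (Python/NumPy) of one sampled cross-learning step and the idealized expected change for a stateless 2-action game; the reward values and names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.array([0.8, 0.3])      # expected reward of each action (illustrative)
    x = np.array([0.5, 0.5])      # current policy
    alpha = 0.1

    # One sampled cross-learning update: play action i, receive reward r in [0, 1],
    # then move probability mass towards i proportionally to r.
    i = rng.choice(2, p=x)
    r = f[i]                      # stand-in for a sampled reward
    x = x - alpha * r * x         # all actions shrink: x_j <- x_j - alpha * r * x_j
    x[i] += alpha * r             # net effect for the played action: x_i <- x_i + alpha * r * (1 - x_i)

    # Idealized expected dynamics, as in the flashcard:
    # E[dx_i] = x_i * (f_i - sum_j x_j * f_j)
    expected_change = x * (f - x @ f)
    print(x, expected_change)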
8
Q

Frequency Adjusted Q-learning

A

The policy determines the update frequency, and an action value is only updated when it is selected for
execution. This pollutes the gradient → the agents have no clue where to go (you mess with the gradient that
you are trying to use).
To get the gradient that we want, we adjust 𝛼 with a division factor 1/x_i. This is called frequency adjusted
Q-learning and gives the update Q_i ← Q_i + (1/x_i) · 𝛼 · (r + 𝛾 · max_j Q_j − Q_i).
However, because of this, if the probability of choosing action i is very small (x_i is small), the Q-value update
may become very large. This can be solved by multiplying the division factor with 𝛽 and taking the minimum of
that and 1, i.e. using min(𝛽/x_i, 1) · 𝛼.
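A minimal sketch (Python/NumPy) of one frequency-adjusted update for a stateless game; the constants and names are illustrative:

    import numpy as np

    Q = np.array([0.0, 0.0])       # action values
    x = np.array([0.9, 0.1])       # current policy (probability of each action)
    alpha, beta, gamma = 0.1, 0.01, 0.9

    def faq_update(Q, x, i, r):
        """Frequency-adjusted update for the selected action i with reward r.

        The learning rate is scaled by min(beta / x[i], 1), so rarely chosen
        actions are compensated for their low update frequency without the
        update exploding when x[i] is tiny."""
        lr = min(beta / x[i], 1.0) * alpha
        Q[i] += lr * (r + gamma * Q.max() - Q[i])
        return Q

    Q = faq_update(Q, x, i=1, r=1.0)   # rare action gets a relatively larger, but bounded, step
    print(Q)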

9
Q

Gradient Ascent Learning

A

Used for describing learning algorithms; we know what we want the update to be. Say
that V translates a policy into its policy value (what it is worth to follow policy x); then we want to update x
according to the gradient: x ← x + 𝜂 · ∂V(x)/∂x for some step size 𝜂, while keeping x a legal probability distribution.
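A minimal sketch (Python/NumPy) of gradient ascent on the policy value V(x) = xᵀAy for one player of a 2-action matrix game, with the policy parameterized as x = [p, 1-p]; the matrix, step size, and names are illustrative:

    import numpy as np

    A = np.array([[3.0, 0.0],
                  [5.0, 1.0]])      # our payoff matrix (illustrative)
    y = np.array([0.5, 0.5])        # the other player's (fixed) policy
    p = 0.5                         # our probability of playing row 0
    eta = 0.05                      # step size

    # V(x) = x^T A y with x = [p, 1 - p], so dV/dp = (A y)[0] - (A y)[1].
    for _ in range(200):
        grad = (A @ y)[0] - (A @ y)[1]
        p = float(np.clip(p + eta * grad, 0.0, 1.0))   # keep x a valid distribution

    print([p, 1 - p])   # mass moves towards the row with the higher expected payoff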

10
Q

Policy Hill Climbing

A

Two learning rates 𝛼 and 𝛿 (if 𝛿 = 1 it's like Q-learning). Idea: in a loop, select (with
exploration) and execute an action a from state s, and observe r and s'. Update Q(s, a) and update 𝜋 (while
constraining it to a legal probability distribution).
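A minimal sketch (Python/NumPy) of one policy hill-climbing step; table sizes, indices, and constants are illustrative, and the clip-and-renormalize step is just a simple way of keeping 𝜋 a legal distribution:

    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # policy table
    alpha, delta, gamma = 0.1, 0.05, 0.9

    def phc_update(Q, pi, s, a, r, s_next):
        """One PHC step: ordinary Q-learning update, then move pi(s, .) a small
        step (delta) towards the greedy action of Q(s, .)."""
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        greedy = Q[s].argmax()
        for b in range(n_actions):
            if b == greedy:
                pi[s, b] += delta
            else:
                pi[s, b] -= delta / (n_actions - 1)
        pi[s] = np.clip(pi[s], 0.0, 1.0)
        pi[s] /= pi[s].sum()          # constrain to a legal probability distribution
        return Q, pi

    Q, pi = phc_update(Q, pi, s=0, a=1, r=1.0, s_next=2)
    print(pi[0])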

11
Q

Win or Learn Fast (WoLF)

A

A policy hill climbing approach with an adjustable learning rate 𝛿. It actually uses two
learning rates for policy adjustment: a small 𝛿𝑤 for when it's winning and a larger 𝛿𝑙 for when it's losing and
thus needs to learn fast. It compares the expected value of the current policy to that of the average policy. If the
current policy is doing better, choose the small learning rate 𝛿𝑤. If the current policy is doing worse, use the
large learning rate 𝛿𝑙.
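A minimal sketch (Python/NumPy) of how the learning rate is picked in one state; the names are illustrative, and avg_pi_s stands for the running average of the policy in that state:

    import numpy as np

    delta_w, delta_l = 0.01, 0.04   # small when winning, large when losing

    def wolf_delta(Q_s, pi_s, avg_pi_s):
        """Compare the expected value of the current policy with that of the
        average policy in this state; learn slowly when winning, fast when losing."""
        current_value = pi_s @ Q_s
        average_value = avg_pi_s @ Q_s
        return delta_w if current_value > average_value else delta_l

    Q_s = np.array([1.0, 0.2])          # Q-values of the two actions in s
    pi_s = np.array([0.8, 0.2])         # current policy in s
    avg_pi_s = np.array([0.5, 0.5])     # average policy in s
    print(wolf_delta(Q_s, pi_s, avg_pi_s))   # winning here -> small delta_w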
