Multi-Agent Reinforcement Learning 2 Flashcards
Markov Games
a game that depends on state variables which summarize the history of the game. Value iteration
in Markov games (multi-dimensional): every agent gets its own reward, and the value of a state equals the
maximum Q-value over all (joint) actions in that state. T is the transition probability, o is the other players' action, and V(s') is the
value of the follow-up state.
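A minimal sketch of one value-iteration sweep under these definitions (my own illustration, assuming tabular arrays T and R indexed by state, own action and other player's action); extracting V from Q by a plain maximum is the simple rule from the card, zero-sum games would use the maximin instead (see minimax-Q below).

```python
import numpy as np

def value_iteration_sweep(T, R, V, gamma=0.9):
    """One sweep of value iteration for a two-player Markov game.

    T: transition probabilities, shape (S, A, O, S')
    R: own reward, shape (S, A, O)
    V: current state values, shape (S,)
    """
    Q = R + gamma * np.einsum('saot,t->sao', T, V)  # Q(s, a, o) backup over joint actions
    V_new = Q.max(axis=(1, 2))                      # V(s) = max over all (joint) actions, as above
    return Q, V_new
```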
Playing optimally with direct feedback uses vectors x and y to represent the policies of the column
and row players, where policies are probability distributions. For example, x_2 is the probability of playing
column 2. The expected payoff is the sum over all cells of the probability of choosing that cell (probability of row i times
probability of column j) times the reward of that cell, i.e. y^T A x. Use Q-values to estimate y^T A x. If the column player chooses a
fixed x, the row player should play the y that minimizes y^T A x.
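As a small illustration (my own example, not from the card), computing y^T A x with a hypothetical payoff matrix A and uniform policies:

```python
import numpy as np

A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])   # example payoff matrix for the row player (matching pennies)
y = np.array([0.5, 0.5])       # row player's policy
x = np.array([0.5, 0.5])       # column player's policy

expected_payoff = y @ A @ x    # sum_i sum_j y_i * A[i, j] * x_j
print(expected_payoff)         # 0.0 under uniform play
```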
MiniMax Q learning
Initialize: V(s) = 1 and Q(s,a,o) = 1, initialize pi (the probability of doing an action) to a uniform distribution, alpha = 1 -> the learning rate will decay
Choose action (epsilon-greedy):
with probability explore, choose a random action
otherwise choose an action according to pi(s,a)
Learn (observe reward and next state, then adapt the values):
after receiving reward r for doing action a against opponent action o in s and moving to s'
Q(s,a,o) = (1 - alpha) * Q(s,a,o) + alpha * (r + gamma * V(s'))
Then use linear programming to solve for pi(s,·) and V(s) (the maximin over Q(s,·,·))
gamma is the discount factor applied to the value of the follow-up state s'
We deal with a 3D Q-table (state, own action, opponent action) because it covers both ourselves and our opponent
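A minimal sketch of this update, assuming a small zero-sum game with tabular Q, V and pi; the linear program is the standard maximin formulation, solved here with scipy.optimize.linprog as one possible solver.

```python
import numpy as np
from scipy.optimize import linprog

def solve_minimax(Q_s):
    """Solve max_pi min_o sum_a pi[a] * Q_s[a, o] with a linear program.

    Q_s: array of shape (n_actions, n_opponent_actions) for one state.
    Returns (pi, V): the maximin policy and the value of the state.
    """
    n_a, n_o = Q_s.shape
    # Decision variables: [pi_1 .. pi_{n_a}, V]; we maximize V, i.e. minimize -V.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o: V - sum_a pi[a] * Q_s[a, o] <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # Probabilities sum to 1.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi, V = res.x[:n_a], res.x[-1]
    return pi, V

def minimax_q_update(Q, V, pi, s, a, o, r, s_next, alpha, gamma):
    """One tabular minimax-Q update after observing (s, a, o, r, s')."""
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s_next])
    pi[s], V[s] = solve_minimax(Q[s])   # re-solve the LP for the updated state
    return Q, V, pi
```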
MiniMax Properties
It's naive: it follows its own action proportions
Designed for zero-sum games
Can be slow because a linear program must be solved at each step
Converges to a safe strategy that still does well if the opponent plays badly
Nash Q learning
Less naive than minimax-Q because the Q-values are Nash Q-values
General-sum games where coordination or competitiveness is possible
Need to maintain our own Q-values and the Q-values of the opponents
Updates are based on the assumption that the other agents act rationally
- Nash Q-values: the optimal Q-values received in a Nash equilibrium. They are defined on state-joint-action
tuples and are the expected sum of discounted rewards when all agents follow the specified Nash
equilibrium strategies from the next time step onwards (pi_i(s) Q_i(s): the probability of choosing action i times its Q-value).
- A Nash equilibrium is a set of policies where no policy can be changed individually to improve the position of its agent.
- NashQ^t_j(s'): t = time, j = a specific agent, s' = a certain state. It encodes what is supposed to come: the
estimated value from state s' onward when all agents play the Nash equilibrium.
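A minimal sketch of the Nash-Q update for agent 1 in a two-player general-sum game; nash_value is a hypothetical helper (not from the card) that returns each agent's expected payoff under a chosen Nash equilibrium of the stage game at s' (e.g. computed with Lemke-Howson).

```python
def nash_q_update(Q1, Q2, s, a1, a2, r1, s_next, alpha, gamma, nash_value):
    """Update agent 1's Q-table after the joint action (a1, a2) in state s."""
    nash_q1, _ = nash_value(Q1[s_next], Q2[s_next])   # NashQ_1(s'): value of s' under the NE
    Q1[s, a1, a2] = (1 - alpha) * Q1[s, a1, a2] + alpha * (r1 + gamma * nash_q1)
    return Q1
```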
Nash Properties
· Agents must observe all rewards, not just their own
· Computing a NE for each agent at each step is a lot of work
· Convergence guarantees are limited (they require strong restrictions)
· Performs quite well, but is computationally intensive
Friend-or-foe Q-learning
Stronger convergence guarantees than Nash Q-learning. Distinguishes
between competitive and cooperative games; works for cooperative or zero-sum games only.
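A minimal sketch of how friend-or-foe Q-learning extracts the value of a state from the slice Q_s = Q(s, ·, ·); the friend/foe switch is my own illustration of the rule, and the foe case reuses the solve_minimax helper from the minimax-Q sketch above.

```python
import numpy as np

def ffq_value(Q_s, opponent_is_friend):
    """Friend: assume the opponent helps maximise; foe: assume the opponent minimises."""
    if opponent_is_friend:
        return Q_s.max()            # cooperative case: value of the best joint action
    _, v = solve_minimax(Q_s)       # zero-sum case: maximin value via the LP above
    return v
```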
Cross learning, Regret Minimization, Boltzmann Dynamics
E[Δx_i] = x_i [f_i − f_i x_i] + Σ_{j≠i} x_j [−f_j x_i]
The first term updates the chosen action, the second term updates the other actions.
fi is the expected reward for action i
· We're going to idealize it.
· Normally you pick one action, and the choice of that action influences how the probability of that
action changes and how you should update the probabilities of the other actions. Ideally we would like to
do the updates for all of the actions at each step, each with the correct probability.
· If we were to execute action i, this is how that execution would influence that action (how it
would update) and how the other actions would update at the same time.
We want to act as if we try all actions at every time step, so we calculate the expected change (see the sketch below).
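A minimal sketch of one Cross learning step in a stateless game, assuming rewards normalised to [0, 1]; the payoff function reward(i) is a stand-in for the environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_learning_step(x, reward, alpha=0.1):
    """Sample one action from policy x, then update all action probabilities."""
    i = rng.choice(len(x), p=x)    # play action i with probability x[i]
    r = reward(i)                  # observed reward, assumed in [0, 1]
    x = x - alpha * r * x          # every action j: x_j <- x_j - alpha * r * x_j
    x[i] += alpha * r              # chosen action i ends up at x_i + alpha * r * (1 - x_i)
    return x
```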
Regret Minimization Dynamics: the gradient is basically cross learning divided by a term that
modulates the learning rate according to the best action's update.
- Boltzmann Exploration Dynamics: uses the Bellman equation, but there is no state involved. Time is the
most important parameter, since we want to study how the frequency of a certain action changes over time.
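For reference, a minimal sketch of a Boltzmann (softmax) policy over Q-values; the temperature tau is the time-dependent parameter these dynamics study, and is typically decayed over time.

```python
import numpy as np

def boltzmann_policy(Q, tau=1.0):
    """Probability of each action is proportional to exp(Q / tau)."""
    prefs = np.exp(Q / tau - np.max(Q / tau))   # subtract the max for numerical stability
    return prefs / prefs.sum()
```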
Frequency Adjusted Q-learning
The policy determines the update frequency, and an action value is only updated when that action is selected for
execution. This pollutes the gradient → the agents have no clue where to go (you mess with the gradient that
you are trying to use).
To get the gradient that we want, we adjust 𝛼 with a division factor 1/x_i. This is called frequency adjusted Q-learning and gives
Q_i ← Q_i + (1/x_i) * 𝛼 * (r + 𝛾 * max_j Q_j − Q_i).
However, because of this, if the probability of choosing action i is very small (x_i is small), the update (and hence the Q-value) may
become very large. This can be solved by multiplying the division factor with 𝛽 and taking the minimum of
that and 1: min(𝛽/x_i, 1).
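A minimal sketch of this frequency adjusted update for a stateless game, assuming a policy x derived from Q (e.g. Boltzmann) and the min(𝛽/x_i, 1) cap described above.

```python
import numpy as np

def faq_update(Q, x, i, r, alpha=0.1, beta=0.01, gamma=0.9):
    """Update Q[i] after playing action i (chosen with probability x[i]) and receiving r."""
    step = min(beta / x[i], 1.0) * alpha           # frequency adjusted learning rate
    Q[i] += step * (r + gamma * np.max(Q) - Q[i])  # ordinary Q-learning target
    return Q
```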
Gradient Ascent Learning
used for describing learning algorithms, but we know what we want the update to be. Say
that V translates a policy into the policy value (what it is worth to follow policy x); then we want to update x
according to x ← x + 𝛼 * ∂V(x)/∂x.
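A minimal sketch of such a gradient step for a matrix game, assuming the agent's payoff matrix A and a fixed opponent policy y so that V(x) = x^T A y and the gradient is A y; the projection back onto a legal distribution here is a simple clip-and-renormalise, not the exact projection.

```python
import numpy as np

def gradient_ascent_step(x, A, y, alpha=0.01):
    """Move the policy x in the direction of dV/dx and keep it a distribution."""
    x = x + alpha * (A @ y)        # gradient of V(x) = x^T A y with respect to x
    x = np.clip(x, 0.0, None)      # crude projection back to a legal probability distribution
    return x / x.sum()
```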
Policy Hill Climbing:
two learning rates 𝛼 and 𝛿 (if 𝛿 = 1 it is like Q-learning). Idea: in a loop, select (with
exploration) and execute an action a from state s and observe r and s'. Update Q(s, a) and then update 𝜋 by moving
probability towards the greedy action (while constraining 𝜋 to a legal probability distribution).
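A minimal sketch of one PHC step with tabular Q and 𝜋; the final projection step is a simple clip-and-renormalise rather than the exact per-action constrained update.

```python
import numpy as np

def phc_step(Q, pi, s, a, r, s_next, alpha=0.1, delta=0.01, gamma=0.9):
    """Q-learning update followed by a small policy step towards the greedy action."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    greedy = np.argmax(Q[s])
    n = Q.shape[1]
    pi[s] -= delta / (n - 1)                   # shrink every action a little ...
    pi[s, greedy] += delta / (n - 1) + delta   # ... and grow the greedy action by delta in total
    pi[s] = np.clip(pi[s], 0.0, 1.0)           # constrain to a legal probability distribution
    pi[s] /= pi[s].sum()
    return Q, pi
```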
Win or Learn Fast (WoLF)
policy hill climbing approach with an adjustable learning rate 𝛿. It actually uses two
learning rates for policy adjustment: a small 𝛿𝑤 for when it's winning and a larger 𝛿𝑙 for when it's losing and
thus needs to learn fast. It compares the expected value of the current policy to that of the average policy. If the
current policy is doing better, choose the small learning rate 𝛿𝑤. If the current policy is doing worse, use the
large learning rate 𝛿𝑙.
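A minimal sketch of the WoLF rule for choosing between 𝛿𝑤 and 𝛿𝑙, assuming a running average policy pi_avg is maintained alongside the current policy pi.

```python
import numpy as np

def wolf_delta(Q, pi, pi_avg, s, delta_w=0.01, delta_l=0.04):
    """Small rate when the current policy beats the average policy (winning), large otherwise."""
    value_current = np.dot(pi[s], Q[s])     # expected value of the current policy in s
    value_average = np.dot(pi_avg[s], Q[s]) # expected value of the average policy in s
    return delta_w if value_current > value_average else delta_l
```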