Final Exam (RL & Game Theory) Flashcards

1
Q

What are the assumptions of “Sequence of Rewards”?

A
  1. Infinite Horizon: We assume unlimited time to find the optimal policy. If the horizon were finite, policies would no longer be stationary, since the best action would change with how much time remained.
  2. Utility of Sequences: Stationary preferences: if U(s0, s1, s2, …) > U(s0, s1', s2', …), then U(s1, s2, …) > U(s1', s2', …). This introduces discounting, which makes the infinite sum of rewards equal a finite value (see the formula below).
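A hedged sketch of the bound the second assumption relies on (standard result, not stated on the card): with rewards bounded by R_max and gamma < 1, the discounted infinite sum is finite.

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma},
  \qquad 0 \le \gamma < 1
```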
2
Q

Bellman Equation: What is the true utility?

A

The immediate reward at state s plus the discounted sum of all rewards received from s onward, assuming the agent acts optimally thereafter.
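A hedged reconstruction of the Bellman equation behind this definition, using R for the immediate reward, T for the transition probabilities, and γ for the discount factor:

```latex
U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')
```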

3
Q

Does the Bellman equation, as an algorithm, converge? (As an algorithm, Bellman → Value Iteration.)

A

Yes. You can start with arbitrary (even random) utilities for all the states; each iteration of the update adds more truth to the estimates. You will eventually converge, because every update propagates real reward information, and since gamma < 1 the influence of the incorrect initialization is eventually overwhelmed.
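A minimal value-iteration sketch under assumed inputs (the dictionary layout for T and R is illustrative, not from the cards):

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeat the Bellman update until the utilities stop changing.

    T[(s, a)] is a list of (prob, next_state) pairs; R[s] is the immediate reward.
    """
    U = {s: 0.0 for s in states}  # arbitrary initialization
    while True:
        U_new = {}
        for s in states:
            best = max(sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions)
            U_new[s] = R[s] + gamma * best  # Bellman update
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new  # converged: largest change is below tolerance
        U = U_new
```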

4
Q

What is the difference between utilities and policies?

A

Think of a policy like classification: it only has to pick the best action, so only the ordering of values matters. Think of utilities / the utility function like regression: it assigns an exact numeric value to each state.

5
Q

How is Q different from U (utility of a state)?

A

Q, unlike U, defines value as the value of arriving in some state S, taking an action A to leave it, and proceeding optimally thereafter. U is only the value of arriving in state S (and proceeding optimally).

U(s) = max_a Q(s,a)
pi(s) = argmax_a Q(s,a)
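For reference, a hedged sketch of the Q-form of the Bellman equation in the same notation (standard form, not quoted from the card):

```latex
Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')
```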
6
Q

Does Q-Learning Converge? If so, how and to what does it converge?

A

Yes.

Using the learning rate alpha, Q-learning updates its estimates incrementally. If every state-action pair is visited infinitely often, the estimates converge to the expected values, provided the learning rates sum to infinity while their squares sum to a finite value.

Under those conditions, Q-learning converges to the solution of the Bellman equation (Q*).
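A hedged sketch of the incremental update and the learning-rate conditions the card refers to (standard Q-learning rule):

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha_t \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big],
\qquad \sum_{t} \alpha_t = \infty, \quad \sum_{t} \alpha_t^{2} < \infty
```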

7
Q

How can Q-learning relate to optimization?

A

By deciding which actions to take during learning, we can approach Q-learning with an optimization mindset.

  • Simulated annealing → the exploration vs. exploitation tradeoff: take a random action (explore) vs. take the best action argmax_a Q(s,a) (exploit), as in the ε-greedy sketch below.
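A minimal ε-greedy selection sketch illustrating that tradeoff (the function name, Q layout, and ε value are illustrative assumptions):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit: argmax_a Q(s, a)
```

In the simulated-annealing spirit, ε can be decayed over time so that exploration fades as the Q estimates improve.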
8
Q

Define each part of the following: “2-player zero-sum finite deterministic game of perfect information”.

A

2-player: Two agents play against each other, each assumed to try to MAXIMIZE its own reward.
Zero-Sum: If A wins 5, then B loses 5; that is, the rewards of A and B always sum to a constant value.
Finite: The game has a finite number of states/moves and ends; it is played only once.
Deterministic: There is no chance/probability in a player's actions/transitions.
Perfect Information: Both players know exactly which state they are in (and which state the other player is in).

9
Q

The Fundamental Result holds for what types of games? What is its definition?

A

It holds for 2-player zero-sum games of perfect information (deterministic or probabilistic; it works either way).

The Fundamental Result states that minimax (A goes first) == maximin (B goes first): both give the same value, and there ALWAYS exists an optimal pure strategy for each player.
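A minimal sketch checking that the two orderings agree on a small payoff matrix (the matrix values are made-up examples with a pure saddle point):

```python
# Payoff matrix from A's perspective: rows are A's strategies, columns are B's.
payoffs = [
    [3, 5],
    [1, 2],
]

# A goes first: A picks the row whose worst case (B then minimizes) is best.
value_a_first = max(min(row) for row in payoffs)

# B goes first: B picks the column whose best case for A (A then maximizes) is worst.
value_b_first = min(max(payoffs[r][c] for r in range(len(payoffs)))
                    for c in range(len(payoffs[0])))

print(value_a_first, value_b_first)  # both 3: the value of the game
```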

10
Q

What happens when a game no longer has perfect information?

A

Minimax is no longer equal to maximin. Mixed strategies are introduced, which assign a probability P to each action a player might take.

11
Q

How do you find the value of the game with mixed strategies?

A

For each action of the opponent, write your expected payoff as a function of the probability P with which you play one of your actions, and plot it as a line in P. Your guaranteed payoff is the lower envelope (pointwise minimum) of these lines; the maximum of that lower envelope is the value of the game.
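A minimal numeric sketch for a 2x2 game, assuming the row player mixes with probability p on its first action (the payoff values are made up):

```python
import numpy as np

# Row player's payoffs: rows = row player's actions, columns = opponent's actions.
A = np.array([[ 4, -1],
              [-2,  3]])

p = np.linspace(0, 1, 1001)  # probability of playing row 0
# Expected-payoff line against each opponent column: p*A[0,c] + (1-p)*A[1,c]
lines = [p * A[0, c] + (1 - p) * A[1, c] for c in range(A.shape[1])]
lower_envelope = np.minimum.reduce(lines)  # opponent picks its best (worst-for-us) column

value = lower_envelope.max()               # value of the game
best_p = p[lower_envelope.argmax()]
print(f"value = {value:.3f} at p = {best_p:.3f}")  # 1.000 at p = 0.500 for this matrix
```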

12
Q

What is strict dominance?

A

For Player A (whose actions are the rows): a row is strictly dominant if it is strictly preferred for every column, that is, for every action of the opponent. Likewise, for Player B (whose actions are the columns): a column is strictly dominant if it is strictly preferred (highest value for B) for every row, that is, for every action of the opponent.

13
Q

In layman's terms, explain Nash equilibrium.

A

For n players: once they have reached a Nash equilibrium, no player ever wants to change their own strategy, even if they knew beforehand what the other players were going to do. (Works for pure and mixed strategies.)
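A minimal sketch of that definition for a 2-player pure-strategy game; the payoff matrices and the checked profile are illustrative assumptions (a standard prisoner's dilemma):

```python
import numpy as np

def is_pure_nash(A, B, row, col):
    """True if neither player can gain by unilaterally changing their own action.

    A[r, c] is the row player's payoff, B[r, c] the column player's.
    """
    row_ok = A[row, col] >= A[:, col].max()  # row player has no better row
    col_ok = B[row, col] >= B[row, :].max()  # column player has no better column
    return row_ok and col_ok

# Prisoner's dilemma (payoffs as negative prison years); action 1 = defect.
A = np.array([[-1, -9], [ 0, -6]])   # row player
B = np.array([[-1,  0], [-9, -6]])   # column player
print(is_pure_nash(A, B, 1, 1))      # True: mutual defection is the Nash equilibrium
```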

14
Q

What are the 3 fundamental theorems about Nash equilibria?

A
  1. In an n-player pure-strategy game, if iterated elimination of strictly dominated strategies eliminates all but one strategy, that remaining strategy is the Nash equilibrium.
  2. Any Nash equilibrium will survive the iterated elimination of strictly dominated strategies (consistent with 1).
  3. If the number of players is finite and the game is finite (a finite number of strategies), then there exists at least one Nash equilibrium. Note there could be multiple, and this holds once mixed strategies are allowed.

Aside: If a game with a Nash equilibrium is repeated n times, the solution is that Nash equilibrium repeated n times; no player ever wants to deviate.

15
Q

What is mechanism design?

A

Figuring out how to change/alter the utility values of the game itself in order to elicit a certain behavior from a person/player.

16
Q

How does discounting play a role in Game Theory?

A

It is useful when playing a repeated game with an uncertain end: gamma can be interpreted as the probability that the game continues for another round (so 1 - gamma is the probability it ends after any given round). From this, the expected number of rounds is 1 / (1 - gamma).
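A hedged sketch of where that expectation comes from, assuming the game continues each round with probability γ:

```latex
\mathbb{E}[\text{rounds}] = \sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma}
```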

17
Q

In general, what is the Tit-for-Tat Strategy?

A
  1. Always cooperate on the first round.
  2. Then copy the opponent's move on every round afterwards.

18
Q

What is mutual best response?

A

A pair of strategies in which each player's strategy is a best response to the other player's strategy. This allows for potentially multiple Nash equilibria.

19
Q

What is the Folk Theorem in Game Theory?

A

Describes a set of payoffs that can result from Nash strategies in repeated games.

Any feasible payoff profile that strictly dominates the minmax/security-level profile can be realized as a Nash equilibrium payoff profile, given a sufficiently large discount factor (so that we don't fall into the n-repeated-games trap above).

20
Q

What is the use of a 2-player plot?

A

Plot each joint outcome as a point (Player 1's payoff, Player 2's payoff). The convex hull of these points, the shape cut out by the payoffs of the joint actions, shows which average payoffs are achievable by some joint strategy; it identifies the "feasible region."

21
Q

Define the minmax profile. How does it differ from the security-level profile? What is it used for?

A

It is a pair of payoffs (one for each player), each being the payoff that player can guarantee against an opponent acting maliciously, i.e., by playing as if the game were zero-sum. The security-level profile is the same idea but allows mixed strategies (the minmax profile uses pure strategies).

It is used to find the "acceptable region": what a player can guarantee itself in an adversarial situation. Think of it as the "preferable region."

22
Q

Why does the Folk Theorem make sense?

A

The opponent can always threaten to force you down to your minmax payoff (in the iterated prisoner's dilemma this is (D, D)), so both players are better off following the agreed-upon joint strategy.

23
Q

What is the Grim Trigger Strategy? In context, is the threat at all plausible?

A

Cooperate until the other player defects, then defect forever. (They've crossed the line and you will never go back to cooperating.) The threat isn't really plausible: in carrying it out you forgo reward yourself in order to punish the opponent, so the punishment hurts you as well.

24
Q

Define subgame perfect.

A

Subgame perfection is another type of equilibrium: the strategy profile remains a best response at every point of the game, after any possible history, so players always want to keep following it.

To disprove it, find a sequence of moves after which one player does better by deviating from what the machine strategy (Tit-for-Tat, Grim Trigger, etc.) prescribes.

If the strategy's threats are plausible (you would actually carry them out), it is subgame perfect; if the threats are implausible, it is not.

25
Q

What is the Pavlov Strategy? What is unique about it?

A

Cooperate if both players made the same move last round (agreed); defect if they made different moves (disagreed). So if the opponent defects while you cooperated, you defect, and you keep defecting until the opponent defects too (you agree again), at which point you return to cooperating.

It is subgame perfect because its threat is plausible: it is genuinely worthwhile to punish your opponent when they defect against you, since the behavior eventually stabilizes and Pavlov vs. Pavlov ends up mutually cooperating.

26
Q

What is the Computational Folk Theorem?

A

For any 2-player non-zero-sum game, you can build a Pavlov-like machine that forms a subgame-perfect Nash equilibrium, and you can build it in polynomial time. Some mutual benefit must exist; that is, the game CANNOT be zero-sum.

27
Q

What are Stochastic Games? How do they reduce to simpler models when constraints are placed on them?

A

Stochastic games generalize MDPs and are the formal model of multi-agent RL. They reduce as follows:

  1. If Reward_A = -Reward_B -> a zero-sum game.
  2. If Player B becomes irrelevant -> an MDP.
  3. If there is only one state (|S| = 1) -> a repeated game.
28
Q

What are the solutions to Stochastic Games? Zero-Sum? General-Sum?

A

Zero-Sum: minimax-Q

  • value iteration works.
  • it converges.
  • there is a unique solution (one answer) for Q*.
  • policies can be computed independently: each player can run their own minimax-Q and arrive at the same policies.
  • updates are efficient (polynomial time).
  • knowing Q*, we can derive a specific policy.

General-Sum: Nash-Q

  • everything above fails.
  • No unique solution since there can exist multiple Nash equilibria.
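A hedged sketch of the minimax-Q backup for the zero-sum case above (standard formulation; o denotes the opponent's action and π(s) the player's mixed strategy at s):

```latex
Q(s, a, o) \leftarrow (1-\alpha)\, Q(s, a, o) + \alpha \big[ r + \gamma\, V(s') \big],
\qquad
V(s) = \max_{\pi(s)} \min_{o} \sum_{a} \pi(s, a)\, Q(s, a, o)
```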