Final Review pt. 5 Flashcards
Using the CoCo value, we calculate the side payments as: P = coco(u, u_bar). But why does it make sense for the agent with payoff matrix u to receive a side payment of P?
It makes sense because, by cooperating, neither party ends up worse off than it would by not cooperating. The CoCo value splits the game into a cooperative part, maxmax((u + u_bar)/2), and a competitive part, the minimax value of (u - u_bar)/2 (the most an agent can guarantee itself at the other's expense); each agent is paid its CoCo value, so it receives at least what it could secure on its own. The side payments are always equal and opposite: if one agent gives one unit (-1), the other receives one unit (+1), so the total payoff is unchanged.
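For concreteness, here is a minimal sketch (not from the lecture) of how the CoCo values and the resulting side payments could be computed for a toy bimatrix game; the payoff matrices U1/U2 and the LP-based minimax helper are my own illustration.

```python
# Minimal sketch of CoCo values and side payments for a 2-player bimatrix game.
# The payoff matrices U1, U2 below are a made-up toy example, not from the course.
import numpy as np
from scipy.optimize import linprog

def minimax_value(M):
    """Value of the zero-sum game M for the row player (mixed strategies, via LP)."""
    m, n = M.shape
    # Variables: x_0..x_{m-1} (row strategy) and v (game value). Maximize v.
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])     # v - x^T M[:, j] <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), [[0.0]]])  # strategy probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1]

U1 = np.array([[3.0, 1.0], [0.0, 2.0]])   # row player's payoffs
U2 = np.array([[2.0, 0.0], [1.0, 3.0]])   # column player's payoffs

coop = (U1 + U2) / 2                       # cooperative (team) part
comp = (U1 - U2) / 2                       # competitive (zero-sum) part
coco1 = coop.max() + minimax_value(comp)   # row player's CoCo value
coco2 = coop.max() - minimax_value(comp)   # column player's CoCo value

a_star = np.unravel_index((U1 + U2).argmax(), U1.shape)  # jointly optimal action
p1 = coco1 - U1[a_star]   # side payment received by the row player
p2 = coco2 - U2[a_star]   # side payment received by the column player
print(coco1, coco2, p1, p2)   # p1 == -p2: the payments are equal and opposite
```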
True or False: “Sub-game perfect” means that every stage of a multistage game has a Nash equilibrium. Why?
True. A subgame-perfect equilibrium induces a Nash equilibrium in every subgame; equivalently, it satisfies the one-shot deviation principle, which states that no player can increase their payoff by deviating from the strategy at a single stage and then following it thereafter.
True or False: If following the repeated game strategy “Pavlov”, we will cooperate until our opponent defects. Once an opponent defects we defect forever. Why?
False. This is grim trigger.
In Pavlov (win-stay, lose-shift), once the opponent defects we defect until the opponent also defects; once both players' last actions match, we return to cooperation.
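As a quick illustration of the difference, here is a small sketch (my own conventions: "C"/"D" strings and a hand-picked opponent sequence) contrasting grim trigger with Pavlov.

```python
# Grim trigger vs. Pavlov (win-stay, lose-shift) in iterated prisoner's dilemma.

def grim_trigger(my_history, opp_history):
    # Cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in opp_history else "C"

def pavlov(my_history, opp_history):
    # Cooperate on the first round and whenever both players chose the same
    # action last round; defect when the last actions disagreed.
    if not my_history:
        return "C"
    return "C" if my_history[-1] == opp_history[-1] else "D"

# Example: the opponent defects on rounds 2 and 3, then cooperates again.
opp_moves = ["C", "D", "D", "C", "C"]
for strategy in (grim_trigger, pavlov):
    mine, theirs = [], []
    for opp_move in opp_moves:
        mine.append(strategy(mine, theirs))
        theirs.append(opp_move)
    print(strategy.__name__, mine)
# grim_trigger: ['C', 'C', 'D', 'D', 'D']  -- locked into defection
# pavlov:       ['C', 'C', 'D', 'C', 'C']  -- returns to cooperation once both defected
```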
True or False: The “folk theorem” states that the notion of threats can stabilize payoff profiles in one-shot games. Why?
False. The folk theorem describes the set of payoff profiles that can result from Nash equilibrium strategies in a repeated game; threats only carry weight when there are future rounds in which to carry them out, so it does not apply to a one-shot game.
True or False: An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information. Why?
False. In a two-player, zero-sum, finite, deterministic game with perfect information, minimax equals maximin and an optimal pure strategy always exists (it can be found by backward induction). Mixed strategies become necessary only when information is hidden, as in matching pennies or rock-paper-scissors.
True or False: Nash equilibria can only be pure, not mixed. Why?
False. Nash equilibria can be mixed as well as pure; in fact, Nash's theorem guarantees that every finite game has at least one equilibrium, possibly in mixed strategies (rock-paper-scissors, for example, has only a mixed equilibrium).
True or False: The objective of the dual LP presented in lecture is minimization of “policy flow” (the minimization is because we are aiming to find an upper bound on “policy flow”). Why?
False. The objective of the dual LP is to maximize the reward carried by the policy flow, subject to constraints that conserve the flow through each state (it is the primal LP that minimizes, in order to get a tight upper bound on the values).
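As a hedged illustration, here is a sketch of that dual LP for a tiny invented MDP; the transition and reward arrays are made up, and the start distribution c0 plays the role of where policy flow is injected. scipy's linprog minimizes, so the objective is negated to maximize reward-weighted flow.

```python
# Dual "policy flow" LP for a made-up 2-state / 2-action MDP.
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_s, n_a = 2, 2
T = np.zeros((n_s, n_a, n_s))
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]   # transition probabilities T[s, a, s']
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.7, 0.3]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                    # immediate rewards R[s, a]
c0 = np.full(n_s, 1.0 / n_s)                  # where flow is injected (start distribution)

# Variables: q[s, a] flattened; maximize sum_{s,a} q[s,a] * R[s,a].
obj = -R.flatten()
# Flow conservation: sum_a q[s',a] - gamma * sum_{s,a} T[s,a,s'] q[s,a] = c0[s'] for every s'.
A_eq = np.zeros((n_s, n_s * n_a))
for sp in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[sp, s * n_a + a] = (1.0 if s == sp else 0.0) - gamma * T[s, a, sp]
res = linprog(obj, A_eq=A_eq, b_eq=c0, bounds=[(0, None)] * (n_s * n_a))

q = res.x.reshape(n_s, n_a)
policy = q.argmax(axis=1)   # the optimal policy puts all of a state's flow on one action
print(q, policy)
```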
True or False: LP is the only way to efficiently solve an MDP for its optimal policy. Why?
False.
Many other methods are guaranteed to converge to an optimal policy, e.g., value iteration and policy iteration (dynamic programming), as well as model-free methods such as Q-learning and SARSA (under suitable exploration and learning-rate conditions).
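For example, here is a short value-iteration sketch (one of those dynamic-programming alternatives), again using the same kind of invented toy MDP arrays as above.

```python
# Value iteration on a made-up 2-state / 2-action MDP.
import numpy as np

gamma = 0.9
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])     # T[s, a, s'], invented for illustration
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                   # R[s, a]

V = np.zeros(T.shape[0])
for _ in range(1000):
    Q = R + gamma * T @ V                    # Q[s, a] = R[s, a] + gamma * sum_s' T[s,a,s'] V[s']
    V_new = Q.max(axis=1)                    # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:     # stop once the values have converged
        V = V_new
        break
    V = V_new

policy = (R + gamma * T @ V).argmax(axis=1)  # greedy policy with respect to V*
print(V, policy)
```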
True or False: MDPs are a type of Markov game. Why?
True. An MDP can be viewed as a single-agent Markov game; equivalently, it is a Markov game in which only one agent's actions affect the rewards and the transition function.
Why does the “folk theorem” help us to solve prisoner’s dilemma (i.e. so that people will not always defect to each other)?
The folk theorem says that, in a repeated game, any feasible payoff profile that gives each player at least their minimax (security) value can be sustained as a Nash equilibrium by the threat of punishment, so mutual cooperation becomes an equilibrium outcome rather than guaranteed mutual defection. Consider the Pavlov machine: it has been shown to be subgame perfect, unlike tit-for-tat, because from any pair of decisions at a given time play eventually leads back to collective cooperation.
Let's say we are doing a repeated version of rock-paper-scissors and we want to maximize the usual average payoff. What is the minimax profile, as defined in the Game Theory Reloaded lecture, of this repeated rock-paper-scissors game?
Do we assume symmetric payoffs, or is there an unbalanced payoff structure similar to HW6? For symmetric payoffs I think the minimax profile is just (0, 0), with both agents playing each action with probability (1/3, 1/3, 1/3).
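A quick numerical check of that claim, under the usual +1 / 0 / -1 payoff convention (my assumption), shows the uniform strategy secures an expected payoff of 0 against every pure response:

```python
import numpy as np

# Row player's payoffs: rows/columns ordered rock, paper, scissors.
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
x = np.array([1/3, 1/3, 1/3])        # uniform mixed strategy

print(x @ M)                          # expected payoff vs. each pure response: [0. 0. 0.]
# The worst case is therefore 0; by symmetry the same holds for the opponent,
# so the minimax (security-level) profile of the repeated game is (0, 0).
```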
True or False: KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems. Why?
False. For example, the class of singleton functions over n-bit inputs can be learned in the mistake-bound (MB) model with a mistake bound of only 1 (always predict 0 until the single positive example appears), while in the worst case a KWIK learner must output "don't know" on up to 2^n - 1 inputs, since it cannot commit to an answer until it has seen its first positive example.
What if instead of being given “fight_occurred” you are given “did peacemaker come?” and “did instigator come?” Can you create a KWIK algorithm that outputs fewer “don’t knows” than HW5?
If we are given the two answers “did peacemaker come?” and “did instigator come?”, I think we can apply Algorithm 6 from the KWIK paper. Since each hypothesis class has size n, the combined KWIK learner has a worst-case bound of (1 - k) + sum_i |H_i| = (1 - 2) + 2n = 2n - 1.
KWIK is a learning framework like PAC (probably approximately correct) learning. Why do we even need a learning framework like PAC or KWIK?
A learning framework gives formal guarantees on how much data, or how many mistakes or "don't knows", a learner needs before it performs well. In RL specifically, KWIK-style guarantees improve the efficiency of exploration: the agent knows exactly which inputs it cannot yet predict, and therefore where it still needs to explore.
What is the KWIK framework? What are the strengths of this framework and its algorithms?
Knows What It Knows. It is designed particularly for learning settings where active exploration can impact the training examples the learner is exposed to (e.g., reinforcement learning and active learning). It can also be used for anomaly detection, where the learner must reason about whether a recently presented input is predictable from previous examples.
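To make the "don't know" behavior concrete, here is a tiny enumeration-style sketch in the spirit of the KWIK paper's version-space algorithm; the hypothesis class (singletons over four inputs) and the use of None for the ⊥ symbol are my own choices.

```python
# Enumeration-style KWIK learner: predict only when every hypothesis that is still
# consistent with the observed labels agrees; otherwise say "I don't know" (None).

class EnumerationKWIK:
    def __init__(self, hypotheses):
        self.hypotheses = list(hypotheses)    # the version space of candidate functions

    def predict(self, x):
        answers = {h(x) for h in self.hypotheses}
        return answers.pop() if len(answers) == 1 else None   # None plays the role of ⊥

    def observe(self, x, y):
        # After a ⊥, the true label is revealed; drop every inconsistent hypothesis.
        self.hypotheses = [h for h in self.hypotheses if h(x) == y]

# Toy singleton class over 4 inputs: h_i(x) = 1 iff x == i.
learner = EnumerationKWIK([lambda x, i=i: int(x == i) for i in range(4)])
for x, y in [(0, 0), (1, 0), (2, 1), (3, 0)]:
    guess = learner.predict(x)
    if guess is None:                         # "don't know": ask for the label and learn
        learner.observe(x, y)
    print(x, guess)
```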