Final Exam (RL & Game Theory) Flashcards
What are the assumptions of “Sequence of Rewards”?
- Infinite Horizons: the agent has an unlimited amount of time (the process never has to end). This is what makes stationary policies possible; if the remaining time mattered, the policy would change with however much time was left.
- Utility of Sequences: stationary preferences, i.e., if U(s0, s1, s2, …) > U(s0, s1', s2', …) then U(s1, s2, …) > U(s1', s2', …). This leads to discounting, which turns an infinite sum of rewards into a finite value.
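A sketch of the discounted sum this implies, in standard notation (gamma is the discount factor; R_max is an assumed bound on rewards):

U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max} = \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1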
Bellman Equation: What is the true utility?
The immediate reward at state S plus the discounted sum of all the rewards the agent will collect from S onward (acting optimally thereafter).
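As a sketch in the usual notation (T is the transition model, gamma the discount factor):

U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') \, U(s')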
Does the Bellman equation, as an algorithm, converge? (As an algorithm, Bellman -> Value Iteration.)
Yes. You can start with arbitrary utilities for all the states; each iteration folds in more truth through the reward term, so the estimates keep improving. You will eventually converge, because you only ever add "more" truth, and since gamma < 1 the updates overwhelm the incorrect initialization.
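A minimal value-iteration sketch of that update loop; the tiny MDP below (rewards, transitions, gamma) is made up purely for illustration:

```python
# Minimal value iteration on a tiny, made-up MDP.
# States are 0..2, actions are 0..1; T[s][a] is a list of (probability, next_state).
GAMMA = 0.9
R = [0.0, 0.0, 1.0]                      # immediate reward for each state
T = [
    [[(1.0, 1)], [(1.0, 0)]],            # transitions out of state 0
    [[(0.8, 2), (0.2, 0)], [(1.0, 1)]],  # transitions out of state 1
    [[(1.0, 2)], [(1.0, 2)]],            # state 2 is absorbing
]

U = [0.0, 0.0, 0.0]                      # arbitrary initialization
for _ in range(1000):
    U = [R[s] + GAMMA * max(sum(p * U[s2] for p, s2 in T[s][a]) for a in range(2))
         for s in range(3)]
print(U)  # converges because gamma < 1 makes each update a contraction
```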
What is the difference between utilities and policies?
Think of a policy like classification, where only the ordering of actions matters (which one comes out on top), and the utilities/utility function like regression, where the exact numeric values matter.
How is Q different from U (utility of a state)?
Q, unlike U, defines value as the value of arriving in some state S and performing an action A to leave S (proceeding optimally thereafter). U is only the value of arriving in state S.
U(s) = max_a Q(s, a)
pi(s) = argmax_a Q(s, a)
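For reference, a sketch of Q written in the same Bellman form (T is the transition model, gamma the discount):

Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')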
Does Q-Learning Converge? If so, how and to what does it converge?
Yes.
Using the learning rate alpha, it learns incrementally. If every state-action pair is updated infinitely often, the estimates converge to the expected values, provided the alphas sum to infinity while the sum of alpha^2 is finite.
Q-learning then converges to the solution of the Bellman equation, i.e., the optimal Q values.
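A minimal sketch of that incremental update; the Q table, constants, and the example transition at the end are assumptions for illustration only:

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate; in theory alpha_t should satisfy sum(alpha) = inf, sum(alpha^2) < inf
GAMMA = 0.9

Q = defaultdict(float)   # Q[(state, action)], initialized to 0

def q_update(s, a, r, s_next, actions):
    """One incremental Q-learning step toward the target r + gamma * max_a' Q(s', a')."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Example (made-up) transition: in state 0, action 1 gave reward 2.0 and led to state 3.
q_update(s=0, a=1, r=2.0, s_next=3, actions=[0, 1])
```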
How can Q-learning relate to optimization?
Choosing which actions to take during learning is itself an optimization problem.
- Simulated Annealing -> the exploration vs. exploitation tradeoff: taking a random action (explore) vs. taking the best known action, argmax_a Q(s,a) (exploit).
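One common way to realize that tradeoff is an epsilon-greedy rule; a sketch (the epsilon value and Q table shape are assumptions):

```python
import random

EPSILON = 0.1   # probability of exploring with a random action

def choose_action(Q, s, actions, epsilon=EPSILON):
    """Epsilon-greedy: usually exploit argmax_a Q(s,a), occasionally explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit
```

Decaying epsilon over time plays the same role as the cooling schedule in simulated annealing: lots of exploration early, mostly exploitation later.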
Define each part of the following: “2-player zero-sum finite deterministic game of perfect information”.
2-player: simply two agents playing against each other, with the assumption that each will attempt to MAXIMIZE its own reward.
Zero-Sum: if A wins 5, then B gets -5. That is, the rewards of A and B always sum to some constant value.
Finite: there is an end to the game (and the game is only played once).
Deterministic: There is no chance/probability with a player’s action/transition.
Perfect Information: both players know exactly what state they are in and what state the other player is in.
The Fundamental Result holds for what types of games? What does it state?
It holds for 2-player zero-sum games of perfect information (deterministic or probabilistic, it still works).
The Fundamental Result states that minimax (A goes first) == maximin (B goes first): both orderings give the same value of the game, and there will ALWAYS be an optimal pure strategy for each player.
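A quick check of the result on a made-up payoff matrix for player A (rows are A's moves, columns are B's):

```python
# Made-up zero-sum payoffs for A; this matrix happens to have a pure-strategy saddle point.
M = [
    [3, 5],
    [1, 8],
]

minimax = max(min(row) for row in M)                        # A commits first, B responds
maximin = min(max(M[i][j] for i in range(len(M)))           # B commits first, A responds
              for j in range(len(M[0])))
print(minimax, maximin)  # both are 3: the order of play does not matter
```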
What happens when a game no longer has perfect information?
Minimax is no longer guaranteed to equal maximin. Mixed strategies are introduced, which assign a probability P to each action a player can take.
How do you find the value of the game with mixed strategies?
Suppose you play one of your actions with probability P. For each possible action of the opponent, write your expected payoff as a line in P and plot it. The value of the game is the maximum of the lower envelope (the min) of those lines.
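A sketch of that calculation on a made-up 2x2 zero-sum matrix with no saddle point, where the two lines cross at the optimal P:

```python
# Made-up 2x2 zero-sum payoffs for player A; there is no pure-strategy saddle point here.
M = [
    [ 2, -1],
    [-1,  1],
]

# If A plays row 0 with probability p, the expected payoff against each of B's columns
# is a line in p. The value of the game is the top of the lower envelope of those lines,
# which for 2x2 is where the two lines intersect:
#   p*M[0][0] + (1-p)*M[1][0] = p*M[0][1] + (1-p)*M[1][1]
p = (M[1][1] - M[1][0]) / (M[0][0] - M[1][0] - M[0][1] + M[1][1])
value = p * M[0][0] + (1 - p) * M[1][0]
print(p, value)  # p = 0.4, value = 0.2 for these made-up numbers
```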
What is strict dominance?
For Player A (whose actions are the rows), a row is strictly dominant if it is preferred (strictly better) for every column, that is, for every action of the opponent. Likewise, for Player B (whose actions are the columns), a column is strictly dominant if it is preferred for every row, that is, for every action of the opponent.
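A small sketch of checking dominance between two rows of a made-up payoff matrix for Player A:

```python
# Made-up payoffs for Player A: rows are A's actions, columns are B's.
M = [
    [4, 3, 6],
    [2, 1, 5],
]

def strictly_dominates(M, i, k):
    """True if row i strictly dominates row k: strictly better against every opponent column."""
    return all(M[i][j] > M[k][j] for j in range(len(M[0])))

print(strictly_dominates(M, 0, 1))  # True: row 0 beats row 1 in every column
```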
In layman's terms, explain Nash equilibrium.
For n players, once they have reached a Nash Equilibrium, no player will ever want to change their strategy, even if they knew beforehand what the other players were going to do. (Works for pure and mixed strategies.)
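A small sketch of checking the "no one wants to switch" condition for a pure-strategy profile; the two payoff matrices below are made up:

```python
# Made-up 2-player payoffs: A[i][j] is player 1's payoff and B[i][j] is player 2's
# when player 1 plays row i and player 2 plays column j.
A = [[3, 0],
     [5, 1]]
B = [[3, 5],
     [0, 1]]

def is_pure_nash(A, B, i, j):
    """No player can gain by unilaterally deviating from (row i, column j)."""
    best_row = all(A[i][j] >= A[k][j] for k in range(len(A)))      # player 1 cannot do better
    best_col = all(B[i][j] >= B[i][l] for l in range(len(B[0])))   # player 2 cannot do better
    return best_row and best_col

print(is_pure_nash(A, B, 1, 1))  # True: neither player gains by switching
print(is_pure_nash(A, B, 0, 0))  # False: each player would rather switch
```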
What are the 3 fundamental theorems about Nash Equilibria?
- In an n-player pure-strategy game, if iterated elimination of strictly dominated strategies eliminates all but one strategy (for each player), that remaining strategy profile is the unique Nash Equilibrium.
- Any Nash Equilibrium will survive the iterated elimination of strictly dominated strategies (consistent with the first theorem).
- If the number of players is finite and the game is finite (each player has a finite number of strategies), then there exists at least one Nash Equilibrium, possibly in mixed strategies (note that there could be more than one).
Aside: if a game is repeated n times, the solution is the Nash equilibrium repeated n times; no player will ever want to change.
What is mechanism design?
Figuring out how to change/alter the utility values of the game itself to try to invoke a certain behavior from a person/player.