Final Exam Flashcards

1
Q

What is an MDP?

A

A Markov decision process: a framework for sequential decision-making in which an agent takes actions in states, receives rewards, and transitions to next states, with dynamics that satisfy the Markov property.

2
Q

What are the components of an MDP?

A

States, actions, a transition model (probabilities over next states), rewards, and typically a discount factor: (S, A, T, R, γ).

3
Q

Is it possible to have multiple terminal states in an MDP?

A

Yes

4
Q

What is the Markov property?

A

The next state depends only on the current state (and action). Everything the agent needs to know about the environment is encoded in the current state; there is no need to access data from past steps.

5
Q

What is Value Iteration?

A

Using the Bellman optimality equation to iteratively update the value of each state until the values converge; the optimal policy is then extracted greedily from the converged values (unlike policy iteration, there is no separate policy-evaluation/policy-improvement loop).
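
A minimal sketch of tabular value iteration in Python, for illustration only — it assumes a hypothetical model format P[s][a] = list of (prob, next_state, reward) tuples, which is not specified anywhere in these cards:

    import numpy as np

    def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
        # P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                # Bellman optimality backup for state s
                q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                     for a in range(n_actions)]
                best = max(q)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:  # stop once values change very little
                break
        # The (near-)optimal policy is extracted greedily from the converged values.
        policy = [max(range(n_actions),
                      key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
                  for s in range(n_states)]
        return V, policy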

6
Q

What is a policy?

A

A policy is a mapping from states to actions for an MDP.

7
Q

What is the meaning of the Bellman equations? What do they do?

A

“(the Bellman equation) expresses a relationship between the value of a state and the values of its successor states.” S&B 2020

OR

An update rule for the value of a state based on the observed reward and the estimated value of the next state.
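
For reference, the Bellman equation for the state-value function of a policy π can be written as

    v_π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ v_π(s') ]

which is exactly the relationship between a state's value and the values of its successor states described above.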

8
Q

Is it possible that value iteration never converges on your formulation of MDP with a proper convergence criterion?

A

No. “(Policy and Value Iteration)… All of these algorithms converge to an optimal policy for discounted finite MDPs.” S&B 2020. With a proper convergence criterion (e.g. stopping once the maximum value change falls below a threshold), value iteration always terminates.

9
Q

Does policy iteration or value iteration converge faster? Give an explanation of your answer.

A

Both algorithms are guaranteed to converge to an optimal policy. Policy iteration typically converges within fewer iterations, because each iteration fully evaluates the current policy and then improves it, taking larger steps in policy space; each iteration is more expensive, but in practice policy iteration is generally reported to conclude faster than value iteration.

10
Q

Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks, and describe the type of state that allows you to unify them. (Hint: you may find section 3.4 of S&B helpful.)

A

Episodic - sequences of states that eventually reach a terminal state and then restart; for example, a card game is played to the end and then restarts with a shuffle/deal. Episodes are therefore finite.

Continuing - sequences of states with no definable end, like predicting the weather: the days continue with new weather each time, and you never reach a last day and restart the episode.

You can unify them by making the last state of an episodic task an absorbing sink that transitions to itself with probability 1.0 and reward 0, so an episode can be treated as an infinite sequence.

We use the discount factor gamma to keep returns finite in continuing tasks; discounting does not make a task episodic, since you never reset to the original state.

11
Q

True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?

A

True. The Markov property says the agent only needs the information in the current state. However, past experiences/observations can themselves be encoded into the current state, so nothing essential is necessarily forgotten.

12
Q

True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?

A

False. Whether a move matters depends on the state it leads to, not on how recently it was made; the Markov property means an action's effect is captured entirely by the resulting state, so there is no inherent recency weighting of moves on outcomes.

13
Q

True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?

A

True. Adding the same constant to every reward does not change the relative ordering of policies, so the optimal policy stays the same. This is not considered reward shaping.

14
Q

True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?

A

True. Under a fixed policy the action in each state is determined (or drawn from a fixed distribution), so the state-to-state transition probabilities are fixed; the process reduces to a Markov chain with rewards attached to transitions (a Markov reward process).

15
Q

True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?

A

False. We can always make the final state absorbing: it transitions back to itself with probability 1.0 and reward 0, so the process can run forever and the finite-horizon MDP becomes an infinite-horizon one.

16
Q

True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?

A

False. V(s) = max_a Q(s,a): V(s) is the value of the action that maximizes Q(s,a), not the action itself. If we already have the Q-values for every action in a given state, V(s) is simply the largest number in that set. No transition matrix is needed.

17
Q

True or False: The optimal policy for any MDP can be found in polynomial time. Why?

A

True - “A DP (Dynamic Programming) method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is k^n” S&B 2020

18
Q

True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?

A

False. A policy that acts greedily with respect to the optimal value function is an optimal policy by definition.

19
Q

What is TD? In which algorithms is TD used?

A

Temporal Difference learning. Classical TD includes TD(0) (one-step), TD(1) (equivalent to Monte Carlo), and TD(λ) (a weighted combination of look-aheads). The one-step TD update is the basis of bootstrapping RL algorithms such as Q-learning, SARSA, DQN, and (via its bootstrapped value estimates) PPO.
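
For reference, the one-step TD(0) update for state values is

    V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]

where the bracketed term is the TD error δ_t.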

20
Q

What are model-free methods?

A

Methods that do not use transition probabilities and reward functions

“Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning” - S&B 2020

21
Q

What is the prediction problem?

A

Estimating/Predicting the value of each state.

As opposed to the Control Problem, in which we can take actions to impact the sequence of states.

22
Q

What is a n-step estimator?

A

An estimator of a state's value that uses the actual rewards received over the next n time steps and then bootstraps from the estimated value of the state reached after n steps.

23
Q

What is a Monte-Carlo return?

A

The return of a full episode: the (discounted) sum of all rewards from the current state until the episode terminates.

24
Q

What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?

A

0, because there are no future rewards for being in the terminal state.

25
Q

What’s the formula to view TD(λ) as a function of n-step estimators?

A
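
The λ-return expresses TD(λ) as a weighted average of the n-step returns G_t^(n), with geometrically decaying weights:

    G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^(n−1) G_t^(n)

For λ = 0 this reduces to the one-step return, and for λ = 1 the weight collapses onto the full (Monte Carlo) return.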
26
Q

What can we say about the bias-variance trade-off between a 1-step estimator versus a Monte Carlo estimator (in other words, which estimator has a higher variance and which estimator has a higher bias?)

A

The 1-step estimator has much higher bias, because it bootstraps from a (possibly wrong) estimate of the next state's value, whereas the Monte Carlo estimator has much higher variance, because its return depends on many random rewards and transitions over the whole episode.

27
Q

What are some advantages of TD methods compared to dynamic programming methods? to Monte Carlo methods?

A

With TD you do not necessarily need a model, whereas with dynamic programming you do. Compared with Monte Carlo, TD reduces the variance of the return estimate (at the cost of some bias) and can learn online, before an episode finishes.

In short: dynamic programming methods need a model, Monte Carlo methods have high variance, and TD methods trade some bias for lower variance.

28
Q

True or False: In TD learning, the sum of the learning rates used must converge for the value function to converge. Why?

A

False. The sum of the learning rates must not converge; it must diverge (go to infinity). The sum of the squares of the learning rates, however, must converge.
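
These are the standard stochastic-approximation conditions on the learning rates α_t:

    Σ_t α_t = ∞   and   Σ_t α_t² < ∞

For example, α_t = 1/t satisfies both, while a constant learning rate violates the second condition.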

29
Q

True or False: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks. Why?

A

MC is an unbiased estimator, but it also has high variance. TD methods are often preferred because they reduce variance and converge toward the maximum-likelihood estimate of the underlying MDP. TD methods can also trade off bias against variance through TD(λ).

30
Q

True or False: Backward and forward TD(λ) can be applied to the same problems. Why?

A

Technically True, but backward methods offer a convenient way of learning online when experience cannot be repeated or is expensive to generate. If episode data is cheap or stored, then using a backward view (with eligibility traces) does not have a major benefit.

31
Q

True or False: Offline algorithms are generally superior to online algorithms. Why?

A

False, there are cases where either will do better. E.g. online algorithms can better deal with non-stationary tasks.

32
Q

True or False: Given a model (T,R) we can also sample in, we should first try TD learning. Why?

A

False. If we are simply sampling (transition, reward) pairs, there is no temporally ordered sequence of states to which TD methods apply; TD requires an ordered sequence of (S, R). If there is no temporal order in the samples, a TD method makes little sense, and an MC-style estimate would be better.

33
Q

True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?

A

False. TD(1) is MC. Assuming updates are applied at the end of each episode, it performs worse under repeated presentations: the return is apportioned to each state in proportion to how often it is visited, so TD(1) fits the particular training sequences seen rather than converging to the maximum-likelihood estimate, and repeated presentations do not fix this.

34
Q

True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?

A

False, empirically we see different curves. The best value of lambda will vary across problems. Short episodes may benefit from smaller values of lambda for example.

35
Q

True or False: An update rule which is not a non-expansion will not converge without exception. Why?

A

False. Non-expansions and contractions are what give us convergence guarantees, because the distance between successive updates shrinks (or at least does not grow) as the operator is applied.

Saying a rule is not a non-expansion only means there is no guarantee that successive updates approach each other; it is not certain that they won't. Convergence may still happen for a particular update rule, so "will not converge without exception" is too strong.

36
Q

True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?

A

False. They are closely related: a non-expansion guarantees that distances do not increase under the mapping, making it the less restrictive version of a contraction, which guarantees distances strictly shrink.

37
Q

What is an eligibility trace?

A

An eligibility trace is a weight attached to each state reflecting how recently (and how often) that state “participated in producing an estimate value” (Sutton & Barto, p. 287). The eligibility trace determines how much a given state's estimate is updated by the current TD error.

It can be used to calculate TD updates more efficiently, and it is how TD(0) and TD(1) are generalized to TD(λ) methods (the backward view).
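
For reference, the backward-view TD(λ) update with accumulating traces is, for every state s at each step t:

    e_t(s) = γ λ e_{t−1}(s) + 1[S_t = s]
    V(s) ← V(s) + α δ_t e_t(s),   where δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)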

38
Q

In Sutton 88. What is “the maximum likelihood estimate of the underlying MDP” (quoted from the paper)?

A

“In the following section, we prove that in fact it is linear TD(0) that converges to what can be considered the optimal estimates for matching future experience - those consistent with the maximum-likelihood estimate of the underlying Markov process.”

In other words, it is the model obtained by taking the empirical transition probabilities and expected rewards observed in the training data; the “optimal” predictions are what you would get by solving that empirical model exactly (the certainty-equivalence estimate).

39
Q

Intuitively, why would TD(0) converge to “the maximum likelihood estimate of the underlying MDP”?

A

Intuitively, TD(0) is a low-variance but biased (bootstrapped) estimator that is driven by the observed transition statistics; under repeated presentations the bias washes out and the estimates settle to the values implied by the empirical (maximum-likelihood) model of the process.

40
Q

In Sutton 88, why does TD(0) do best in figure 3 of Prof. Sutton’s paper (you can use the intuition from the above question)?

A

Because the TD(λ > 0) estimators have higher variance, their estimates remain noisy, while TD(0) has no such variance but is biased because of bootstrapping. Under repeated presentations that bias is eliminated, so TD(0) converges with essentially zero error.

41
Q

In Sutton 88, what is P_t? What is ∇_w P_t?

A

P_t is the prediction at time t: the estimated value of the state visited at that step.

∇_w P_t is the gradient of that prediction, i.e. the vector of partial derivatives with respect to the weights; for the linear predictions used in the paper it is simply the input vector x_t.

42
Q

In Sutton 88, what is the value of ∇_w P_t in the paper if x_t = [0,0,1,0,0], in the context of the random walk example? Why?

A

∇_w P_t is x_t itself. Its role is to “turn off” the weights for all the other states: only the weight associated with the component equal to 1 (the state currently occupied) is used and updated.

43
Q

In Sutton 88, is the TD algorithm used in the paper the same as the TD we saw in class?

A

While written differently, it matches the semi-gradient TD(λ) algorithm for estimating the value function given in the book.

44
Q

In Sutton 88, experiment 1, what is the convergence criterion? What effect does it have on the TD algorithm?

A

In experiment 1, weight updates were accumulated over a whole training set and applied only after the complete presentation of the set, and each training set was presented repeatedly until the weight vector no longer changed significantly. The effect is that the results reflect the asymptotic (converged) solution of each TD(λ) procedure, essentially independent of the learning rate.

45
Q

In Sutton 88, why doesn’t TD(0) do well in figure 5?

A

TD(0) propagates information slowly: it is a single-step method, so with a single presentation it is unable to overcome its initial bias.

The TD(λ > 0) variants propagate the return further back along each sequence, so they extract more from each presentation and update faster.

Per the paper, with more (repeated) presentations this handicap would go away.

46
Q

In Sutton 88, experiment 2, what kind of figure 4 would you expect if you were to give infinite sequences instead of 10? Why?

A

It would look more like Figure 3, since the experiment would effectively be run to convergence. With only 10 sequences we are seeing what can be learned at the start of training; running infinitely, we are back in the regime of experiment 1, running to convergence.

The curves would also tend to overlap: there is only one environment, so there is really only one solution; the separate curves in Figure 4 arise because training starts from different conditions and is cut off early.

47
Q

In Sutton 88, why are all the curves in figure 4 convex?

A

Likely because of the learning rate: as alpha increases, the algorithm transitions from undershooting the true values (learning too little from the limited data, given the calculated TD error) to overshooting and oscillating around them, so the error is minimized at some intermediate alpha and each curve is convex (U-shaped).

48
Q

In HW3, we saw Sarsa is a ‘control’ algorithm? What is a ‘control’ task? Are any TD algorithms a ‘control’ algorithm?

A

“RL with actions”

A control task is one where the agent develops a policy that it uses to interact with the environment. Contrast this with a prediction task, which only learns to predict the return, V(s) or Q(s,a). TD is at heart a way of learning to predict the value function, which control methods can then use to derive a policy, e.g. by acting epsilon-greedily with respect to it; Sarsa and Q-learning are examples of TD-based control algorithms.

49
Q

What is Sarsa? What’s the input and output of this function?

A

SARSA is an on-policy, model-free control method. It learns Q using TD learning and selects actions using an epsilon-greedy (or other soft) policy. The input to each update is the experience tuple (S, A, R, S', A'); the output is the updated Q-function, from which subsequent actions are selected.

50
Q

Why is Sarsa on policy? How does this compare to an off-policy method like Q-learning by following a random policy to generate data?

A

SARSA is on-policy: the action A' used in the update is the one actually chosen by the current (e.g. epsilon-greedy) policy, and it is also the action followed at the next step.

SARSA selects the update value from the Q-table by choosing Q(S', A') with A' = POLICY(S'), and then actually takes (S', A') in the next iteration.

Q-learning instead selects the update value by taking the maximum of Q(S', a) over all possible actions, ignoring what the current policy would select; it then follows (S', POLICY(S')), which may not match the action used in the update. Because the update target does not depend on the behavior policy, Q-learning can even learn from data generated by a random policy.
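
A minimal sketch (not the assignment's code) contrasting the two update targets, assuming a tabular Q stored as a NumPy array and a hypothetical epsilon_greedy helper; rng would be something like np.random.default_rng():

    import numpy as np

    def epsilon_greedy(Q, s, eps, rng):
        # Behave epsilon-greedily with respect to the current Q estimates.
        if rng.random() < eps:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
        # On-policy target: uses the action a2 actually selected at s2 by the policy.
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

    def q_learning_update(Q, s, a, r, s2, alpha, gamma):
        # Off-policy target: uses the greedy value at s2, regardless of what is done next.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])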

51
Q

In the assignment document, it says Sarsa uses TD to do control. What exactly does it mean? Which part of Sarsa is TD?

A

The value function estimation in SARSA is done with TD. The value function is used to do control by selecting actions (epsilon) greedily.

52
Q

One perspective of Sarsa is that it looks a bit like policy iteration? Can you tell us which part of Sarsa is policy evaluation and which part is policy improvement?

A

SARSA performs generalised policy iteration: evaluate the current policy, improve it using the newly observed reward, and repeat. The TD update of Q is the policy-evaluation step, and the (epsilon-)greedy action selection with respect to the updated Q is the policy-improvement step.

SARSA looks a lot like policy iteration, except that evaluation and improvement are interleaved after every step rather than running evaluation to completion before each improvement.

53
Q

True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP. Why?

A

True, provided the shaping is potential-based.

Potential-based reward shaping does not change which policies are optimal, so the optimal policy found is still optimal for the original MDP; shaping may just help find it faster. Arbitrary (non-potential-based) shaping rewards can change the optimal policy, e.g. by creating reward loops the agent exploits.

54
Q

True or False: Potential-based shaping will always find an optimal policy faster than an unshaped MDP. Why?

A

False

Potential-based shaping preserves the optimal policy, but a poorly chosen potential function can mislead exploration and slow learning, so it is not guaranteed to be faster than the unshaped MDP.

55
Q

True or False: Rmax will always find the optimal policy for a properly tuned learning function. Why?

A

False. Rmax is not guaranteed to find the optimal policy, but it can obtain near-optimal results.

If the estimates of the MDP are sufficiently close to the original MDP, then the learned policy will be near-optimal, not necessarily optimal. Rmax drives exploration of unknown states, which generates the data used to estimate their values (via Hoeffding bounds) over time. If the parameters defining when “enough knowledge is enough” make the estimated MDP tolerably close to the original, then Rmax with a properly tuned learning function yields a near-optimal policy.

56
Q

True or False: Since Sarsa is an “online” algorithm, it will never learn the optimal policy for an environment. Why?

A

False. SARSA can learn the optimal policy; being “online” does not prevent this. With insufficient exploration it can get stuck near local optima, but with appropriate exploration and learning-rate decay it converges to the optimal policy.

57
Q

True or False: Policy shaping requires a completely correct oracle to give the RL agent advice. Why?

A

False. The Agent can assign a non-100% confidence in the correctness of the advice given. The Agent can also continue to learn from its own experiences

58
Q

Why can you find an optimal policy despite finding sub-optimal Q-values (e.g. via Q-learning)?

A

The policy (selecting the best action) only needs the Q-values to be ordered correctly: as long as the sub-optimal Q-values still rank the truly best action highest in each state, the greedy policy is already optimal, even though the Q-values themselves have farther to go to reach their optimal values.

59
Q

Is it possible that Q-learning doesn’t converge? Is it possible that the point it converges to is not optimal?

A

Q-learning converges to the optimal (and correctly valued) Q-function under two conditions:
- The learning rates satisfy the stochastic-approximation conditions: their sum diverges while the sum of their squares converges (e.g. alpha_t = 1/t; a geometric sequence like 1/2, 1/4, 1/8, … does not qualify, since its sum converges).
- Every state-action pair is visited infinitely often.

If those conditions are not met (for example, when a function approximator is used), convergence is not guaranteed.

https://stackoverflow.com/questions/59709726/criteria-for-convergence-in-q-learning

(Practically, the state-space may be too large for Q-learning to be computationally tenable in most real-world problems)

60
Q

What are the benefits of an off-policy algorithm?

A

An off-policy algorithm can produce accurate state-value estimates independent of how good its policy is

A benefit of this is that it may do more exploration and be less likely to get stuck in local-optima

https://datascience.stackexchange.com/questions/13029/what-are-the-advantages-disadvantages-of-off-policy-rl-vs-on-policy-rl

61
Q

True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Why?

A

False

Exploration is required to learn the reward distributions, and therefore which arms have the highest rewards; too much exploration wastes time on poor choices.

Exploitation is required to receive rewards and benefit from what has been learned so far; too much exploitation risks missing better choices, or learning them too slowly.

62
Q

Observe that the biggest difference between P2’s Lunar Lander problem and HW4’s Taxi problem is that there are infinitely many states in Lunar Lander. What are some good methods to handle this case? What are their pros and cons?

A

The continuous states of the Lunar Lander problem can be dealt with using function approximation to approximate the Q-function.

One pro of this technique is that it allows us to generalize states we’ve seen to similar states that have not been encountered before.

A con of this technique is that the Q-learner is no longer guaranteed to converge when using a function approximator.

63
Q

We learned about reward shaping in class. Could it be useful for solving Lunar Lander? If so, why and how?

A

Yes. Reward shaping can help guide the policy search by encouraging the agent to take actions that follow a desirable trajectory (e.g. rewarding progress toward a soft landing). To preserve the original optimal policy, the shaping reward should be potential-based.

64
Q

Let’s say you want to use a function approximator like we learned in class. What function(s) are you approximating? What’s the input of that function and what’s the output of that function?

A

In general, we can use function approximation for Value, Q, Policy, and Transition/Reward functions.

The input can be either a state-action pair or just the state. With a state-action input the output is the value of that pair; with a state-only input the output is either the value of that state or a vector of values, one per action.
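
A minimal sketch of the last option (state in, one value per action out), using a linear approximator over a hypothetical feature map phi; all names and shapes here are assumptions for illustration:

    import numpy as np

    def phi(state):
        # Hypothetical feature map: the raw state vector plus a bias term.
        return np.append(np.asarray(state, dtype=float), 1.0)

    class LinearQ:
        def __init__(self, n_features, n_actions):
            # One weight vector per action; Q(s, a) = W[a] . phi(s)
            self.W = np.zeros((n_actions, n_features))

        def q_values(self, state):
            # Vector of Q-values, one per action, for the given state.
            return self.W @ phi(state)

        def update(self, state, action, target, alpha):
            # Semi-gradient step toward a TD target for the chosen action.
            x = phi(state)
            td_error = target - self.W[action] @ x
            self.W[action] += alpha * td_error * x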

65
Q

Let’s say you want to use Q-learning with some function approximator. Recall that we learned a convergence theorem and we used that to conclude that Q-learning converges. Can we apply that theorem to prove that your Q-learning with some function approximator converges? Why or why not?

A

We cannot. The theorem assumes a tabular update (a contraction on the Q-table) and that every state-action pair is visited infinitely often. With a function approximator the update is no longer guaranteed to be a non-expansion, and with a continuous state space we cannot visit every state-action pair, so the theorem's conditions do not hold.

66
Q

True or False: RL with linear function approximation will not work on environments having a continuous state space. Why?

A

False. With linear function approximation we do not need one value per state: the value is a linear function of a feature representation of the (continuous) state, so infinitely many states share a finite set of weights. (Alternatively, one can discretize the continuous space into a finite set of states.)

67
Q

True or False: With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal. Why?

A

False. Convergence is not guaranteed in general: Baird's star counterexample shows that value estimates can diverge when function approximation is combined with off-policy updates.

68
Q

True or False: Applying generalization with an “averager” on an MDP results in another MDP. Why?

A

True. An “averager” approximates a value as a fixed convex combination of the values at anchor points; applying it to an MDP yields another (derived) MDP over those points, so MDP convergence results still apply.

69
Q

Problems that can be represented as POMDPs cannot be represented as MDPs.

A

False. You can convert a POMDP over some state space into a (typically much larger, continuous) MDP over the corresponding belief space.

70
Q

We can represent any POMDP as a PSR, however some classes of POMDPs will be too large to deal with as PSRs.

A

False. Any POMDP with n underlying states can be represented by a linear PSR with at most n core tests, so the PSR representation is never larger than the POMDP itself.

71
Q

What is the KWIK framework? What are the strengths of this framework and its algorithms?

A

It is the “knows what it knows” framework. A KWIK learner maps inputs X to outputs Y online, but is allowed to answer “I don't know” instead of predicting. Its guarantee is that it never makes a (significantly) wrong prediction: when unsure, it must say “I don't know”, and the total number of “I don't know” answers must be bounded (polynomial in the problem size). This gives strong guarantees even when the inputs are chosen adversarially.
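
A minimal sketch of the generic enumeration-style KWIK learner (a sketch, not the HW5 solution), assuming a finite hypothesis class given as a list of callables:

    def kwik_predict(hypotheses, x):
        # Predict only when every surviving hypothesis agrees; otherwise admit ignorance.
        predictions = {h(x) for h in hypotheses}
        return predictions.pop() if len(predictions) == 1 else "don't know"

    def kwik_update(hypotheses, x, y):
        # After seeing the true label, discard every hypothesis it contradicts.
        return [h for h in hypotheses if h(x) == y]

A “don't know” is only answered when at least two hypotheses disagree, so observing the true label then eliminates at least one of them; this is what bounds the number of “don't know” answers by |H| - 1 (see the HW5 card below).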

72
Q

KWIK is a learning framework like PAC (probably approximately correct) learning. Why do we even need a learning framework like PAC or KWIK?

A

PAC (probably approximately correct) - Correct labels are provided in the training data, and the learner must, with high probability, be accurate to within a specified tolerance; some mistakes are allowed because correctness is only approximate. A good fit for supervised learning problems.

KWIK - No mistakes are allowed; instead the learner may answer “don't know”, and the inputs may be chosen adversarially. It learns online and is limited by the (polynomial) number of “don't know” answers it may give.

Frameworks like these let us state and prove formal guarantees (sample, mistake, or “don't know” bounds) and compare learning algorithms on an equal footing.

73
Q

What’s the worst case number of “don’t know” answers in your algorithm for HW5?

A

The size of the hypothesis space minus 1:

|H| = N * (N - 1)

Bound = |H| - 1

N - The size of the observation space (i.e. in Homework 5, this was the number of patrons)

74
Q

What if instead of being given “fight_occurred” you are given “did peacemaker come?” and “did instigator come?” Can you create a KWIK algorithm that outputs fewer “don’t knows” than HW5?

A

Yes because you are given the required information directly. The worst case is now just N-1.

The worst case would be where all patrons come on first night and one drops out each succeeding night. For 8 patrons, you know on event 7 (worst case) when you’re left with two patrons. You’ve found both; but, not identified which is which. One more time with them split to identify them exactly. … so maybe worst case is N.

75
Q

True or False: KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems. Why?

A

False. For some hypothesis classes, KWIK can take more time than MB since it predicts “don’t know” whereas MB can make a guess when uncertain.

76
Q

Let’s say we are doing a repeated version of rock paper scissors and we want to maximize the usual average payoff. What’s the minimax profile that we learned in game theory reloaded, of this new rock paper and scissors game?

A

I believe the minimax profile is (-1, -1): when forced to commit to a pure strategy, a player cannot protect themselves from the other player best-responding against them.

The security profile would be (0, 0), with the best (security) strategy being to randomize uniformly over the three actions.

77
Q

Then what’s the feasible and “preferable region” of this new rock paper and scissors game?

A

The feasible region is the line segment between (1, -1) and (-1, 1), passing through (0, 0) (the game is zero-sum, so u1 + u2 = 0 everywhere). The “preferable region” — feasible profiles at least as good as the security levels for both players — is just the single point (0, 0).

78
Q

What does the folk theorem / two player plot tell us? What are all the Nash equilibria in this new version of rock paper scissors?

A

The folk theorem says that any feasible payoff profile that strictly dominates the minimax/security profile can be sustained as a Nash equilibrium of the repeated game (using threats). Here the only feasible profile acceptable to both players is (0, 0), so every equilibrium of the repeated game gives average payoff (0, 0); for example, both players playing a uniformly random mixed strategy each round is a Nash equilibrium.

79
Q

What’s the subgame perfect equilibrium of this game given by computational folk theorem? How do we find it in poly time claimed by computational folk theorem?

A

For repeated rock-paper-scissors the game is zero-sum, so the construction in the computational folk theorem falls back to solving the stage game as a zero-sum game with a linear program: both players play their minimax (uniformly random) strategy every round, which is subgame perfect and yields the security payoff (0, 0). More generally, the computational folk theorem says such a subgame-perfect equilibrium of the repeated game can be built in polynomial time, either around a mutually beneficial joint strategy enforced by threats or, when that is impossible, from the LP solution of the zero-sum game.

80
Q

Why does the “folk theorem” help us to solve prisoner’s dilemma (i.e. so that people will not always defect to each other)?

A

By using threats, any payoff profile that strictly dominates the minimax profile can be realized as an equilibrium of the repeated game. In the prisoner's dilemma, mutual cooperation dominates the minimax (mutual defection) profile, so cooperation can be sustained by the threat of retaliation if either player defects.

81
Q

True or False: MDPs are a type of Markov game. Why?

A

True. A Markov game is a generalized form of an MDP: an MDP is simply a Markov game with a single player (no opponents).

82
Q

True or False: LP is the only way to efficiently solve an MDP for its optimal policy. Why?

A

False. Dynamic programming techniques like value iteration and policy iteration can solve an MDP for its optimal policy in polynomial time.

83
Q

True or False: The objective of the dual LP presented in lecture is minimization of “policy flow” (the minimization is because we are aiming to find an upper bound on “policy flow”). Why?

A

False, the objective is to maximize the “policy flow”.

84
Q

True or False: Nash equilibria can only be pure, not mixed. Why?

A

False. A pure Nash equilibrium is one in which players do not randomize over actions, while a mixed equilibrium has players randomizing (taking each action with some probability). Some games have only mixed equilibria; rock-paper-scissors is the classic example.

85
Q

True or False: An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information. Why?

A

False. For a two-player, zero-sum, finite deterministic game with perfect information, minimax equals maximin and an optimal pure strategy always exists (it can be found by backward induction on the game tree).

86
Q

True or False: The “folk theorem” states that the notion of threats can stabilize payoff profiles in one-shot games. Why?

A

False. The folk theorem is about repeated games, not one-shot games: threats only have force when there are future rounds in which they can be carried out. (And in finitely repeated games with a known last round, cooperation unravels, as it becomes rational to cease cooperating before the final turn.)

87
Q

True or False: If following the repeated game strategy “Pavlov”, we will cooperate until our opponent defects. Once an opponent defects we defect forever. Why?

A

False. This is the “Grim Trigger”… Pavlov cooperates whenever both players make the same choice (CC or DD), otherwise chooses D

88
Q

True or False: “Subgame perfect” means that every stage of a multistage game has a Nash equilibrium. Why?

A

False. It means that the players' strategies restricted to every subgame of the game form a Nash equilibrium of that subgame.

89
Q

How is an option defined?

A

An option is a generalization of an action: a temporally extended “macro-action” defined by a triple (I, π, β), where I is the set of states in which the option can be initiated, π is the policy followed while the option executes, and β gives the probability of terminating in each state. Primitive actions are the special case of one-step options.

90
Q

How is SMDP defined? Why is it called semi-Markov?

A

An SMDP (semi-Markov decision process) is an MDP in which actions (such as options) take variable, possibly random, amounts of time, so transitions and rewards depend on both the state and the duration of the action. It is called semi-Markov because the process is Markov only at decision points; between decisions, the dynamics depend on the time elapsed since the current option began, not just on the current state.

91
Q

Does SMDP generalize over MDP?

A

Yes. An SMDP generalizes an MDP (primitive one-step actions are a special case of options). How useful the generalization is depends on how the options are constructed: constructed poorly, you lose the convergence and optimality guarantees you had over the underlying MDP.

92
Q

Can we use value iteration on SMDP given a model (transitions and reward function)? Notice that not all options can be executed in all states. Is that a problem for our value iteration on SMDP?

A

Yes. Given the model we can run value iteration with the SMDP Bellman backup, where an option's backup uses its expected cumulative discounted reward and discounts the next state's value by gamma^k for an option lasting k steps. The fact that not every option can be executed in every state is not a problem: in each state the max is simply taken over the options whose initiation set includes that state.

93
Q

Is value iteration guaranteed to converge for an SMDP?

A

Yes

94
Q

Is value iteration guaranteed to converge to the optimal policy for an SMDP?

A

No, it only guarantees hierarchical optimality.

95
Q

Can we do Q learning on SMDP? If so, how do we do that?

A
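
Yes. SMDP Q-learning updates the value of the option just executed using the reward accumulated while it ran, discounted by the option's duration:

    Q(s, o) ← Q(s, o) + α [ r + γ^k max_{o'} Q(s', o') − Q(s, o) ]

where the option o was initiated in s, ran for k steps, terminated in s', and r is the cumulative discounted reward received along the way.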
96
Q

True or False: Options over an MDP form another MDP. Why?

A

True.

The options are really just a combination of many atomic actions, rewards, discounts, and state transitions. You could conceivably unravel any option to get back to a conventional MDP.

97
Q

Using the coco value, we calculate the side payments as: P = coco(u, u_bar). But why does u getting P amount of side payment make sense?

A

The Agents only cooperate if each benefits more from cooperating than acting alone. In many cases, cooperation leads to one Agent getting a high reward with the other Agent not being as well rewarded. Without a side payment from the Agent that was highly rewarded, the poorly rewarded Agent would not have reason to cooperate

The side payment must be high enough that cooperating is the best option for the poorly rewarded Agent, while low enough that the well-rewarded Agent still benefits more than acting alone even after paying it out.