final exam review Flashcards

1
Q

How is entropy used in clustering?

A

entropy-based information gain (IG) ranks attributes by how much information they contribute: 0 = no information, 1 = maximum information
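For reference, the standard entropy and information-gain formulas behind that ranking (not spelled out on the card, but these are the usual definitions):

```latex
H(S) = -\sum_{v} p(v)\,\log_2 p(v),
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```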

2
Q

Describe the SLC (single linkage clustering) algo

A
  1. each object starts as its own cluster
  2. merge the 2 closest clusters
  3. repeat n − k times, until k clusters remain (see the sketch below)
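A minimal sketch of the procedure in Python (a naive implementation for illustration; in practice a library routine such as scipy's single-linkage would be used):

```python
import numpy as np

def single_linkage(points, k):
    """Naive single-linkage clustering: repeatedly merge the two clusters
    whose closest pair of points is nearest, until k clusters remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]                  # each object starts as its own cluster
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    while len(clusters) > k:                                      # i.e. merge n - k times
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()  # single-link (closest-pair) distance
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])                           # merge the 2 closest clusters
        clusters.pop(b)
    return clusters

# toy usage: 5 points in 2-D, ask for 2 clusters
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
print(single_linkage(X, k=2))
```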
3
Q

what is an important consideration for SLC?

A

how you define distance, i.e. what counts as “close” for clusters/data points. This choice can be considered your domain knowledge

4
Q

Describe K-means algo

A
  1. pick k centers at random.
  2. for each center
    a. claim the closest points
    b. recompute the centroid (per-dimension mean)
  3. repeat until convergence

define a consistent tie-breaking metric (see the sketch below).
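A minimal K-means sketch in Python (assumes Euclidean distance; argmin's first-index behavior acts as the consistent tie-break mentioned above):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Basic K-means: pick k centers at random, let each center claim its closest
    points, recompute centroids, repeat until the assignments stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # 1. pick k centers at random
    labels = None
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        new_labels = dists.argmin(axis=1)                     # 2a. each center claims its closest points
        if labels is not None and np.array_equal(new_labels, labels):
            break                                             # 3. converged: assignments no longer change
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):                           # keep the old center if a cluster goes empty
                centers[j] = X[labels == j].mean(axis=0)      # 2b. recompute the centroid (per-dimension mean)
    return centers, labels
```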

5
Q

what randomized optimization algo is K-means similar to?

A

hill climbing - take steps to move to a better configuration

6
Q

What is a pitfall of K-means? How can you avoid it?

A

can get stuck in local optima. Avoid this by doing random restarts, or by defining your space to be points of the convex hull.

7
Q

K-means time complexity

A

each iteration is O(kn)

the total number of iterations is bounded by O(k^n), since there are only finitely many possible assignments of points to clusters

8
Q

K-means properties

A

error is monotonically non-increasing across iterations.

the error metric is the squared distance of points from their centers/means

9
Q

EM algo overview

A

soft clustering technique that assigns points to clusters based on probability distributions.

  1. select k Gaussian distributions (e.g., with randomly chosen means)
  2. E-step: find the probability that each point (x_i) belongs to each Gaussian distribution
  3. M-step: recompute each cluster mean as the weighted average of the x_i’s (weighted by those probabilities)
  4. repeat until the means don’t move much (see the sketch below)
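A minimal EM sketch for a 1-D mixture of k Gaussians, simplified by fixing unit variances and equal mixing weights (full EM would also re-estimate those):

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, seed=0):
    """Simplified EM: soft-assign points to k fixed-variance Gaussians (E-step),
    then recompute each mean as a weighted average (M-step)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)                  # 1. pick k Gaussians (random means)
    for _ in range(iters):
        # E-step: probability that each point x_i belongs to each Gaussian (soft assignment)
        resp = np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: recompute each mean as the responsibility-weighted average of the points
        new_means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        if np.allclose(new_means, means, atol=1e-6):              # means barely move -> stop
            break
        means = new_means
    return means, resp

# toy usage: two well-separated groups of points
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
print(em_gmm_1d(x, k=2)[0])
```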
10
Q

EM Properties

A
  • does not diverge
  • doesn’t fully converge, but practically does
  • can get stuck in local optima, need restarts
  • monotonically non-decreasing likelihood.
  • If E, M solvable, works with any distribution.
11
Q

Can EM or K-means ever act in the same way?

A

if the cluster probabilities are pushed to 0 or 1 (i.e. an argmax assignment), EM behaves the same as K-means

12
Q

What are some properties of clustering?

A

richness
scale invariance
consistency

13
Q

what is cluster richness

A

all possible partitions are achievable: for any clustering you want, there is some distance matrix D for which the clustering scheme P_D produces that clustering

14
Q

scale invariance

A

multiplying all distances by a positive scalar does not change the clustering (i.e., changing units)

15
Q

cluster consistency

A

making similar things more similar and different things more different will not change the groupings.

16
Q

what is the impossibility thm

A

no clustering algorithm can have all three of consistency, richness, and scale invariance

17
Q

what is game theory?

A

the mathematics of conflict. games can have single or multiple agents.

18
Q

what is a strategy

A

a mapping from every possible state to an action.
each player has their own strategy
it only maps over states attainable by that player

19
Q

what is the difference between a pure and a mixed strategy?

A

pure strategies are single, deterministic moves by a player, while mixed strategies are probability distributions over the player’s pure strategies.

In a pure strategy a player chooses an action for sure, whereas in a mixed strategy they choose a probability distribution over the set of actions available to them.

20
Q

what is minimax

A

for a game between A and B, A considers the best-case counter while B considers the worst-case counter.

this is all about point of view.

A picks the strategy that maximizes its own value, knowing the opponent is trying to minimize that value.

also known as covering your ass.

21
Q

what is the fundamental results for a 2 player, zero sum, deterministic game of perfect info?

A
  • minimax = maximin
  • there always exists an optimal pure strategy for each player.
  • the value of the game is the entry of the payoff matrix reached when A and B both play their optimal strategies, acting rationally.
22
Q

what is the effect of hidden information?

A
  • example: mini-poker
  • minimax no longer equals maximin, because one player’s best strategy depends on what the opponent is going to do.

23
Q

what is a Nash Equilibrium

A

for n players with strategies S1, …, Sn:
these strategies are in a Nash equilibrium iff, for each player, their strategy maximizes their utility given the other players’ strategies.

This is saying that no player in the game has a reason to unilaterally change their strategy.

a unique NE exists if the elimination of strictly dominated strategies eliminates all but one combination.

any NE will survive the elimination of strictly dominated strategies

24
Q

T/F : for a finite # of players and finite # of strategies, there exists a NE

A

True (possibly in mixed strategies)

25
Q

T/F: in a repeated game, the last state is the only one that matters

A

True, because eventually the NE will dominate. Even though we’ve built up trust, we will eventually defect because it is in our best interest, and this reasoning unravels back to the start.

n repeated games -> n repeated NE

26
Q

what is the expected number of rounds for an IPD?

A

1/(1 − gamma), where gamma is the probability the game continues for another round
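A short derivation, treating gamma as the probability that the game continues for another round (a standard geometric-series argument):

```latex
\mathbb{E}[\text{rounds}]
= \sum_{t=0}^{\infty} \Pr(\text{at least } t+1 \text{ rounds})
= \sum_{t=0}^{\infty} \gamma^{t}
= \frac{1}{1-\gamma}
```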

27
Q

describe tit for tat

A
  • cooperate on the 1st round
  • each round thereafter, copy the opponent’s previous move (see the sketch below)
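The strategy is simple enough to write out directly (a minimal sketch; the move labels 'C'/'D' for cooperate/defect are just an assumed encoding):

```python
def tit_for_tat(opponent_history):
    """Tit-for-tat: cooperate on the first round, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]

print(tit_for_tat([]))            # 'C' -> cooperate on the 1st round
print(tit_for_tat(["C", "D"]))    # 'D' -> copies the opponent's last move
```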

28
Q

T/F: when facing TFT, you should always cooperate for low gamma

A

False. You should always defect for low gamma because there will be a small number of games being played.

For high gamma, you should always cooperate when facing TFT.

29
Q

What is a finite state strategy?

A

a strategy represented by a deterministic finite state machine, where one player’s choices affect their own payoff and the opponent’s future decisions.

30
Q

what is the folk theorem?

A

in repeated games, the possibility of retaliation opens the door for cooperation.

Any feasible payoff profile that strictly dominates the minimax/security level profile can be realized as a NE payoff profile, with sufficiently large gamma.

Proof idea: because the profile strictly dominates the minimax profile, it can be enforced as a threat, and each player is best off doing what it is told.

31
Q

what is the minimax profile for an IPD?

A

it is the pair of payoffs, one for each player, representing what each player can attain while defending itself from a malicious opponent.

  • it can be found by solving the game as if it were zero-sum.
  • it bounds the acceptable region inside the feasible region of a 2-player game.
32
Q

what is the security level profile?

A

the pair of payoffs each player can guarantee themselves in an adversarial situation (essentially the minimax profile); payoffs that are better than this for both players form the acceptable region.

33
Q

what is subgame perfect?

A
  • each player takes the best response independent of the history in the game.
34
Q

what makes a game not subgame perfect?

A
  • a game is not SGP if there exists some history of actions given to the players after which one of the players is not taking a best response
  • e.g., a player makes a threat, but when it comes time to act on the threat, carrying it out is worse for itself than what it would do otherwise.
35
Q

Describe Grim Trigger

A
  • cooperate, but if the opponent ever defects, defect forever after.
  • a strategy that can be used to prove the folk theorem
  • relies on an implausible threat -> always taking vengeance and forgoing future rewards is irrational and unreasonable
36
Q

T/F: Grim Trigger vs. TFT is subgame perfect

A

False. The pairing is not subgame perfect: there are histories where Grim Trigger keeps defecting (carrying out its threat) even though cooperating would be better for it.

37
Q

what is an example of a subgame perfect game?

A

Pavlov v. Pavlov

average reward for this game is mutual cooperation

38
Q

what is the difference between a plausible and implausible threat?

A

implausible threat - taking vengeance and forgoing rewards without a way to improve it for yourself.

plausible threat - defect knowing that you will be ok/get a positive state afterwards.

39
Q

how can you update the q-learning equation to be applicable for a stochastic game?

A
  • utilize the joint actions of the players (learn Q-values over state and joint-action pairs)
  • change the discounted expected value of the next step to be the minimax of the Q-values over the joint actions of the next state (minimax-Q):
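Written out, this is the standard minimax-Q update (Littman); alpha is the learning rate, a is the learner's action, and o is the opponent's action:

```latex
Q(s,(a,o)) \;\leftarrow\; (1-\alpha)\,Q(s,(a,o))
\;+\; \alpha\Big(r + \gamma \max_{\pi \in \Delta(A)} \min_{o'} \sum_{a'} \pi(a')\, Q\big(s',(a',o')\big)\Big)
```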

40
Q

what are the properties of a zero-sum stochastic game?

A
  • VI works
  • minimax-Q converges
  • there exists a unique solution to Q*
  • policies can be computed independently
  • the update is efficient
  • Q functions are sufficient to specify the policy
41
Q

what is the issue with general sum stochastic games?

A

when the discounted expected value of the next step is changed to use the Nash equilibrium of a joint action, all of the nice zero-sum properties are lost:

  • VI does not work because the operation does not converge
  • Nash-Q does not converge
  • there is no unique solution to Q*
  • policies cannot be computed independently because Nash equilibria describe joint behavior
  • the update is not efficient
  • Q functions are not sufficient to specify the policy
42
Q

What is the No Free Lunch Theorem?

A
  • an impossibility theorem telling us that a general-purpose universal optimization strategy is impossible.
  • the only way one strategy can outperform another is if it is specialized to the structure of the specific problem under consideration
  • ‘if anything is possible, then nothing can be expected’
  • any two optimization algorithms are equivalent when their performance is averaged across all possible problems
43
Q

What does the NFLT tell us?

A

if we cannot make any prior assumptions about the optimization problem we are trying to solve, no strategy can be expected to perform better than any other.

44
Q

For the matrix interp of NFLT, what are the rows and columns?

A
  • the rows of the problem matrix (P) are the strategies
  • the columns are the universe of all possible problems.
  • the entries are the performances of those strategies on the problems.
45
Q

T/F: for the NFLT, on average all search algorithms perform no better than random search.

A

True. If we don’t have any prior assumptions about the function we are searching over, then on average no algorithm can be expected to do better than random search.

46
Q

what is an MDP and what is needed to define it?

A

Markov decision process.

States, actions, a model (transition probabilities), and rewards; the policy is the solution we solve for.

47
Q

T/F: MDP’s care about data from past, present, and future

A

False. For MDPs, only the present matters (the Markov property). The rules are stationary.

48
Q

how can you add history to your MDP?

A

fold history into the state: add more information to your state so that the current state remembers everything it needs from the past.

49
Q

T/F: Reward and utility are the same thing.

A

False!
Reward is immediate
Utility includes the immediate reward, but also the discounted expected value of the future states stemming from the current one. Utility is all about delayed reward (temporal credit assignment).

50
Q

what is the utility of an infinite horizon problem?

A

bounded by R_max / (1 − gamma)
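The bound follows from the geometric series over discounted rewards:

```latex
U \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}
```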

51
Q

describe process of VI

A
  1. starting from the goal state, calculate the value of the neighboring states with reference to the discount factor.
  2. on subsequent iterations, VI considers states further and further from the goal.
  3. VI converges when it settles on fixed values for each state.
  4. the policy is then found by taking the argmax over the possible actions for each state (see the sketch below).
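A minimal value-iteration sketch in Python. The MDP encoding here (P[s][a] as a list of (probability, next_state) pairs, R[s] as the reward for state s) is an assumption for illustration, not a standard API:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update until the values stop changing,
    then read the policy off with an argmax over actions."""
    n_states = len(P)
    V = np.zeros(n_states)                                 # start from arbitrary (zero) values
    while True:
        V_new = np.array([
            R[s] + gamma * max(                            # Bellman update: best action's expected value
                sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:                # converged: values have effectively stopped changing
            V = V_new
            break
        V = V_new
    policy = [                                             # argmax over the possible actions for each state
        max(range(len(P[s])), key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
        for s in range(n_states)
    ]
    return V, policy

# toy usage: 2 states, 2 actions; state 1 is an absorbing, rewarding state
P = [
    [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],    # state 0: action 0 stays put, action 1 usually reaches state 1
    [[(1.0, 1)], [(1.0, 1)]],              # state 1: absorbing
]
R = [0.0, 1.0]
print(value_iteration(P, R))
```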
52
Q

T/F: Value iteration is a linear calculation

A

False. VI is non-linear because of the max over actions inside the Bellman update of the discounted expected utility.

53
Q

describe policy iteration

A
  1. start with an initial guess at a policy (e.g., random)
  2. evaluate the given policy for all states it covers
  3. after evaluation, PI explores different actions that could improve the value assigned to each state
    - if the value improves with a new action, update the policy
  4. repeat 2 and 3 until the policy does not change

the evaluation step is a linear calculation with n equations in n unknowns (see the sketch below).
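A matching policy-iteration sketch, using the same assumed MDP encoding as the value-iteration sketch above; the evaluation step is the linear solve:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation (a linear system) with greedy policy improvement
    until the policy stops changing."""
    n_states = len(P)
    policy = [0] * n_states                                 # 1. initial guess at a policy
    while True:
        # 2. policy evaluation: solve (I - gamma * P_pi) V = R, i.e. n equations in n unknowns
        A = np.eye(n_states)
        for s in range(n_states):
            for p, s2 in P[s][policy[s]]:
                A[s, s2] -= gamma * p
        V = np.linalg.solve(A, np.asarray(R, dtype=float))
        # 3. policy improvement: switch to any action that improves a state's value
        new_policy = [
            max(range(len(P[s])), key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
            for s in range(n_states)
        ]
        if new_policy == policy:                            # 4. stop once the policy no longer changes
            return V, policy
        policy = new_policy
```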

54
Q

what are the differences between PI and VI?

A
  • VI is non-linear while PI is linear.
  • PI is linear because rather than finding the max over all actions, we already know the action as determined by the current policy.
  • PI can take fewer iterations to converge, but each iteration is more computationally expensive when compared to VI.
55
Q

what makes Q-learning model free?

A

Q-learning is not provided any domain knowledge in terms of the transition probabilities and rewards.

56
Q

What is the Q-value

A

the value for arriving in state s, leaving via action a, and adding the discounted expected value of the next state s'.

once in s', take the action with the highest Q-value and proceed optimally thereafter.

57
Q

what are the constraints on the learning rate for Q-learning?

A

the sum of the learning rates must diverge (go to infinity) in the limit: sum of alpha_t = infinity.

the sum of the squared learning rates must be finite in the limit: sum of alpha_t^2 < infinity.

58
Q

Describe the Q-learning algorithm

A
  • initialize the Q-values for all state–action pairs (randomly or to zero).
  • Q-update: take the current state–action pair and move its value a small amount (alpha) toward the immediate reward plus the discounted estimated value of the next state.
  • choose actions with a simulated-annealing-like approach: take a random action or the best action depending on the parameter epsilon.
  • repeat until convergence (see the sketch below).
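A minimal tabular Q-learning sketch in Python. The environment interface used here (env.reset(), env.step(a) returning (next_state, reward, done), env.n_actions) is an assumption for illustration, not any particular library's API:

```python
import random

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration; Q-values default to zero."""
    Q = {}
    def q(s, a):
        return Q.get((s, a), 0.0)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: random action with prob epsilon, otherwise the best known action
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda a_: q(s, a_))
            s2, r, done = env.step(a)
            # Q-update: move a small amount (alpha) toward the immediate reward plus
            # the discounted estimate of the best value of the next state
            target = r + gamma * max(q(s2, a_) for a_ in range(env.n_actions))
            Q[(s, a)] = (1 - alpha) * q(s, a) + alpha * target
            s = s2
    return Q
```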
59
Q

what is needed for Q-learning convergence.

A

assume the system is a deterministic MDP
rewards are bounded
the agent selects actions such that it visits each state–action pair infinitely often

60
Q

what is the exploitation exploration dilemma?

A

describes a fundamental tradeoff in reinforcement learning.

  • how can you balance learning about your environment and using (getting rewards from) your environment
  • these are conflicting objectives that are limited by computation constraints, time, data acquisition, etc.
61
Q

what randomized optimization algorithm is utilized for q-learning? how?

A

a simulated-annealing-like approach for action selection: a mixture of choosing randomly and exploiting Q-hat.
- choose the (estimated) optimal action with probability 1 − epsilon
- choose a random action with probability epsilon

as long as epsilon > 0 (and small) and the MDP is fully connected, each state–action pair is visited infinitely often.

62
Q

what is a fully connected MDP

A

every state can be reached from every other state by some sequence of actions; no state is unreachable.

63
Q

what is the filtering algorithm for feature selection?

A

takes the set of features and uses a search/criterion up front to reduce the number of features, then provides the reduced set to the learner (see the sketch below).
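A small filtering sketch, assuming scikit-learn's SelectKBest as the filter with mutual information as the criterion (any score computed without learner feedback works the same way):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
# score each feature in isolation, keep the best 2, and only then hand
# the reduced feature set to whatever learner comes next
X_reduced = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_reduced.shape)   # (150, 2)
```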

64
Q

what are pros and cons of filtering

A

pro:
- FAST
con:
- no learner feedback
- reviews features in isolation, ignoring feature pairings/interactions

65
Q

describe wrapping

A

takes the features and feeds subsets of them to a learner
the learner reports how well each subset performed
the subset of features is updated accordingly
repeat until added features no longer improve performance (see the sketch below)
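A small wrapping sketch, assuming scikit-learn's SequentialFeatureSelector as the search and a k-NN classifier as the learner providing the feedback:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
learner = KNeighborsClassifier(n_neighbors=3)
# repeatedly feed candidate subsets to the learner, score them by cross-validation,
# and keep growing the subset while the score improves
selector = SequentialFeatureSelector(learner, n_features_to_select=2, direction="forward")
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)   # (150, 2)
```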

66
Q

what are pro and cons of wrapping?

A

pros:

  • learner gets good features through communicating with the search algo.
  • this technique takes advantage of the learner’s bias.

cons:
- SLOW

67
Q

what are examples of domain knowledge for filtering?

A
info gain,
variance,
entropy,
“useful” features (useful as deemed by the learner),
independence/non-redundancy
68
Q

what are the types of relevance and how are they defined?

A

strongly relevant
- x_i is strongly relevant if removing it degrades the bayes optimal classifier

weakly relevant

  • x_i is not strongly relevant
  • x_i can pair with other elements to improve the BOC.
  • i.e., there exists some subset of features S such that adding x_i to S improves the performance of the BOC

irrelevant
- x_i provides no info to the BOC

69
Q

what is the bayes optimal classifier?

A

The BOC is the best you can do on average. It does not correspond to any single hypothesis; it is simply the best possible classifier for the problem.
BOC takes the weighted average of all hypotheses, based on the probability of the correct hypothesis given the data.

70
Q

T/F: usefulness and relevance can be used to describe the same thing.

A

False
Relevance describes the information gained by a feature when used in a learner.
Usefulness describes the error decreased by a feature when used in a learner.

71
Q

what is usefulness in the case of feature selection?

A

Usefulness measures the effect on a particular predictor/classifier
It can be measured as the error given a particular model/learner.
As usefulness increases, this decreases the error for your learner

72
Q

what is feature transformation

A

the problem of pre-processing a set of features to create a new feature set, while retaining as much information as possible.

73
Q

T/F: Feature selection and feature transformation both output a new subset of the original features to be used by a learner

A

False
Feature selection does return a subset of the original features.
Feature transformation creates new features, each of which is a (typically linear) combination of the original features.

74
Q

what is polysemy? what is synonymy?

A

In information retrieval:
polysemy: one word with multiple meanings -> gives false positives
synonymy: many words that mean the same thing -> give false negatives

75
Q

What does PCA do?

A

finds the direction of maximal variance of the data, and then further directions that are mutually orthogonal to it. This can be thought of as maximizing how distinct the component vectors are, so that each vector tells a different, unique story about the data.

PCA transforms the data into a new space where feature selection can then work (see the sketch below).
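A tiny PCA sketch, assuming scikit-learn's PCA (NumPy's SVD would work just as well):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 3 * X[:, 0] + 0.1 * X[:, 1]         # make two features strongly correlated

pca = PCA(n_components=2)
X_new = pca.fit_transform(X)                  # data re-expressed along the top-2 variance directions
print(pca.explained_variance_ratio_)          # eigenvalue-based ordering of the components
```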

76
Q

What are some properties of PCA?

A
  • Provides the best reconstruction error -> minimizes L2 error when moving to a smaller dimensional basis.
  • maximizes the variance
  • orthogonal: global algorithm that finds the perpendicular (in 2D case) components
  • eigenproblem: each principal component has an eigenvalue, and the eigenvalues are monotonically non-increasing as the component number goes up
  • well studied so fast implementations exist.
  • produces ordered features: based on eigenvalues
  • finds global features
77
Q

what does an eigenvalue of 0 mean for a PCA component

A

an eigenvalue of 0 means that direction carries no variance (no information/entropy), so the component is irrelevant and can be dropped

78
Q

what feature selection algorithm is PCA similar to?

A

filtering because you remove components with small eigenvalues

79
Q

what does ICA attempt to do?

A

ICA tries to maximize independence through a linear transformation. Given hidden variables (X) and observables (Y), ICA wants to find new features that are mutually independent of one another while maximizing the mutual information between the new features and the original data.

In other words, ICA unmixes the observed (mixed) data into independent components without losing information about the observables.

Think about the cocktail party problem (see the sketch below).
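A small cocktail-party sketch, assuming scikit-learn's FastICA as the ICA implementation: two independent sources are mixed into two "microphones", and ICA recovers statistically independent components from the mixtures:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]      # two independent "speakers"
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
observed = sources @ mixing.T                               # each "microphone" hears a linear mix

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)                     # mutually independent components
print(recovered.shape)                                      # (2000, 2)
```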

80
Q

T/F: ICA is a linear transformation

A

True

81
Q

What is the cocktail party problem?

A

talkers (hidden variables) and microphones (observables)
microphones pick up a little bit of every talker (linear combination)
how can we extract individual voices from the combined data.

we desire information between the mic’s and the voices to be as high as possible (max mutual info) w/each feature being independent

82
Q

what is the difference between hidden variables and observables in the the BSS?

A

hidden variables: causing events to happen
observables: recording the events. Each observable contains a linear combination of all the hidden variables. Given the observables, can reconstruct the hidden variables.

83
Q

Describe some properties of ICA

A
  • mutual independence between the newly transformed features
  • maximizes mutual information between the newly transformed features and the original data/features
  • finds local features
  • directional
84
Q

T/F: there are cases where PCA and ICA act the same way for a set of features

A

True
If the underlying distributions are all Gaussian, then uncorrelated features are also independent, so the uncorrelated features PCA finds can be the same as the independent features ICA finds.

85
Q

what is the central limit theorem

A

the sum of many mutually independent random variables tends toward a Gaussian distribution in the limit.

86
Q

when run over a face, what would PCA find? ICA?

A

PCA:
brightness (usually thrown out bc it is the avg), average face -> global features

ICA:
noses, eye selectors, hair selectors -> Local features

87
Q

what do PCA, ICA, RCA, and LDA essentially do?

A

they all find and give insight to the fundamental features of your data

88
Q

what is the main benefit to RCA?

A

Speed: it is very fast (random projections are cheap to generate and apply).

89
Q

what is LDA?

A

finds projections/features that discriminate based on the label.
in the binary case, this amounts to finding a separating direction (similar in spirit to an SVM).
LDA uses the values of the projections to re-represent the data (see the sketch below).
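A short LDA sketch, assuming scikit-learn's LinearDiscriminantAnalysis; unlike PCA/ICA, it uses the labels to choose the projections:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_new = lda.fit_transform(X, y)       # data re-represented by its class-discriminating projections
print(X_new.shape)                    # (150, 2)
```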

90
Q

T/F: PCA performs well on non-gaussian distributions

A

False:
PCA desires gaussian distributions
ICA works well on non-gaussian data.

91
Q

What are the assumptions of PCA?

A
  1. assumes all basis vectors are orthonormal.
  2. the directions with the largest variances are the most “important” or most principal.

92
Q

What properties does SLC have when n/2 clusters are reached?

A

Scale invariance, consistency.

Not rich b/c there is a fixed number of clusters

93
Q

What properties does SLC have when clusters formed are theta units apart?

A

richness, consistency

no scale invariance due to the fixed distance separation

94
Q

what properties does SLC have when clusters formed are theta/w units apart, where w = max inter-cluster distance?

A

Richness, scale invariance

Not consistent bc as w increases there can be a point where theta/w is so small nothing gets clustered.

95
Q

how can the E and M steps of EM be interpreted?

A

E-step can be interpreted as constructing a lower bound to the posterior distribution. This is a “soft” assignment that, once computed, assigns a posterior probability to each possible association of each individual sample

M-step optimizes this bound, thereby improving the estimate for the unknowns. The lower bound is maximized, and the corresponding new estimate is guaranteed to lie closer to the location of the nearest local maximum of the likelihood.

96
Q

SLC time complexity

A

O(n^3)