Final Flashcards
Quiz: PCA vs ICA
- Mutually Orthogonal
- PCA
- Mutually Independent
- ICA
- uncorrelated implies independent only for certain distributions (e.g. Gaussians), so PCA may happen to find independent features, but this is not guaranteed
- Maximal Variance
- PCA
- NO on ICA
- maximizing variance tends to mix independent sources together (sums of independent variables look more Gaussian, per the central limit theorem)
- Maximal Mutual Information
- ICA
- Maximize joint mutual info between original and transformed features
- Ordered Features
- PCA
- eigenvalues ordered by variance
- Not for ICA
- kurtosis can be used to order but generally order is not considered
- Bag of Features
- ICA
Definition of Utility (U)
U(s) = R(s) + gamma * max_a sum_s' [ T(s,a,s') U(s') ]
Long term value of being in a state
What to do when we are facing Tit For Tat?
- Always defect
- best for low gamma
- always cooperate
- best for high gamma
Ways to implement wrapper-based feature selection
- Hill Climbing
- Randomized Optimization
- Forward/Backward Selection
How do ICA and PCA differ?
- ICA
- solves blind source separation problem
- directional
- images
- finds noses, mouth, hair
- local “parts of”
- natural scenes
- finds edges
- independent components of the world are edges
- documents
- gives topics
- can understand fundamental features of data
- about probability, information theory
- more expensive, doesn’t always produce answer
- PCA
- not directional
- images
- finds direction of maximal variance (brightness)
- finds the average face next
- global
- natural scenes
- brightness
- average scene
- about linear algebra
What is KL Divergence
Measures the difference between two probability distributions (a distance-like measure, though not symmetric, so not a true metric)
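A minimal sketch of computing KL divergence for two discrete distributions (the probability vectors here are made-up examples), showing it is not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # biased coin
print(kl_divergence(p, q))   # ~0.74 bits
print(kl_divergence(q, p))   # ~0.53 bits -- not symmetric, so not a true metric
```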
Impossibility Theorem
- No clustering scheme can achieve all three of:
- richness
- scale invariance
- consistency
- Mutually contradictory properties
- Can re-define the properties and get *nearly* all 3
Relevance
Feature xi
- Xi is strongly relevant if removing it degrades BOC (Bayes Optimal Classifier, best on avg, weighted avg of hypotheses)
- Xi is weakly relevant if
- not strongly relevant
- there exists a subset of features such that adding Xi to it improves the BOC
- Otherwise Xi is irrelevant
Measures effect on BOC
Relevance ~ Information
Minimax
- 2 agents A, B
- A considers worst case counter while maximizing own value
- B considers worst case counter while minimizing A value
- Finding the maximum minimum or minimum maximum
prisoner's dilemma
The prisoner’s dilemma is a standard example of a game analyzed in game theory that shows why two completely rational individuals might not cooperate, even if it appears that it is in their best interests to do so
Issues with SLC
- Using only the closest distances can result in weird cluster shapes (stringy clusters, because two points end up in the same cluster if they are close to the cluster anywhere)
Repeated Games and the Folk Theorem
- In repeated games, the possibility of retaliation opens the door for cooperation
- Folk Theorem Definition (math):
- Results known, at least to experts in the field, and considered to have established status, but not published in complete form
- Everybody knows, no one gets credit, in the cloud of understanding
- Folk Theorem Definition (game theory):
- Describes the set of payoffs that can result from Nash strategies in repeated games
- Any feasible payoff profile that strictly dominates the minmax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.
KMeans as Optimization - Definition
- Configurations
- Center, P
- Scores
- Err(P, center) = sum_x || center_P(x) - x ||^2 (squared Euclidean distance from each point to its assigned center)
- Neighborhood
- P, center = set of pairs
- (P’, center) U (P, center’ )
- change one or the other
- A lot like Hill Climbing
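A minimal sketch of the k-means loop this card describes, assuming 2-D numpy data: the assignment step updates the partition P, the update step moves the centers, and it stops at a local optimum much like hill climbing:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternately update the partition P and the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (defines P).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        P = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[P == j].mean(axis=0) if np.any(P == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # error stopped improving -- a local optimum, like hill climbing
        centers = new_centers
    return P, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
P, centers = kmeans(X, k=2)
```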
Desirable Clustering Properties
- Richness
- For any assignment of objects to clusters, there is some distance matrix such that partition returns that clustering
- all inputs (distance) are valid and any clustering can be valid
- Not limited in clusters that can be expressed
- Scale-invariance
- change of units does not change clustering
- inches instead of feet
- Consistency
- shrinking intra-cluster distances and expanding inter-cluster distances does not change clustering
- make similar more similar and less similar even less similar, it should not change the groupings
Entropy
Tells us if a feature provides any information
Min # of yes/no questions needed to convey information
Discounted Rewards
- Sum from t = 0 to inf: gamma^t * R(S_t)
- Geometric series
- Rmax / (1 - gamma)
- gamma = 0
- bound becomes Rmax (only the immediate reward matters)
- gamma = 1
- prior case (no discount); the sum can be infinite
- 0 <= gamma < 1
- like covering infinite distance in finite time
- add infinitely many numbers -> get a finite number
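A quick numeric check of the geometric-series bound, using made-up values for gamma and Rmax and assuming the maximum reward Rmax is received at every step:

```python
gamma, Rmax = 0.9, 10.0
discounted = sum(gamma**t * Rmax for t in range(10_000))  # long enough to approximate infinity
print(discounted)             # ~100.0
print(Rmax / (1 - gamma))     # 100.0 -- the closed-form bound
```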
Markov Decision Process Framework
- States
- S
- coordinates, descriptors, represent where we are
- Model
- T(s,a,s’) ~ Pr(s’ | s, a)
- Action
- A(s), A
- Reward
- R(s), R(s,a), R(s,a,s')
- Policy
- pi(s) -> a
- optimal policy, pi*
Sequences of Rewards: Assumptions
- Infinite Horizons
- you can live forever
- if finite
- may take risks if horizon is short
- policy can change even if state is unchanged
- lose stationarity in model
- Utility of Sequences
- if two sequences start with the same state, the preference between them is unchanged when that start state is removed
- stationary preferences
- if I prefer one set of sequences today, I will tomorrow
If X and Y are independent, what are their joint and conditional entropy?
Joint Entropy:
H(X, Y) = H(X) + H(Y)
Conditional Entropy:
H(Y | X) = H(Y)
What is the utility of a sequence of states?
- Sum of the reward over the states
- Utility of an infinite sequence is infinite
- given two infinite sequences, one with larger rewards sprinkled in at random, neither is better than the other
- it doesn't matter what you do if rewards are always positive
Implausible Threats
- Idea that a grim trigger threat is implausible.
- Why would opposing agent threaten when the result is less reward for themselves too?
- Potential issue with Grim Trigger
How can achievable average payoffs be found in Repeated Games and the Folk Theorem
- Plot strategy pairs onto a 2 player plot
- Form a convex hull between strategy pairs
- “feasible region”
- Points within the convex hull are valid average payoffs for joint strategies
3 fundamental theorems of Nash equilibria
- In the n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
- ex: prisoners dilemma
- any N.E will survive elimination of strictly dominated strategies
- if n is finite and each player's set of strategies is finite, there exists a (mixed) N.E.
tf-idf
- term frequency - inverse document frequency
- nearest-neighbor in textual data
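A minimal sketch of one common tf-idf variant (tf = raw count, idf = log(N / df)) over a toy corpus; real implementations differ in smoothing and normalization:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

N = len(docs)
df = Counter(word for doc in docs for word in set(doc))   # document frequency per term

def tfidf(doc):
    tf = Counter(doc)                                     # term frequency in this document
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

# Terms appearing in every document get idf = log(1) = 0; rarer terms get higher weight.
print(tfidf(docs[0]))
```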
What is the Q Function?
Q(s, a) = R(s) + gamma * sum_s' [ T(s,a,s') max_a' Q(s',a') ]
Value for arriving in a state s, leaving via a, then proceeding optimally thereafter.
MDP: Reward
- Reward
- R(s), R(s,a), R(s,a,s')
- mathematically equivalent
- encompasses domain knowledge
Principal Components Analysis (PCA)
- Eigenproblem
- Find direction of maximal variance
- Finds orthogonal directions
- PC2 is orthogonal to PC1
- Properties
- Global algorithm
- Best reconstruction
- no information lost (with all PCs kept)
- Minimize L2 (squared) error moving from n to m dimensions
- Each dimension has eigenvalue
- non-negative
- monotonically non-increasing from PC1 to PC2 to PCn
- Can throw away the components with the smallest eigenvalues (least variance)
- eigenvalue of 0 is ignorable (0 entropy, meaningless)
- well studied (fast algorithms)
- Filter method
- Can hurt classification if a feature associated with the label has low variance (that feature may be inadvertently thrown out); see the sketch below
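A minimal sketch of PCA as an eigenproblem on the covariance matrix (numpy only, synthetic data); the eigenvalues are non-negative and are sorted so variance is non-increasing across components:

```python
import numpy as np

def pca(X, m):
    """Project X (n_samples x n_features) onto the top-m principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigenproblem, ascending order
    order = np.argsort(eigvals)[::-1]            # sort by variance, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs[:, :m], eigvals          # projection + per-component variance

X = np.random.randn(200, 5) @ np.diag([3, 2, 1, 0.1, 0.01])   # synthetic data
Z, variances = pca(X, m=2)   # keep the 2 highest-variance directions
```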
Random Components Analysis
- Generates random directions and then projects the data onto those directions
- works well
- At what?
- Before classification
- Maintains signal in lower dimension
- Picks up some correlations
- projects onto more dimensions than other algorithms like PCA (needs more random directions to capture the same signal)
- Advantage over PCA/ICA
- fast
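A minimal sketch of the random-projection idea (numpy only, synthetic data): generate random directions and project onto them with a single matrix multiply, which is why it is fast:

```python
import numpy as np

def random_projection(X, m, seed=0):
    """Project X (n_samples x n_features) onto m random directions."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], m))   # random directions as columns
    return X @ R                           # one matrix multiply, no eigenproblem

X = np.random.randn(500, 100)
Z = random_projection(X, m=20)             # cheap dimensionality reduction before a classifier
```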
If 2 coins are independent, what is the entropy and mutual information between them?
H(A) = H(B) = -0.5 log_2(0.5) - 0.5 log_2(0.5) = 1 (maximum entropy)
H(A, B) = H(A) + H(B) = 1 + 1 = 2
I(A, B) = H(A) - H(A|B) = 1 - 1 = 0 (no mutual info between them)
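A small check of these numbers, assuming two independent fair coins:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_A = entropy([0.5, 0.5])                    # 1 bit
H_B = entropy([0.5, 0.5])                    # 1 bit
H_AB = entropy([0.25, 0.25, 0.25, 0.25])     # joint of independent coins: 2 bits
I_AB = H_A + H_B - H_AB                      # mutual information: 0 bits
print(H_A, H_B, H_AB, I_AB)
```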
mixed strategy vs pure strategy
- mixed strategy
- some distribution over strategies
- pure strategy
- mixed strategy with all probability mass on a single strategy
- probability 1 of chosen strategy
Properties of EM
- Monotonically non-decreasing likelihood
- not guaranteed to converge (in practice it does)
- in strange examples
- will not diverge
- can get stuck
- random restart
- works with any distribution (if E, M solvable)
- expectation is expensive (generally)
Forward & Backward Feature Selection
- Forward (see the sketch after this card)
- Start with no features
- Repeat until error doesn't reduce
- Try adding each remaining feature, passing the candidate set to the learner
- Keep the feature that gives the best score
- Backward
- Start with all features
- Repeat until error doesn't reduce
- Try removing each feature, passing the candidate set to the learner
- Remove the feature whose removal gives the best score
- Like hill-climbing
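A minimal sketch of forward selection as a wrapper, assuming scikit-learn is available; the learner and the cross-validation scoring here are illustrative choices, not prescribed by the notes:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, estimator, tol=1e-4):
    """Greedy wrapper: add the feature that most improves CV score; stop when it doesn't."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score + tol:
            break                        # error stopped reducing
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = scores[f_best]
    return selected

# Usage sketch (hypothetical X, y): selected = forward_selection(X, y, KNeighborsClassifier())
```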
MDP: Model
- Model
- T(s,a,s’) ~ Pr(s’ | s, a)
- probability of transition into s’ given that you were in state s and took action a
- rules of the game
- physics of the world
Feature Selection: Why?
- Knowledge Discovery
- interpretability & insight
- which features matter
- allow lower dimensions for understanding
- Curse of dimensionality
- need exponentially more data for more features
Value Iteration
- Repeat until convergence
- Start w/arbitrary utilities
- Update utilities based on neighbors
- U'_{t+1}(s) = R(s) + gamma * max_a sum_s' [ T(s,a,s') U'_t(s') ]
- update utility of s by looking at utilities of all other states
- the R(s) is truth
- Why do we converge?
- Each update adds the true reward R(s) to an estimate that starts out wrong; with every iteration more truth accumulates while the wrong part is discounted away, so the estimates converge to the true utilities
- Works because we propagate value across the states
- Order matters, not absolute values
- We don’t care about the actual value, just need utility good enough
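A minimal sketch of value iteration implementing the update above, assuming the model is given as a numpy array T indexed [s, a, s'] and rewards as R[s] (a hypothetical toy MDP, not from the course):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """U_{t+1}(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U_t(s')."""
    U = np.zeros(len(R))                      # start with arbitrary utilities
    while True:
        Q = R[:, None] + gamma * (T @ U)      # shape (states, actions)
        U_new = Q.max(axis=1)                 # greedy over actions
        if np.max(np.abs(U_new - U)) < tol:   # utilities have settled
            return U_new, Q.argmax(axis=1)    # utilities and greedy policy
        U = U_new

# Toy 2-state, 2-action MDP (made-up numbers)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
U, policy = value_iteration(T, R)
```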
Curse of Dimensionality
Amount of data needed grows as roughly 2^N for N features
Adding more features means exponentially more data is needed
Linear Discriminant Analysis (LDA)
- finds a projection that discriminates based on the label
- like supervised learning
- Not filter method like others, cares about label explicitly
Important differences between MDPs and supervised learning
- delayed rewards
- take many actions prior to reward
- ex: chess
- temporal credit assignment
- minor changes matter
- reward changes affect policy
Iterated Prisoner’s Dilemma with Uncertain Ending
- With probability gamma, game continues
- else, game over
- Expected # of rounds
- Finite with gamma < 1
- 1 / (1 - gamma)
- reminiscent of discount factor
Tit-For-Tat
- iterated prisoner’s dilemma strategy
- cooperate on first round
- copy opponent's previous move thereafter
- what TFT does against various opponent strategies
- opponent always defects
- TFT plays: cooperate, defect, defect, defect, …
- opponent always cooperates
- TFT plays: cooperate forever
- opponent alternates: defect, cooperate, defect, cooperate, …
- TFT plays: cooperate, defect, cooperate, defect, …
- TFT (against another TFT agent)
- both cooperate forever
What is the best response to each strategy (gamma > 1/6)
- Mutual Best Response
- Pair of strategies where each is a best response to the other: a Nash Equilibrium
- D - D and TFT - TFT are both Nash Equilibria
- TFT - TFT Cooperative Nash
MDP: States
- States
- S
- coordinates, descriptors, represent where we are
Policy Iteration
- Algorithm
- Start with initial policy (arbitrary)
- Evaluate policy
- calculate U_t (utility) for the policy: U_t(s) = R(s) + gamma * sum_s' T(s, pi_t(s), s') U_t(s')
- n linear equations in n unknowns (unlike VI, there is no max, so they can be solved directly)
- Improve policy
- pi_{t+1}(s) = argmax_a sum_s' T(s,a,s') U_t(s')
- Fewer iterations than Value Iteration
- guaranteed to converge
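A minimal sketch of policy iteration matching the two steps above, using the same hypothetical T[s, a, s'] / R[s] layout as the value-iteration sketch; the evaluation step solves the n linear equations directly:

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)              # start with an arbitrary policy
    while True:
        # Evaluate: solve U = R + gamma * T_pi U  (n linear equations in n unknowns)
        T_pi = T[np.arange(n_states), pi]           # transition matrix under pi
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Improve: greedy one-step lookahead on U
        pi_new = (R[:, None] + gamma * (T @ U)).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, U                            # policy stopped changing
        pi = pi_new
```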
mechanism design
- set up incentives to get particular behavior
- economics/government
- ex: tax breaks for mortgage
Bellman Equation
- True utility of a state
- reward for the state plus discounted rewards
- U(s) = R(s) + gamma * max_a sum_s' [ T(s,a,s') U(s') ]
- non-linear