CS 7641 Final Flashcards

1
Q

Supervised Learning

A

Use labeled training data to generalize labels to new instances
(function approximation)

2
Q

unsupervised learning

A

making sense out of unlabeled data

data description

3
Q

Single Linkage Clustering

A
  • Consider each object a cluster (n objects)
  • Define intercluster distance as the distance between the two closest points in the two clusters
  • Merge two closest clusters
  • Repeat n-k times to make k clusters
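
A minimal Python sketch of the procedure above, assuming plain Euclidean points; the function name and structure are illustrative, not from the course:

```python
# A rough sketch of single linkage clustering on Euclidean points.
import itertools
import math

def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with each point in its own cluster (n clusters).
    clusters = [[p] for p in points]

    def dist(a, b):
        return math.dist(a, b)  # Euclidean distance between two points

    while len(clusters) > k:
        # Intercluster distance = distance between the two closest points,
        # one from each cluster.
        best = None
        for (i, ci), (j, cj) in itertools.combinations(enumerate(clusters), 2):
            d = min(dist(p, q) for p in ci for q in cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters (n - k merges total).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example: three obvious groups on a line collapse into k=3 clusters.
print(single_linkage([(0,), (1,), (10,), (11,), (20,), (21,)], k=3))
```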
4
Q

Intercluster Distance in Single Linkage Clustering

A

the distance between the 2 closest points in the 2 clusters

5
Q

Inter cluster distance represents

A

domain knowledge

6
Q

T/F Single Link Clustering is Deterministic

A

True. Unless there are ties, it will always give the same answer.

7
Q

In SLC, if distances can be represented by edge lengths of a graph, what algorithm is it the same as?

A

Minimum Spanning Tree

8
Q

What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters

A

O(n^3): O(n^2) to look at all pairwise distances to find the closest pair, and we have to do this roughly n times.

9
Q

Issues with SLC

A

Could end up with weird clusters that surround other clusters

10
Q

k-means clustering algorithm

A
  • pick k centers (at random)
  • each center “claims” its closest points
  • recompute the centers by averaging the clustered points
  • repeat until convergence
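
A compact NumPy sketch of the loop above (Lloyd's algorithm); the names, random initialization, and missing empty-cluster handling are illustrative assumptions:

```python
# A bare-bones k-means sketch, not the course's reference implementation.
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k centers at random from the data (they need not remain data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Each center "claims" its closest points.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centers by averaging the clustered points
        # (ignores the empty-cluster corner case for brevity).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged when recomputing the centers no longer moves them.
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers
```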
11
Q

When does k-means converge

A

When recomputing the centers no longer moves them (the partition stops changing)

12
Q

T/F the centers have to be a point in the collection of objects

A

False

13
Q

k-means in euclidean space process

A

cycle between partition assignment and recalculating the centers

14
Q

Which optimization algorithm is most like k-means? HC, GA, or SA

A

Hill climbing: you take a step towards a configuration that is better than the one you had before.

15
Q

K-means runs out of configurations and converges in finite time when

A

finite configurations, break ties consistently, and never go into a configuration with a higher error

16
Q

Big O for Each iteration of k-means clustering is

A

Polynomial (pretty fast). O(kn): compute the distance between each of the k centers and the n points, then run through all the points to redefine the clusters. There is an extra factor of d with d dimensions.

17
Q

Big O for Finite iterations of k-means

A

Exponential, O(k^n): the first of the n objects can be in any of the k clusters, the second can be in any of the k clusters, and so on. In practice, it is much faster.

18
Q

If ties are broken consistently in k-means then error

A

decreases [with one exception]

19
Q

T/F k-means cannot get stuck

A

False (local optima can still happen)

20
Q

How to avoid k-means getting stuck

A

random restart OR do initial analysis and pick centers in good spots

21
Q

Soft Clustering Overview

A

Points can belong probabilistically to one cluster or another

22
Q

Soft Clustering Idea

A

Assume the data was generated by
1. select one of k gaussians uniformly [fixed known variance]
2. sample x from Gaussian
3. repeat n times
Task: find a hypothesis that maximizes the prob of the data (ML)

23
Q

What is the maximum likelihood gaussian (soft clustering)

A

The ML mean of the Gaussian is the mean of the data

24
Q

Expectation Maximization

A

Has two phases: expectation (which is soft clustering) and maximization

25
How is expectation maximization like k-means
EM is like k-means if the cluster assignments use argmax (probabilities of 0 or 1). It improves a probabilistic metric the way k-means improves a squared-error metric.
26
Properties of Em
- Monotonically non-decreasing likelihood (it's not getting worse)
- Does not fully converge, because there are infinitely many configurations (infinitely many probability assignments); in practice it effectively does
- Will not diverge (numbers don't blow up and become infinitely large)
- Can get stuck (local optima)
- Works with any distribution (if the E and M steps are solvable)
27
Clustering Properties
- Richness
- Scale Invariance
- Consistency
28
Richness
For any assignment of objects to clusters, there is some distance matrix D such that your clustering algorithm returns that clustering.
29
Scale Invariance
scaling distances by a positive value does not change the clustering
30
Consistency
Shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering
31
Impossibility Theorem
No clustering algorithm can achieve all three of richness, scale invariance, and consistency
32
Curse of dimensionality
Data needed grows exponentially with features (2^n)
33
Big O of feature selection
NP-hard: there are 2^n subsets of n features, so exhaustive search is exponential
34
Two approaches to feature selection
filtering and wrapping
35
filtering overview
Takes the set of features as input and passes it to some search algorithm that outputs a smaller set of features
36
wrapping overview
Takes the set of features, searches over subsets of features, and hands each subset to the learning algorithm, which reports how well those features performed
37
Pros of filtering
speed: faster than wrapping because you don't have to worry about what the learner does. Ignores the learning problem
38
cons of filtering
Looks at features in isolation, so it can miss feature interactions.
39
Pros of wrapping
takes into account model bias and learning itself
40
con of wrapping
extremely slow compared to filtering
41
ways to do feature filtering
- information gain
- variance, entropy
- "useful features"
- independent / non-redundant
- ex: DT, NN (trim features with low weights)
42
ways to do feature wrapping
- hill climbing
- randomized optimization
- forward search
- backward search
43
forward search
Look at all features in isolation and keep whichever feature is best. Then look at adding any of the remaining features and choose whichever is best in combination with the first feature, and so on. This is similar to hill climbing.
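
A rough sketch of this greedy forward search, assuming a hypothetical `score` callback that stands in for whatever the wrapped learner reports (e.g., cross-validated accuracy):

```python
# Greedy forward feature selection sketch; `score` is an assumed callback.
def forward_search(all_features, score, target_size):
    selected = []
    while len(selected) < target_size:
        # Try adding each remaining feature and keep whichever helps most
        # in combination with the features chosen so far (like hill climbing).
        best = max((f for f in all_features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```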
44
backward search
Start with all the features and figure out which one you can eliminate, one at a time, until you get to a subset that performs well
45
Relevance: strongly relevant
xi is strongly relevant if removing it degrades the Bayes optimal classifier
46
Relevance: weakly relevant
xi is weakly relevant if it is not strongly relevant, and there exists some subset of features such that adding xi to that subset improves the Bayes optimal classifier
47
Relevance measures effect on
Bayes optimal classifier. INFORMATION
48
Usefulness measures effect on
a particular predictor. ERROR given the model/learner (e.g., a constant feature c is useful in neural nets)
49
feature transformation
the problem of pre-processing a set of features to create a new (smaller? more compact?) feature set, while retaining as much (relevant? useful?) information as possible
50
T/F Feature selection is a subset of feature transformation
True
51
feature transformation vs feature selection
Transformation: a linear combination of the original features, e.g., 2x1 + x2 becomes one new feature. Selection: have x1, x2, x3, x4 and take only x1 and x2.
52
ad hoc information retrieval problem (google problem)
A big database of documents, and you want to retrieve the subset of documents that is relevant to the query (the words are the features). "Ad hoc" because you don't know ahead of time what the queries are.
53
polysemy
Words can have multiple meanings, e.g., "car" is a vehicle and also a function in LISP; this results in false positives.
54
synonomy
Many words mean the same thing, e.g., "car" and "automobile"; this results in false negatives.
55
principal components analysis overview
Finds the direction that maximizes variance (the principal component), then further directions that are mutually orthogonal
56
principal component analysis properties
1. Maximizes variance (principal component)
2. Mutually orthogonal directions - a global algorithm
3. Provably best reconstruction (minimizes L2 error when projecting onto the new axes)
4. You can throw away the dimensions with the smallest eigenvalues
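
A small NumPy sketch of these properties via the covariance matrix's eigenproblem; the function and variable names are illustrative assumptions:

```python
# PCA sketch: eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # solve the (symmetric) eigenproblem
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue (variance)
    # Keep the top directions; the small-eigenvalue ones are thrown away.
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                   # project (rotate) onto the new basis
```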
57
best reconstruction in principal component analysis (PCA)
It is simply a re-labeling of the dimensions: the new directions just tell you how far along each axis a point is. PCA is a linear rotation of the original dimensions, so the full set of new dimensions loses no information and you can reconstruct your original data from it.
58
mutually orthogonal in PCA
This makes it a global algorithm: all of the directions / new features it finds share one big global constraint → they have to be mutually orthogonal.
59
What does PCA fundamentally ask
Is there another basis, which is a linear combination of the original basis, that best expresses our data set?
60
eigenproblem
a computational problem that can be solved by finding the eigenvalues and/or eigenvectors of a matrix. In PCA, we are analyzing the covariance matrix (see the paper for details)
61
PCA vs ICA overview
PCA: correlation; maximizes variance => best reconstruction.
ICA: independence; a linear transformation into a new feature space such that the new features are independent of each other and the mutual information between the original and new features is as high as possible.
62
Blind Source Separation Problem (cocktail)
It is difficult to listen to one conversation when many conversations are going on at the same time. The people (speakers) are the hidden variables and the microphones are the observables.
63
Independent Components Analysis (ICA) goal
figure out the hidden variables from observables
64
PCA vs ICA: mutually orthogonal, mutually independent, maximal variance, maximal mutual information, ordered features, bag of features
- mutually orthogonal: PCA
- mutually independent: ICA
- maximal variance: PCA
- maximal mutual information: ICA
- ordered features: PCA
- bag of features: ICA (and somewhat PCA)
65
Which is global and which is local: PCA vs ICA
PCA is global (on faces it finds brightness, then the average face). ICA finds local structure (on faces it finds noses and eyes; in the natural world it finds edges; in documents, topics).
66
RCA: Random Components Analysis
Generates random directions and projects the data onto them. Works very well if the next thing you do is classification.
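
A toy sketch of the idea, assuming a Gaussian random projection matrix (one common choice, not necessarily the lecture's):

```python
# Random components analysis sketch: project onto random directions.
import numpy as np

def random_projection(X, n_components, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_components))  # random directions
    return X @ R                                      # project the data onto them
```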
67
What is the big advantage of RCA
Fast (much faster than PCA or ICA). Other advantages: cheap, simple, easy.
68
LDA: Linear Discriminant Analysis
Find a projection that discriminates based on the label. Pays attention to how the resulting components are used "In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that maximize the separation between multiple classes."
69
ICA is important for finding
structure and edges
70
PCA vs ICA: which uses primarily Probability vs primarily Linear algebra
ICA: probability. PCA: linear algebra.
71
Markov Decision Process: Model
Transition model. T(s,a,s') is the probability of ending up in state s' when you are in state s and take action a. This function produces P(s'|s,a): the probability of a new state given your current state and the action you take.
72
Markov Decision Process: Actions (A)
The things that you can do in a state (up, down, left, right)
73
Markov Decision Process: Reward
Scalar Reward for being in a state, taking action, etc. (Red and green states in board example). Represents our Domain Knowledge
74
Markov Decision Process: Policy
A policy is a solution to a Markov decision process. It tells you what action to take from a particular state.
75
Markov: Infinite Horizons
We assume there is no clock ticking (no finite deadline). If there were, the policy would depend on both the state and the time remaining.
76
Markov: Utility of Sequences
if I prefer one sequence of events today, then I prefer the same sequence tomorrow
77
Markov: discounted
Discounted rewards let us "go an infinite distance in finite time": the discounted sum is geometric, so it is bounded by R_max / (1 - gamma).
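
The bound comes from the geometric series (assuming rewards are capped at R_max and 0 ≤ γ < 1):

```latex
\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1.
```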
78
Markov: Utility of a policy at a state
The expected (discounted) sum of rewards from that state onward if we follow that policy
79
Markov: utility vs reward
Utility: the reward for that state plus all the rewards from that point on (long term). Reward: the immediate reward (short term).
80
Bellman Equation
Key equation for RL. A recursive equation that defines the true value of being in some particular state including policy, rewards, transition matrix, gammas, etc. It is the utility of the state.
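
One common form of the equation, assuming rewards on states and discount γ (variants put the reward on (s, a) pairs instead):

```latex
U(s) \;=\; R(s) \;+\; \gamma \, \max_{a} \sum_{s'} T(s,a,s')\, U(s')
```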
81
Markov Decision: Value Iteration
start w/ arbitrary utilities, update utilities based on neighbors, repeat until convergence
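
A compact sketch of this loop, assuming `T[s][a]` is a list of (probability, next_state) pairs and `R[s]` is a scalar reward; these names are illustrative:

```python
# Value iteration sketch: repeated Bellman backups until the utilities settle.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}                     # start with arbitrary utilities
    while True:
        # Update each state's utility based on its neighbors (one Bellman backup).
        new_U = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in T[s][a]) for a in actions
            )
            for s in states
        }
        # Repeat until convergence (utilities stop changing).
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U
```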
82
Markov Decision: Policy Iteration
start with some policy (guess), evaluate its utility, improve policy by updating it to the policy that takes the action that maximizes the expected utility based on what we just calculated. Repeat
83
What is Q-Learning
Evaluating the Bellman equations from data
84
Markov Property (2 things)
- only the present matters
- things are stationary (the rules don't change)
85
Advantage of policies in Markov decisions vs "plans"
They are robust to the stochasticity of the world
86
Markov: Why use gamma^t in utility of sequences
Rewards are substantial at first and quickly trail off
87
RL: What is Q(s,a) function in plain english
the value for arriving in s, leaving via a, and proceeding optimally thereafter
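
The standard Q-learning update, which the learning-rate questions on the next two cards refer to (the hat marks the running estimate):

```latex
\hat{Q}(s,a) \;\leftarrow\; \hat{Q}(s,a) + \alpha_t \Big[ r + \gamma \max_{a'} \hat{Q}(s',a') - \hat{Q}(s,a) \Big]
```

With alpha = 1 the estimate jumps all the way to the new value; with alpha = 0 it never changes.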
88
RL: What happens if you set alpha (learning rate) of 1
Full learning: forget everything you learned and just jump to the new value v.
89
RL: What happens if you set alpha (learning rate) of 0
Won't learn at all: the old value is just assigned back to itself.
90
Q-learning family of algorithms varies on what 3 things (generally)
- how to initialize Q-hat
- how to decay alpha_t
- how to choose actions
91
How to choose actions in Q-learning
Could always choose the action with the best Q-hat (greedy), but that can get stuck in local minima; solve this with a simulated-annealing-like approach (take random actions sometimes). Could choose actions randomly, but then you are not using what you have learned about Q-hat so far: you learn the optimal policy but don't follow it.
92
Problem with random restarts in Q-learning
Very slow! It already takes a long time to visit everything infinitely often without restarts, and even longer with them.
93
Fundamental Result Theorem
In a 2-player, zero-sum, deterministic game of perfect information, minimax = maximin, and there always exists an optimal pure strategy for each player (assuming rational agents).
94
nash equilibrium
A set of strategies is a Nash equilibrium if and only if no player would change their strategy given all the other players' strategies
95
T/F Any N.E. will survive elimination of strictly dominated strategies
True
96
T/F There is always a N.E. (maybe mixed) for finite strategies and finite player games
True
97
T/F In the n player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
True
98
Tit-for-tat
First round = cooperate. All future rounds = copy the opponent's previous move. A strategy for infinitely repeated games.
99
Folk Theorem in Math
General idea: in repeated games, the possibility of retaliation opens the door for cooperation. In math, a "folk theorem" is a result known, at least to experts in the field, and considered to have established status, but not published in complete form.
100
game theory: feasible region
The set of average payoffs achievable by some joint strategy
101
Game Theory: Folk Theorem
Any feasible payoff profile that strictly dominates the minimax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor. Proof: if it strictly dominates the minimax profile, can use it as a threat. Better off doing what you are told
102
Grim Trigger
As long as you cooperate, you get mutual benefit. If you ever defect "deal out vengeance" forever
103
Implausible Threat
The vengeance could cost more than trying to cooperate so it's not realistic to always punish opponent
104
Game Theory: Subgame Perfect
Each player is always taking a best response independent of history
105
Game Theory: Pavlov
Cooperate if you agree, defect if you disagree
106
T/F is Pavlov sub game perfect?
Yes. No matter what state they are in, the average reward is mutual cooperation
107
Computational Folk Theorem
Can build a Pavlov-like machine for any game: construct a subgame perfect Nash equilibrium for any game in polynomial time. This works because either the game is Pavlov-like (we can get to mutual cooperation quickly), or it is zero-sum-like (solve an LP), or at most one player improves.
108
bimatrix game
2 players, each with its own reward matrix; here treated as an average-reward repeated game
109
what makes stochastic games more interesting
actions that the players take impact not just the rewards but also future states
110
How can you use bellman equation in zero sum stochastic games
Use minimax over the Q values instead of max
111
Properties of Q learning in zero sum stochastic games
- value iteration works
- minimax-Q converges under the same conditions that Q-learning does
- unique solution to Q*
- policies can be computed independently
- the update is efficient
- Q functions are sufficient to specify the policy
112
General Sum Stochastic Games
Can't do minimax anymore. Instead use Nash of Q.
113
What properties change for general sum stochastic games
- Value iteration doesn't work
- Nash-Q doesn't converge
- No unique solution to Q*
- Policies cannot be computed independently
- The update is not efficient (unless P = PPAD)
- Q functions are not sufficient to specify the policy
114
Expectation Maximization - soft clustering / expectation step
We know the means (assume the means) and then calculate how likely it is that the data came from the means. Knowing means and variances allows you to calculate which point came from which cluster. For each point, which cluster is it from?
115
Expectation Maximization - Maximization steps
We know the clusterings, so calculate the means and variances for these clusters.
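
A bare-bones sketch of both steps for a 1-D mixture of two Gaussians with fixed, known variance (the setup the lecture assumes); the names, shared sigma, and uniform prior are assumptions:

```python
# Tiny EM sketch for two 1-D Gaussians with fixed variance and uniform prior.
import numpy as np

def em_two_gaussians(x, iters=50, sigma=1.0):
    mu = np.array([x.min(), x.max()])                  # crude initial means
    for _ in range(iters):
        # E-step (soft clustering): how likely is each point under each mean?
        dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)  # P(cluster | point)
        # M-step: recompute each mean as the responsibility-weighted average.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu, resp

# Example: two clumps of points recover means near 0 and 5.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
print(em_two_gaussians(data)[0])
```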
116
monotonically non decreasing
The likelihood keeps increasing (or at least never decreases) from one iteration to the next
117
Why care about feature selection
1. Knowledge discovery: it is useful to understand which features are meaningful (interpretability and insight)
2. Curse of dimensionality
118
What algorithm looks at feature selection
Decision Trees (a type of filtering) and in particular, information gain
119
What algorithm did we learn in the first half of the course that does feature transformation
neural nets
120
Principal component analysis: why can you throw away the dimensions with small eigenvalues?
You project n features onto n new dimensions. Moving from the 1st principal component to the nth, the eigenvalues are monotonically non-increasing, so you can throw away the dimensions with the smallest eigenvalues = the ones with the least variance.
121
Con of PCA
If one of your original dimensions is directly relevant but there is Gaussian noise, PCA may end up throwing that original dimension away
122
PCA is like filtering or wrapping
filtering. It transforms into a new space where features can be filtered
123
T/F PCA's goal is to find independent projections
False. It may happen to find independent projections, but it is really finding uncorrelated dimensions.
124
How can ICA both maximize mutual information and require mutual independence?
It maximizes mutual information between the new features and the original features, while the new features are mutually independent of each other.
125
What feature transformation method does much better on the blind source separation problem and is directional?
ICA. Directional meaning it gives very different answers if you rotate the matrix that it is given.
126
RL: 3 different types of RL
Policy search: map states to actions through policies.
Value-function based: map states to values through utilities.
Model-based: map states and actions to new states and rewards through transitions and rewards.
127
RL: Q-Learning what is U(s) and pi(s)
U(s) = max over all actions a of Q(s,a)
pi(s) = argmax over all actions a of Q(s,a)
128
RL: What is Q-learning
Evaluating the Bellman equation from data: we observe sample transitions <s, a, r, s'> rather than knowing the model itself.
129
Q-learning converges if
- we visit every (s,a) pair infinitely often (so it needs to run a very long time)
- s' is drawn from the actual transition probabilities T(s,a,s')
- rewards are drawn from the reward function R(s)
130
minimax
Each player considers the worst-case counter-strategy. A is trying to maximize and B is trying to minimize, so A finds the maximum of the minimums and B finds the minimum of the maximums. There always exists an optimal pure strategy for each player.
131
Relaxing which constraint makes minimax fail?
Changing from perfect information to hidden information. Minimax works in 2-player zero-sum deterministic OR non-deterministic games of perfect information with pure strategies.
132
mixed strategy vs pure strategy
In a mixed strategy you choose among strategies with some probability (a distribution over strategies); a pure strategy always picks the same one.
133
Game Theory: finite state strategy. How to solve?
You can use MDPs! Only the last state matters, etc.
134
minmax profile
A pair of payoffs, one for each player, representing the payoff each achieves when defending itself against a malicious adversary
135
security level
The minmax profile when mixed strategies are allowed
136
q-learning does not work with general stochastic games, what could work?
- repeated stochastic games (folk theorem)
- cheap talk -> correlated equilibrium: players can talk a little bit
- cognitive hierarchy -> best response: rather than solving for equilibria, best-respond to what you believe the other players will do
- side payments (coco values): players can pay the opponent to help them