CS 7641 Final Flashcards

1
Q

Supervised Learning

A

Use labeled training data to generalize labels to new instances
(function approximation)

2
Q

unsupervised learning

A

making sense out of unlabeled data

data description

3
Q

Single Linkage Clustering

A
  • Consider each object a cluster (n objects)
  • Define intercluster distance as the distance between the two closest points in the two clusters
  • Merge two closest clusters
  • Repeat n-k times to make k clusters
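
A minimal Python sketch of the procedure above, assuming plain Euclidean points; the function name and structure are illustrative, not from the course:

```python
# A rough sketch of single linkage clustering on Euclidean points.
import itertools
import math

def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with each point in its own cluster (n clusters).
    clusters = [[p] for p in points]

    def dist(a, b):
        return math.dist(a, b)  # Euclidean distance between two points

    while len(clusters) > k:
        # Intercluster distance = distance between the two closest points,
        # one from each cluster.
        best = None
        for (i, ci), (j, cj) in itertools.combinations(enumerate(clusters), 2):
            d = min(dist(p, q) for p in ci for q in cj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters (n - k merges total).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Example: three obvious groups on a line collapse into k=3 clusters.
print(single_linkage([(0,), (1,), (10,), (11,), (20,), (21,)], k=3))
```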
4
Q

Intercluster Distance in Single Linkage Clustering

A

the distance between the 2 closest points in the 2 clusters

5
Q

Inter cluster distance represents

A

domain knowledge

6
Q

T/F Single Link Clustering is Deterministic

A

True. Unless there are ties, it will always give the same answer.

7
Q

In SLC, if distances can be represented by edge lengths of a graph, what algorithm is it the same as?

A

Minimum Spanning Tree

8
Q

What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters

A

O(n^3): O(n^2) to look at all pairwise distances to find the closest pair, and we have to do this roughly n times.

9
Q

Issues with SLC

A

Could end up with weird clusters that surround other clusters

10
Q

k-means clustering algorithm

A
  • pick k centers (at random)
  • each center “claims” its closest points
  • recompute the centers by averaging the clustered points
  • repeat until convergence
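
A compact NumPy sketch of the loop above (Lloyd's algorithm); the names, random initialization, and missing empty-cluster handling are illustrative assumptions:

```python
# A bare-bones k-means sketch, not the course's reference implementation.
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k centers at random from the data (they need not remain data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Each center "claims" its closest points.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centers by averaging the clustered points
        # (ignores the empty-cluster corner case for brevity).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged when recomputing the centers no longer moves them.
        if np.allclose(new_centers, centers):
            return labels, centers
        centers = new_centers
```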
11
Q

When does k-means converge

A

When recomputing the centers no longer moves them (the partition stops changing)

12
Q

T/F the centers have to be a point in the collection of objects

A

False

13
Q

k-means in euclidean space process

A

cycle between partition assignment and recalculating the centers

14
Q

Which optimization algorithm is most like k-means? HC, GA, or SA

A

Hill climbing: you take a step towards a configuration that is better than the one you had before.

15
Q

K-means runs out of configurations and converges in finite time when

A

finite configurations, break ties consistently, and never go into a configuration with a higher error

16
Q

Big O for Each iteration of k-means clustering is

A

Polynomial (pretty fast). O(kn): compute the distance between each of the k centers and the n points, then run through all the points to redefine the clusters. There is an extra factor of d with d dimensions.

17
Q

Big O for Finite iterations of k-means

A

Exponential, O(k^n): the first of the n objects can be in any of the k clusters, the second can be in any of the k clusters, and so on. In practice, it is much faster.

18
Q

If ties are broken consistently in k-means then error

A

decreases [with one exception]

19
Q

T/F k-means cannot get stuck

A

False (local optima can still happen)

20
Q

How to avoid k-means getting stuck

A

random restart OR do initial analysis and pick centers in good spots

21
Q

Soft Clustering Overview

A

Points can belong probabilistically to one cluster or another

22
Q

Soft Clustering Idea

A

Assume the data was generated by
1. select one of k gaussians uniformly [fixed known variance]
2. sample x from Gaussian
3. repeat n times
Task: find a hypothesis that maximizes the prob of the data (ML)

23
Q

What is the maximum likelihood gaussian (soft clustering)

A

The ML mean of the Gaussian is the mean of the data

24
Q

Expectation Maximization

A

Has two phases: expectation (which is soft clustering) and maximization

25
How is expectation maximization like k-means
EM is like k-means if the cluster assignments use argmax (probabilities of 0 or 1). It improves a probabilistic metric the way k-means improves a squared-error metric.
26
Properties of Em
- Monotonically non-decreasing likelihood (it's not getting worse)
- Does not fully converge, because there are infinitely many configurations (infinitely many probability assignments); in practice it effectively does
- Will not diverge (numbers don't blow up and become infinitely large)
- Can get stuck (local optima)
- Works with any distribution (if the E and M steps are solvable)
27
Clustering Properties
- Richness
- Scale Invariance
- Consistency
28
Richness
For any assignment of objects to clusters, there is some distance matrix D such that your clustering algorithm returns that clustering.
29
Scale Invariance
scaling distances by a positive value does not change the clustering
30
Consistency
Shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering
31
Impossibility Theorem
No clustering algorithm can achieve all three of richness, scale invariance, and consistency
32
Curse of dimensionality
Data needed grows exponentially with features (2^n)
33
Big O of feature selection
NP-hard: there are 2^n subsets of n features, so exhaustive search is exponential
34
Two approaches to feature selection
filtering and wrapping
35
filtering overview
Takes the set of features as input and passes it to some search algorithm that outputs a smaller set of features
36
wrapping overview
Takes the set of features, searches over subsets of features, and hands each subset to the learning algorithm, which reports how well those features performed
37
Pros of filtering
speed: faster than wrapping because you don't have to worry about what the learner does. Ignores the learning problem
38
cons of filtering
Looks at features in isolation, so it can miss feature interactions.
39
Pros of wrapping
takes into account model bias and learning itself
40
con of wrapping
extremely slow compared to filtering
41
ways to do feature filtering
- information gain
- variance, entropy
- "useful features"
- independent / non-redundant
- ex: DT, NN (trim features with low weights)
42
ways to do feature wrapping
- hill climbing
- randomized optimization
- forward search
- backward search
43
forward search
Look at all features in isolation and keep whichever feature is best. Then look at adding any of the remaining features and choose whichever is best in combination with the first feature, and so on. This is similar to hill climbing.
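
A rough sketch of this greedy forward search, assuming a hypothetical `score` callback that stands in for whatever the wrapped learner reports (e.g., cross-validated accuracy):

```python
# Greedy forward feature selection sketch; `score` is an assumed callback.
def forward_search(all_features, score, target_size):
    selected = []
    while len(selected) < target_size:
        # Try adding each remaining feature and keep whichever helps most
        # in combination with the features chosen so far (like hill climbing).
        best = max((f for f in all_features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```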
44
backward search
Start with all the features and figure out which one you can eliminate, one at a time, until you get to a subset that performs well
45
Relevance: strongly relevant
xi is strongly relevant if removing it degrades the Bayes optimal classifier
46
Relevance: weakly relevant
xi is weakly relevant if it is not strongly relevant, and there exists some subset of features such that adding xi to that subset improves the Bayes optimal classifier
47
Relevance measures effect on
Bayes optimal classifier. INFORMATION
48
Usefulness measures effect on
a particular predictor. ERROR given the model/learner (e.g., a constant feature c is useful in neural nets)
49
feature transformation
the problem of pre-processing a set of features to create a new (smaller? more compact?) feature set, while retaining as much (relevant? useful?) information as possible
50
T/F Feature selection is a subset of feature transformation
True
51
feature transformation vs feature selection
Transformation: a linear combination of the original features, e.g., 2x1 + x2 becomes one new feature. Selection: have x1, x2, x3, x4 and take only x1 and x2.
52
ad hoc information retrieval problem (google problem)
A big database of documents, and you want to retrieve the subset of documents that is relevant to the query (the words are the features). "Ad hoc" because you don't know ahead of time what the queries are.
53
polysemy
Words can have multiple meanings, e.g., "car" is a vehicle and also a function in LISP; this results in false positives.
54
synonomy
Many words mean the same thing, e.g., "car" and "automobile"; this results in false negatives.
55
principal components analysis overview
Finds the direction that maximizes variance (the principal component), then further directions that are mutually orthogonal
56
principal component analysis properties
1. Maximizes variance (principal component)
2. Mutually orthogonal directions - a global algorithm
3. Provably best reconstruction (minimizes L2 error when projecting onto the new axes)
4. You can throw away the dimensions with the smallest eigenvalues
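
A small NumPy sketch of these properties via the covariance matrix's eigenproblem; the function and variable names are illustrative assumptions:

```python
# PCA sketch: eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)   # solve the (symmetric) eigenproblem
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue (variance)
    # Keep the top directions; the small-eigenvalue ones are thrown away.
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                   # project (rotate) onto the new basis
```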
57
best reconstruction in principal component analysis (PCA)
It is simply a re-labeling of the dimensions: the new directions just tell you how far along each axis a point is. PCA is a linear rotation of the original dimensions, so the full set of new dimensions loses no information and you can reconstruct your original data from it.
58
mutually orthogonal in PCA
This makes it a global algorithm: all of the directions / new features it finds share one big global constraint → they have to be mutually orthogonal.
59
What does PCA fundamentally ask
Is there another basis, which is a linear combination of the original basis, that best expresses our data set?
60
eigenproblem
a computational problem that can be solved by finding the eigenvalues and/or eigenvectors of a matrix. In PCA, we are analyzing the covariance matrix (see the paper for details)
61
PCA vs ICA overview
PCA: correlation; maximizes variance => best reconstruction.
ICA: independence; a linear transformation into a new feature space such that the new features are independent of each other and the mutual information between the original and new features is as high as possible.
62
Blind Source Separation Problem (cocktail)
It is difficult to listen to one conversation when many conversations are going on at the same time. The people (speakers) are the hidden variables and the microphones are the observables.
63
Independent Components Analysis (ICA) goal
figure out the hidden variables from observables
64
PCA vs ICA: mutually orthogonal, mutually independent, maximal variance, maximal mutual information, ordered features, bag of features
- mutually orthogonal: PCA
- mutually independent: ICA
- maximal variance: PCA
- maximal mutual information: ICA
- ordered features: PCA
- bag of features: ICA (and somewhat PCA)
65
Which is global and which is local: PCA vs ICA
PCA is global (on faces it finds brightness, then the average face). ICA finds local structure (on faces it finds noses and eyes; in the natural world it finds edges; in documents, topics).
66
RCA: Random Components Analysis
Generates random directions and projects the data onto them. Works very well if the next thing you do is classification.
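
A toy sketch of the idea, assuming a Gaussian random projection matrix (one common choice, not necessarily the lecture's):

```python
# Random components analysis sketch: project onto random directions.
import numpy as np

def random_projection(X, n_components, seed=0):
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_components))  # random directions
    return X @ R                                      # project the data onto them
```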
67
What is the big advantage of RCA
Fast (much faster than PCA or ICA). Other advantages: cheap, simple, easy.
68
LDA: Linear Discriminant Analysis
Find a projection that discriminates based on the label. Pays attention to how the resulting components are used "In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that maximize the separation between multiple classes."
69
ICA is important for finding
structure and edges
70
PCA vs ICA: which uses primarily Probability vs primarily Linear algebra
ICA: probability. PCA: linear algebra.
71
Markov Decision Process: Model
Transition model. T(s,a,s') is the probability of ending up in state s' when you are in state s and take action a. This function produces P(s'|s,a): the probability of a new state given your current state and the action you take.
72
Markov Decision Process: Actions (A)
The things that you can do in a state (up, down, left, right)
73
Markov Decision Process: Reward
Scalar Reward for being in a state, taking action, etc. (Red and green states in board example). Represents our Domain Knowledge
74
Markov Decision Process: Policy
A policy is a solution to a Markov decision process. It tells you what action to take from a particular state.
75
Markov: Infinite Horizons
We assume there is no clock ticking (no finite deadline). If there were, the policy would depend on both the state and the time remaining.
76
Markov: Utility of Sequences
if I prefer one sequence of events today, then I prefer the same sequence tomorrow
77
Markov: discounted
Discounted rewards let us "go an infinite distance in finite time": the discounted sum is geometric, so it is bounded by R_max / (1 - gamma).
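
The bound comes from the geometric series (assuming rewards are capped at R_max and 0 ≤ γ < 1):

```latex
\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1.
```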
78
Markov: Utility of a policy at a state
The expected (discounted) sum of rewards from that state onward if we follow that policy
79
Markov: utility vs reward
Utility: the reward for that state plus all the rewards from that point on (long term). Reward: the immediate reward (short term).
80
Bellman Equation
Key equation for RL. A recursive equation that defines the true value of being in some particular state including policy, rewards, transition matrix, gammas, etc. It is the utility of the state.
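
One common form of the equation, assuming rewards on states and discount γ (variants put the reward on (s, a) pairs instead):

```latex
U(s) \;=\; R(s) \;+\; \gamma \, \max_{a} \sum_{s'} T(s,a,s')\, U(s')
```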
81
Markov Decision: Value Iteration
start w/ arbitrary utilities, update utilities based on neighbors, repeat until convergence
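
A compact sketch of this loop, assuming `T[s][a]` is a list of (probability, next_state) pairs and `R[s]` is a scalar reward; these names are illustrative:

```python
# Value iteration sketch: repeated Bellman backups until the utilities settle.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    U = {s: 0.0 for s in states}                     # start with arbitrary utilities
    while True:
        # Update each state's utility based on its neighbors (one Bellman backup).
        new_U = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in T[s][a]) for a in actions
            )
            for s in states
        }
        # Repeat until convergence (utilities stop changing).
        if max(abs(new_U[s] - U[s]) for s in states) < tol:
            return new_U
        U = new_U
```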
82
Markov Decision: Policy Iteration
start with some policy (guess), evaluate its utility, improve policy by updating it to the policy that takes the action that maximizes the expected utility based on what we just calculated. Repeat
83
What is Q-Learning
Evaluating the Bellman equations from data
84
Markov Property (2 things)
- only the present matters
- things are stationary (the rules don't change)
85
Advantage of policies in Markov decisions vs "plans"
They are robust to the stochasticity of the world
86
Markov: Why use gamma^t in utility of sequences
Rewards are substantial at first and quickly trail off
87
RL: What is Q(s,a) function in plain english
the value for arriving in s, leaving via a, and proceeding optimally thereafter
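
The standard Q-learning update, which the learning-rate questions on the next two cards refer to (the hat marks the running estimate):

```latex
\hat{Q}(s,a) \;\leftarrow\; \hat{Q}(s,a) + \alpha_t \Big[ r + \gamma \max_{a'} \hat{Q}(s',a') - \hat{Q}(s,a) \Big]
```

With alpha = 1 the estimate jumps all the way to the new value; with alpha = 0 it never changes.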
88
RL: What happens if you set alpha (learning rate) of 1
Full learning: forget everything you learned and just jump to the new value v.
89
RL: What happens if you set alpha (learning rate) of 0
Won't learn at all: the old value is just assigned back to itself.
90
Q-learning family of algorithms varies on what 3 things (generally)
- how to initialize Q-hat
- how to decay alpha_t
- how to choose actions
91
How to choose actions in Q-learning
Could always choose the action with the best Q-hat (greedy), but that can get stuck in local minima; solve this with a simulated-annealing-like approach (take random actions sometimes). Could choose actions randomly, but then you are not using what you have learned about Q-hat so far: you learn the optimal policy but don't follow it.
92
Problem with random restarts in Q-learning
Very slow! It already takes a long time to visit everything infinitely often without restarts, and even longer with them.
93
Fundamental Result Theorem
In a 2-player, zero-sum, deterministic game of perfect information, minimax = maximin, and there always exists an optimal pure strategy for each player (assuming rational agents).
94
nash equilibrium
A set of strategies is a Nash equilibrium if and only if no player would change their strategy given all the other players' strategies
95
T/F Any N.E. will survive elimination of strictly dominated strategies
True
96
T/F There is always a N.E. (maybe mixed) for finite strategies and finite player games
True
97
T/F In the n player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique N.E.
True
98
Tit-for-tat
First round = cooperate. All future rounds = copy the opponent's previous move. A strategy for infinitely repeated games.
99
Folk Theorem in Math
General idea: in repeated games, the possibility of retaliation opens the door for cooperation. In math, a "folk theorem" is a result known, at least to experts in the field, and considered to have established status, but not published in complete form.
100
game theory: feasible region
The set of average payoffs achievable by some joint strategy
101
Game Theory: Folk Theorem
Any feasible payoff profile that strictly dominates the minimax/security profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor. Proof: if it strictly dominates the minimax profile, can use it as a threat. Better off doing what you are told
102
Grim Trigger
As long as you cooperate, you get mutual benefit. If you ever defect "deal out vengeance" forever
103
Implausible Threat
The vengeance could cost more than trying to cooperate so it's not realistic to always punish opponent
104
Game Theory: Subgame Perfect
Each player is always taking a best response independent of history
105
Game Theory: Pavlov
Cooperate if you agree, defect if you disagree
106
T/F is Pavlov sub game perfect?
Yes. No matter what state they are in, the average reward is mutual cooperation
107
Computational Folk Theorem
Can build a Pavlov-like machine for any game: construct a subgame perfect Nash equilibrium for any game in polynomial time. This works because either the game is Pavlov-like (we can get to mutual cooperation quickly), or it is zero-sum-like (solve an LP), or at most one player improves.
108
bimatrix game
2 players, each with its own reward matrix; here treated as an average-reward repeated game
109
what makes stochastic games more interesting
actions that the players take impact not just the rewards but also future states
110
How can you use bellman equation in zero sum stochastic games
Use minimax over the Q values instead of max
111
Properties of Q learning in zero sum stochastic games
- value iteration works
- minimax-Q converges under the same conditions that Q-learning does
- unique solution to Q*
- policies can be computed independently
- the update is efficient
- Q functions are sufficient to specify the policy
112
General Sum Stochastic Games
Can't do minimax anymore. Instead use Nash of Q.
113
What properties change for general sum stochastic games
- Value iteration doesn't work
- Nash-Q doesn't converge
- No unique solution to Q*
- Policies cannot be computed independently
- The update is not efficient (unless P = PPAD)
- Q functions are not sufficient to specify the policy
114
Expectation Maximization - soft clustering / expectation step
We know the means (assume the means) and then calculate how likely it is that the data came from the means. Knowing means and variances allows you to calculate which point came from which cluster. For each point, which cluster is it from?
115
Expectation Maximization - Maximization steps
We know the clusterings, so calculate the means and variances for these clusters.
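
A bare-bones sketch of both steps for a 1-D mixture of two Gaussians with fixed, known variance (the setup the lecture assumes); the names, shared sigma, and uniform prior are assumptions:

```python
# Tiny EM sketch for two 1-D Gaussians with fixed variance and uniform prior.
import numpy as np

def em_two_gaussians(x, iters=50, sigma=1.0):
    mu = np.array([x.min(), x.max()])                  # crude initial means
    for _ in range(iters):
        # E-step (soft clustering): how likely is each point under each mean?
        dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)  # P(cluster | point)
        # M-step: recompute each mean as the responsibility-weighted average.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu, resp

# Example: two clumps of points recover means near 0 and 5.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
print(em_two_gaussians(data)[0])
```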
116
monotonically non decreasing
The likelihood keeps increasing (or at least never decreases) from one iteration to the next
117
Why care about feature selection
1. Knowledge discovery: it is useful to understand which features are meaningful (interpretability and insight)
2. Curse of dimensionality
118
What algorithm looks at feature selection
Decision Trees (a type of filtering) and in particular, information gain
119
What algorithm did we learn in the first half of the course that does feature transformation
neural nets
120
Principal component analysis: why can you throw away the dimensions with small eigenvalues?
You project n features onto n new dimensions. Moving from the 1st principal component to the nth, the eigenvalues are monotonically non-increasing, so you can throw away the dimensions with the smallest eigenvalues = the ones with the least variance.
121
Con of PCA
If one of your original dimensions is directly relevant but there is Gaussian noise, PCA may end up throwing that original dimension away
122
PCA is like filtering or wrapping
filtering. It transforms into a new space where features can be filtered
123
T/F PCA's goal is to find independent projections
False. It may happen to find independent projections, but it is really finding uncorrelated dimensions.
124
How can ICA both maximize mutual information and require mutual independence?
It maximizes mutual information between the new features and the original features, while the new features are mutually independent of each other.
125
What feature transformation method does much better on the blind source separation problem and is directional?
ICA. Directional meaning it gives very different answers if you rotate the matrix that it is given.
126
RL: 3 different types of RL
Policy search: map states to actions through policies.
Value-function based: map states to values through utilities.
Model-based: map states and actions to new states and rewards through transitions and rewards.
127
RL: Q-Learning what is U(s) and pi(s)
U(s) = max over all actions a of Q(s,a)
pi(s) = argmax over all actions a of Q(s,a)
128
RL: What is Q-learning
Evaluating the Bellman equation from data: we observe sample transitions <s, a, r, s'> rather than knowing the model itself.
129
Q-learning converges if
- we visit every (s,a) pair infinitely often (so it needs to run a very long time)
- s' is drawn from the actual transition probabilities T(s,a,s')
- rewards are drawn from the reward function R(s)
130
minimax
Each player considers the worst-case counter-strategy. A is trying to maximize and B is trying to minimize, so A finds the maximum of the minimums and B finds the minimum of the maximums. There always exists an optimal pure strategy for each player.
131
Relaxing which constraint makes minimax fail?
Changing from perfect information to hidden information. Minimax works in 2-player zero-sum deterministic OR non-deterministic games of perfect information with pure strategies.
132
mixed strategy vs pure strategy
In a mixed strategy you choose among strategies with some probability (a distribution over strategies); a pure strategy always picks the same one.
133
Game Theory: finite state strategy. How to solve?
You can use MDPs! Only the last state matters, etc.
134
minmax profile
A pair of payoffs, one for each player, representing the payoff each achieves when defending itself against a malicious adversary
135
security level
The minmax profile when mixed strategies are allowed
136
q-learning does not work with general stochastic games, what could work?
- repeated stochastic games (folk theorem)
- cheap talk -> correlated equilibrium: players can talk a little bit
- cognitive hierarchy -> best response: rather than solving for equilibria, best-respond to what you believe the other players will do
- side payments (coco values): players can pay the opponent to help them