Final Prep Flashcards

Question 1

Q

How does single linkage clustering work? (13.4)

Answer

A

Consider each object (sample) a cluster (n objects)
define intercluster distance as distance b/t the closest two points in clusters with different labels
merge two closest clusters
repeat n-k times to make k clusters

If two points are in the same cluster, the intercluster distance is 0.

You end up creating a tree structure with the root being everything in a single cluster.

How you define distance is one way of inserting domain knowledge.

This typically leads to chain like structures.

Question 2

Q

When might median be a good distance metric in single linkage clustering? (13.4)

Answer

A

When the value of the distance itself doesn’t really matter, but the order is what is most important.

Question 3

Q

What is the a metric vs non metric statistic? (13.4)

Answer

A

Metric: the specific values matter a lot (average, the details of the number matter)
Non-Metric: the specific values don’t matter, but rank order does (eg median)

Question 4

Q

What’s the running time of Single Linkage Clustering? (13.5)

Answer

A

O(n^3) because:

for k times:
find distance b/t each point and next closest point which is O(n^2) and do that n times.

Question 5

Q

How does K-Means clustering work? (13.7)

Answer

A

Pick K rand points to be the centers of the k clusters
- Each center “claims” its closest points
- Recompute centers by averaging clustered points (new center doesn’t have to be a point!)
- Repeat until clusters no longer change

Question 6

Q

Why does K-means clustering converge? (13.11)

Answer

A

The error is monotonically non-increasing.

When updating the partition:
•P(x) = argmin_i || x - center_i ||^2
•You’re finding the partition that is closest to x
• You only move a point from from cluster to another if that distance (error) gets smaller. So the error can only every be 0 or decreasing

When updating the center of each partition (without changing the partition):
•center_i = sum(y) / |C_i|
•Average minimizes the squared error (take derivative of E and set it equal to 0)

For this to work you must have some way of breaking ties that is deterministic and consistent.

You will never revisit a configuration you’ve already looked at.

Question 7

Q

What randomized optimization technique is most like K-means clustering? (13.10/11)

Answer

A

Hill Climbing - each time you set P(x) for all points and then readjust the center, you are always either decreasing the error or the error doesn’t change. So always improvement or no improvement, but never de-provement.

Question 8

Q

What are the properties of k-means clustering? (13.12)

Answer

A

Each iteration is polynomial: O(k*n) distance between the k centers and points (w/ extra factor of d if d dimensional points)
Finite (exponential) iterations: O(k^n)
Error decreases if ties are broken consistently
- Exception is that if it doesn’t improve, the next time you would see if has converged (double check that understanding?)

Question 9

Q

Can k-means clustering get stuck and if so how could you avoid it? (13.12)

Answer

A

Yes it can get stuck!

a(b) (c)(d)
e f

if (x) are the cluster centers, then (b) is going to claim a, b, e, and f and c and d will just stay where they are.

Can be avoided via random restarts or selecting points on the fringe or making sure the cluster centers are as far away form each other as possible.

Question 10

Q

What is soft clustering and how does it work? (13.14)

Answer

A

This is similar to bayesian learning where we are trying to find a hypothesis that maximizes the probability of the data.

Assume data was generated as follows:
• Select one of K gaussians [ with fixed variance] uniformly
• Sample x_i from that gaussian
• Repeat n times

Goal: Find a hypothesis h = that maximizes probability of the data (Max. likelihood)

This means that each data point could probabilistically belong to more than one cluster.

Question 11

Q

What is the ML mean of the gaussian if there is just one K? (13.15)

Answer

A

If you know all the data is coming from the same gaussian, then the ML mean is just the mean of the data.

Question 12

Q

What is expectation maximization (EM)? (13.16/17)

Answer

A

Moving back and forth between soft clustering and computing the means for the soft clustering.

This is conceptually much like k-means, but you’re assigning the probability of being in each cluster to each given datapoint.

Question 13

Q

What are the properties of EM? (13.18)

Answer

A

Each iteration of EM is a monotonically non-decreasing likelihood (finding higher and higher likelihoods)
Does not have to converge, but it will not diverge either
Can get stuck (happens ALL THE TIME in practice; many local optima. Solution: random restarts)
Works with any distribution (if E, M are both solvable)
usually the E step is the harder part and M is just basic math

Question 14

Q

What are the desirable properties of clustering algorithms? (13.19)

Answer

A

Richness: For any assignment of objects to clusters, there is some distance matrix D that P_d returns that clustering. You can’t force it to have a specific number of clusters. can represent any number of clusters. (13.20).
Scale-Invariance: Scaling distances by a positive value does not change the clustering (going from miles to kilometers may change the values, but it doesn’t change the clustering)
Consistency: Shrinking intra-cluster distances (dist b/t samples in same cluster) and expanding inter-cluster distances (dist b/t the clusters) does not change the clustering, P_d = P_d’

Question 15

Q

Which clustering scheme can achieve all the desirable properties of k means clustering? (13.21)

Answer

A

NONE! Impossibility theorem (Kleinberg) says no clustering scheme can achieve all three:
• richness
• scale invariance
• consistency

Question 16

Q

Why do we bother with feature selection? (14.2)

Answer

A

Curse of dimensionality

* Allows you to focus on the features that matter and ignore those that don’t

Question 17

Q

What is filtering vs wrapping in feature selection? (14.4)

Answer

A

Filtering:
N Features -> Search Algorithm -> Fewer Features
• The criterion of how you know if you’re doing well is buried within the search algorithm
• Can use the labels in your filter

Wrapping:
N Features -> [ Search Learning Alg.] -> Fewer Features
• In this case you pass the search results to learning algorithm, it reports back how well it does with those features and this goes back and forth until you come to a solution

Question 18

Q

What are the pros and cons of filtering and wrapping? (14.5)

Answer

A

Filtering:
+ Speed!
- Speed … you’re really looking at isolated features and aren’t considering feature dependence
- ignores the learning problem

Wrapping:
+ take into account model bias and learning!
- But it’s really really slow

Question 19

Q

What are some ways in which you could do filtering? (14.6)

Answer

A

Information gain
variance entropy
‘useful’ features
independent/non-redundant features

The goal is the make the learning problem easier by reducing the number of features which reduces the number of samples necessary, up until the point where it makes the learning problem harder because you just don’t have enough differentiating features.

Question 20

Q

What are some ways in which you could do wrapping? (14.6/7)

Answer

A

Any randomized optimization algorithm
Forward Search (a form of hill climbing)
Backward Search (a form of hill climbing)

Question 21

Q

How does forward search work? (14.7)

Answer

A

Go through each feature one at a time and pass it to the learning algorithm and get the score for each
Keep the feature with the highest score
Now combine that one feature with every other again, passing it to the learner and getting back a score
Again keep the best combination
Repeat this until adding a feature doesn’t produce significant gains to the learner

Question 22

Q

How does backward search work? (14.7)

Answer

A

Say you start with 5 (N) features: 1, 2, 3, 4, 5. In this case you’re trying to eliminate one feature at a time.
• Pass all combinations of N-1 to the learner and get result:
• 1, 2, 3, 4
• 1, 2, 4, 5
•1, 3, 4, 5
• 2, 3, 4, 5
• Keep the best combination and then repeat until your score peaks and then goes down

This is like eliminating people during tryouts.

Question 23

Q

When is a feature strongly relevant? (14.9)

Answer

A

If removing that feature degrades the Bayes Optimal Classifier, then it is strongly relevant.

That is to say, if you are considering ALL features and you remove this one feature and the BOC degrades, it is strongly relevant.

You can make a feature not strong just by putting a copy in there. (14.11)

Question 24

Q

When is a feature weakly relevant? (14.9)

Answer

A

If it is NOT strongly relevant AND adding it to some other subset of features would improve the Bayes Optimal Classifier.

So if you have a subset of features and when you add this feature it improve the BOC, it is weakly relevant.

If a feature is not strongly relevant and it is not weakly relevant then it is irrelevant.

Question 25

Q

What does feature relevance do? (14.10)

Answer

A

It measures the effect on the Bayes Optimal Classifier (increase or decrease). Relevance is about how much information a feature provides.

Example:
If a binary feature C is 1 for all samples, it has 0 entropy and is irrelevant.

Question 26

Q

What is feature usefulness? (14.10)

Answer

A

Usefulness represents the effect on the error of a model or learning algorithm. This is what we ultimately care about.

Question 27

Q

Revisit 14.8 - Quiz about minimum features required.

Answer

A

Make sure to understand what’s going on here.

Question 28

Q

What is the Bayes Optimal Classifier? (TBD) - Probably in the bayesian lectures.

Answer

A

Look this up.

Computes the best label given all probabilities you could computer over all the hypotheses. Any other algorithm has an inductive bias.

The gold standard - the best you could do if you had everything and all the time in the world.

Relevance is usefulness to the

Question 29

Q

Why is the Bayes Optimal Classifier used to measure relevance? (14.10)

Answer

A

I think this is teased out in this video, but needs to be summarized better.

Computes the best label given all probabilities you could computer over all the hypotheses. Any other algorithm has an inductive bias.

Question 30

Q

What is feature transformation? (15.1)

Answer

A

Pre-processing a set of features to create a new, smaller (usually) set while retaining as much information as possible (useful, relevant information).

Example, starting with features x1, x2, x3, x4:

Feature Selection => x2, x3
Feature Transformation => 2*x1 + x3

Question 31

Q

What is Principal Components Analysis (PCA)? (15.5)

Answer

A

Finds the direction of maximal variance in the data and finds directions that are mutually orthogonal.

So it first finds the direction of max variance (1st or principal component), then finds the line orthogonal (2nd or 2nd principal component).

It’s like doing a transformation into a new space where feature selection can work (15.7). An eigenvalue of 0 for a particular dimension would mean there is no variance and thus gives you no information (irrelevant, but doesn’t mean its useless).

About finding correlation by maximizing variance and ability to do reconstruction (15.8)

Question 32

Q

What are the properties of PCA? (15.6)

Answer

A

It gives you the principal direction of maximum variance
It gives you orthogonal directions beyond that
It’s a global algorithm - You’re working across the entire, global space of the problem domain
Gives you the best reconstruction
The eigenvalues of successive principal are monotonically non-increasing, which means they tend to decrease. So each successive principal essentially has less variance, and you can throw away the ones with the last variance.
Can look at eigenvalue to see which dimensions are important

Question 33

Q

What is reconstruction? (15.6)

Answer

A

If you were to perform PCA and get the principal component and project all the samples only onto that one dimension and then try to re-project onto the original dimension space, it would give you the smallest L2 (squared error) error of any other projection.

By maximizing variance, you’re maintaining distances as best you can for any given dimension.

Question 34

Q

What is Independent Components Analysis (ICA)? (15.8)

Answer

A

The goal of ICA is create a new set of features from the originals where each new feature is statistically independent from every other new feature, that is mutual information between all new feature pairs is 0 (knowing the value of one tells you nothing about the value of any other), and that it maximizes the mutual information between the new features and the original features.

Question 35

Q

Describe hidden variables and observables in the context of ICA (15.9/10)

Answer

A

The hidden variables are the sources - they are what you are trying to recover. Thy represent the new, statistically independent features.

The observables are the original features. They represent some linear combination of the hidden variables.

Question 36

Q

Describe the cocktail party problem in the context of ICA

(Blind source separation problem) (15.10)

Answer

A

There are N people in a room and N microphones. Each person represents a hidden variable while each microphone is a linear combination of each of the people talking.

So the people are the hidden variables and the microphones are the features.

The people are statistically ind ependent because they are producing their own sound waves which are independent of all other people’s sound waves.

Question 37

Q

What is the difference between feature transformation and feature selection?

Answer

A

Feature selection is a subset of feature transformation. Feature transformation has a goal of trying to reduce the number of features by creating new features (x1, x2, x3 => x1 + 3*x2) while feature selection is reducing the number of features by choosing amongst those which have the most influence on the learning output.

Question 38

Q

What are the differences between PCA and ICA? (15.12/13)

Answer

A

•Big One: ICA is local and does selects small features (nose, eyes, mouth, edge detection, etc) while PCA is global and would see things like ‘brightness’ of an image, or the ‘average’ of an image.
• ICA is highly directional while PCA is not (you could feed PCA the input matrix in any orientation and because it’s already providing something directional, it would just be a different transformation and you end up with the same results).
•ICA can help you find the underlying structure of the data
* ICA is good at the blind source separation problem while PCA is terrible at it. PCA doesn’t perform
•PCA has ordered features while ICA is more like a bag of features
•PCA’s features are mutually orthogonal while ICA’s are not
•ICA’s features are mutually independent while PCA’s are not
•PCA maximizes variances while ICA maximizes mutual information (b/t original and new features)
• ICA is focused on probability while PCA is usually solved by linear algebra

Question 39

Q

What are alternatives to PCA and ICA?

Answer

A

RCA - random component analysis. Generates random direction. It is fast and tends to perform pretty well. The number of final dimensions is usually higher than that of PCA and ICA. Tends to produce correlations.
LDA - linear discriminant analysis. Finds a projection that discriminates based on the label, so it at least uses the values of the labels as information.

Question 40

Q

Describe hierarchical complete link clustering

Answer

A

The measure of distance between clusters is the distance between the farthest two points in each cluster.
You will still merge clusters based on the shortest distance, but you’re using a max to define that distance.
Tends to produce spherical clusters with consistent diameter

https://www.youtube.com/watch?v=VMyXc3SiEqs

Question 41

Q

What is a zero sum game? (19.3)

Answer

A

The sum of the two players rewards is constant. That constant may or may not be zero

constant is 0:
player 1 = + 5
player 2 = - 5

constant is 3:
player 1 = +3
player 2 = 0

player 1 = +1
player 2 = +2

etc…

Question 42

Q

What does it mean to have perfect vs hidden information? (19.3)

Answer

A

Perfect information means you know what state the game is in (and thus know the state of everything in the game)

Question 43

Q

What is the minimax strategy? (19.6)

Answer

A

One player is trying find the maximim, minimum (What’s best for itself) and the other player is trying to find the minimum, maximum (what is also best for itself, in a zero sum game)

Question 44

Q

What is the “Fundamental Result” for a two player zero-sum non/deterministic game of perfect information? (19.7)

Answer

A

Minimax == Maximin and there always exists an optimal, pure strategy for each player.

This assumes each player is going to play the same way, eg, maximize their own rewards, or they are being ‘rational’.

Can be deterministic or non deterministic.

Question 45

Q

What is a pure vs a mixed strategy? (19.13)

Answer

A

Mixed strategies have some distribution over strategies, as opposed to choosing a fixed strategy.

Ex: 25% holder, 75% resigner

Question 46

Q

How do you find the solution (expected value) to a mixed strategy? (19.15)

Answer

A

You can plot the lines and then take the maximum of the minimum. This will either be located at the beginning, end, or where they cross.

Question 47

Q

What is the Nash Equilibrium? (19.19)

Answer

A

A Nash Equilibrium exists if neither player, if chosen at random, would switch their strategy, despite the other play keeping their strategy.

In other words, if no one would want to change their strategy given the choice to do so, you’re in a NE.

This works for both mixed and pure strategies.

Question 48

Q

What is strict domination? (19.20)

Answer

A

If you would always choose one strategy over another, no matter what, that chosen strategy strictly dominates the others.

Question 49

Q

What are the rules of determining Nash Equilibrium? (19.21)

Answer

A

For N player pure strategy game, if eliminating all strictly dominated strategies leaves you with one, then it is the NE
A NE will survive elimination of strictly dominated strategies
if there is a finite number of players and a finite number of strategies then there exists at least one NE that may involve mixed strategies.

Question 50

Q

Will the number of games change the outcome for a minimax strategy? (19.23)

Answer

A

No … If you work it backwards, the last game is determined by the matrix, and thus each previous game is also just determined by the matrix. So it doesn’t matter how many repeated games you play, the last one is the only one that matters and it’s defined by the same matrix.

Question 51

Q

Does it matter which player starts?

Answer

A

No - you come to the same solution regardless of who starts.

Question 52

Q

What is the tit for tat strategy for iterated prisoners dilemma? (20.4)

Answer

A

You cooperate in the first round, then copy your opponents previous action every round thereafter.

Question 53

Q

What are the best strategies to play when facing tit for tat? (20.6)

Answer

A

For a low gamma - you should always defect.
Expected value of game = 0 + -6 * gamma / (1 - gamma)

For a high gamma, you should always cooperate.
Expected value of game = -1 / (1 - gamma)

The gamma where the strategies are equal is 1/6.

You may need to recalculate these with different rewards, but the same concept applies.

Question 54

Q

What is the folk theorem? (20.14)

Answer

A

“Any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a set of Nash equilibrium payoff profile, with sufficiently large discount factor”

Question 55

Q

What is the minmax/security level profile? (20.13)

Answer

A

The best score you can guarantee yourself against a malicious adversary (eg, defect, defect).

Question 56

Q

Describe the “acceptable/prefereable” region in repeated games. (20.13)

Answer

A

The zone where a better score can be obtained for both players than either could guarantee themselves if they were playing against a malicious adversary.

Question 57

Q

What is Grim’s Trigger? (20.15)

Answer

A

Start in mutual cooperation, but if you ever ‘cross the line’ (defect), ‘I will deal out vengeance forever’ (always defect regardless of what you do).

Question 58

Q

What is subgame perfect? (20.16)

Answer

A

Both plays always play their best response independent of history.

They are not subgame perfect (eg, you’re making an implausible threat) if when the time comes for you to follow through with what your strategy says to do, it would actually be worse for you than doing something different.

You take over the brain of players, set the moves, then release them. Given the choice, would they still follow their strategy or would it be better to do something different? If better to do something different - not subgame perfect.

Question 59

Q

How does a Pavlov machine work? (20.18)

Answer

A

Cooperate if we agree, defect if we disagree. This is a NE for prisoners dilemma and it’s subgame perfect.

No matter what states two pavlov machines playing against each other are in, they will always end up in mutual cooperation. It makes plausible threats.

Question 60

Q

Why is the computational folk theorem important? (20.21)

Answer

A

For a bi matrix, repeated game looking at the long term average reward (assuming a high discount), if it’s possible to have a mutually beneficial relationship you can build a pavlov-like machine for any game and use it to construct subgame perfect NE in polynomial time.

if it’s not possible to have mutual cooperation then it’s a zero-sumlike game
then we would solve using LP which either produces a NE or it doesn’t and at most one player improves.

Question 61

Q

What are the properties of zero-sum stochastic games? (20.26)

Answer

A

you use a minimax(Q*(s’, (a’, b’)) in the bellman equation instead of a max
value iteration works
minimax-Q converges
unique solution to Q*
policies can be computed independently
Q Update can be computed efficiently (LP)
Q functions sufficient to specify policy
not known whether it can be solved in polynomial time like a standard MDP is known to be able to

Question 62

Q

What are the properties of general-sum stochastic games? (20.27)

Answer

A

instead of computing minimax, you would compute the NE in the combined bellman equation (eg, Nash-Q)
value iteration doesn’t work
Nash-Q doesn’t converge
No unique solution to Q*
policies cannot be computed independently b/c it’s defined as a joint behavior
update is not efficient
Q functions not sufficient to specify a policy

:(

Question 63

Q

How is KM an optimization problem? (13.9)

Answer

A

You’re scoring different configurations.

You’re defining a configuration consisting of a partition and it’s center, then trying to minimize the error, which can be seen as:

Sum_x [ ||center_p(x) - x||^2], or the sum of the squared distance from each point to the center.

It is most like hill climbing

Question 64

Q

What is required for MDPs? (17.7)

Answer

A

Markov Property: Only the present matters; you don’t care about history
The environment must be stationary: The rules of the environment can’t change

Brainscape's Knowledge GenomeTM

Final Prep Flashcards

Clustering Onwards

Brainscape's Knowledge Genome^TM