Final Exam Flashcards

1
Q

EM always converges? (T/F)

A

False. In practice it usually does, but since the configuration space is infinite the only guarantee we have is that it will not DIVERGE.

2
Q

What is a WEAKLY RELEVANT feature?

A
  1. Not strongly relevant
  2. There exists a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier (BOC)
3
Q

Why is feature selection difficult?

A

Because from a theoretical perspective, we would have to score all the possible subsets. Given N features from which we want to select a smaller subset of M features, this requires evaluating 2^N subsets. This means the feature selection problem is EXPONENTIAL, and is known to be NP-HARD.

4
Q

What is “Usefulness” (in the context of feature selection) roughly equivalent to?

A

Error GIVEN the model/learner (for example, consider a neural net; some feature x_i might be useful in the sense that performing an operation like y = w.T * x enables the model to achieve better predictions, i.e. lower-error predictions, on average, but might still not be “relevant” to the Bayes Optimal Classifier)

5
Q

PCA will never find independent components? (T/F)

A

False. Because of the mutual orthogonality and maximal variance constraints, there ARE cases where it will find independent components, by virtue of trying to find components that are uncorrelated. However, uncorrelated does not necessarily mean statistically independent.

6
Q

Where does domain knowledge get encoded in Filtering style feature selection algorithms?

A

In the criterion used for the search algorithm, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.

7
Q

What is the clustering impossibility theorem?

A

It’s the result that no clustering algorithm can have all three of the following properties at once (as a set they are mutually exclusive, so you can at best get 2 of the 3):

  1. Richness
  2. Scale Invariance
  3. Consistency

This was proved by Jon Kleinberg.

8
Q

What are the components of an MDP?

A

SART; i.e. states, actions, rewards, transitions. (Dr. Isbell notes that some people include the discount factor gamma, which he disagrees with since he thinks it’s part of the definition of the problem.)

9
Q

Policy iteration eliminates the “max” term in the Bellman equation? (True/False)

A

True. Since in policy iteration we’re already following the policy, we no longer have to compute over all the actions, so that gets rid of the nonlinear max operation.

The reason this is useful is that once the nonlinearity is gone, the Bellman equation becomes a system of n linear equations in n unknowns, which we can solve directly with things like matrix inversion. Each iteration is MORE COMPUTATIONALLY EXPENSIVE (roughly n^3 for the inversion, and also a function of dimensionality) than something like value iteration, but in theory it should converge to the optimal policy in FEWER ITERATIONS.

Another thing that follows from this is that when performing policy iteration, the jumps in utility will be bigger than in value iteration, because we’re iterating in policy space instead of value space.
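For intuition, here is a minimal sketch (assuming numpy and a small, made-up 3-state MDP with the policy already fixed) of how dropping the max turns policy evaluation into an n-by-n linear solve:

```python
import numpy as np

# Hypothetical 3-state MDP with a FIXED policy baked in:
# P_pi[s, s'] = transition probability under the policy, R_pi[s] = expected reward.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
R_pi = np.array([0.0, 1.0, 10.0])
gamma = 0.95

# Without the max, the Bellman equation is linear: V = R_pi + gamma * P_pi @ V.
# Rearranging gives (I - gamma * P_pi) V = R_pi, i.e. n equations in n unknowns.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V)
```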

10
Q

What is one case where PCA would return the same components as ICA?

A

If the data are Gaussian. This is because the distribution that maximizes variance IS the normal distribution (remember that PCA is trying to find the basis vectors that maximize variance!)

11
Q

The configuration space for both K-Means and EM is finite? (T/F)

A

False. K-Means is finite because the cluster assignments are HARD. There are only so many ways we can assign n points to K clusters (K^n in fact), hence finite.

EM on the other hand works in probability space, so there’s an infinite number of possible configurations.

The key takeaway from this is that while K-MEANS IS GUARANTEED TO CONVERGE, THE ONLY THING GUARANTEED BY EM IS THAT IT WON'T DIVERGE!

12
Q

What is the intercluster distance of two objects that are in the same cluster?

A

Really a trick question. Answer is zero

13
Q

What is the time complexity of SLC?

A

O(n^3)

14
Q

What is the time complexity of each iteration of K-Means?

A

O(kn), or O(knd) if we’re considering multiple dimensions d

15
Q

Describe filtering style feature selection?

A

Use some criterion for filtering, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.

Important thing to consider is that THE LEARNER IS NOT INCLUDED.

16
Q

How does the K-Means clustering algorithm work?

A
  1. Pick K “centers” (at random)
  2. Each center “claims” its closest points
  3. Recompute the centers by averaging the clustered points
  4. Repeat until convergence (a minimal sketch follows below)
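A minimal sketch of those four steps, assuming numpy and Euclidean distance (toy code for illustration, not an optimized implementation; it also assumes no cluster ever loses all of its points):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick K "centers" at random (here: K distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Each center "claims" its closest points
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # 3. Recompute the centers by averaging the clustered points
        new_centers = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until convergence (i.e. the centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments
```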
17
Q

Which part of the EM algorithm is usually the most computationally expensive?

A

The E (expectation) step. This is because we’re doing some sort of inference stuff there, e.g. Bayes nets, etc, whereas the M (maximization) step typically just involves counting things.

The lectures mention this isn’t ALWAYS true, but in general it’s a good heuristic.

18
Q

What are some of the advantages and disadvantages of the Filtering style of feature selection algorithms?

A
  1. Pro - Speed: filtering doesn’t involve the learner, so feature selection is fast.
  2. Con - Isolated features: maybe a feature on its own doesn’t seem important, but when combined with another one it is. Since filtering doesn’t include the learner, there’s no way of knowing this.
  3. Con - More generally, the major drawback is that filtering ignores the actual learning problem.
19
Q

What is the intuition behind K-Means?

A

It’s an iterative process of randomly partitioning a dataset, then finding the centroids of each of those partitions, then going back to step one and repartitioning based on the new centroid locations.

This is important because it’s what allows us to think about K-Means in terms of OPTIMIZATION (specifically hill climbing).

20
Q

How is PCA in some sense a method for determining correlation?

A

Because it is maximizing variance in the data.

21
Q

Would ICA be better at finding global or local features?

A

Local. The lecture videos bring up an interesting point how in natural scenes ICA actually recovers EDGES.

22
Q

What are the two broad categories of Feature Selection algorithms?

A

Filtering and Wrapping

23
Q

What are some other types of hierarchical agglomerative cluster models besides SLC?

A

Max (complete link), average, and median linkage.

24
Q

What is the main advantage of Random Component Analysis?

A

It’s very fast.

25
Q

Using entropy as a criterion for feature selection is difficult because you need to know the labels of the data? (T/F)

A

False. This is one of the big advantages of criteria like entropy over things like information gain, which does require labels.

26
Q

What are some filtering criteria you could use for feature selection?

A
  1. Information gain
  2. Variance, entropy
  3. “Useful” features (e.g. maybe train a neural net and prune away features where the weights are zero)
  4. Independent/non-redundant
27
Q

What learning algorithm could you use as a Filtering style feature selector? Describe how it could be used.

A

One example would be a decision tree. You could use its feature importances to select a subset of the most important features, and then pass that subset on to another learner. The advantage of doing this is that different learning algorithms have different inductive biases, so there may be some advantage to doing feature selection with one algorithm but the actual learning with a different one (e.g. maybe you need a learner that’s more robust to noisy data, one that has faster inference times, one that fits within some memory/CPU requirements, etc.).
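A hedged sketch of that idea with scikit-learn (the dataset, the importance threshold, and the downstream k-NN learner are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter step: fit a decision tree and keep only the more important features.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
selector = SelectFromModel(tree, threshold="median", prefit=True)
X_reduced = selector.transform(X)

# Pass the reduced feature set to a DIFFERENT learner with its own inductive bias.
knn = KNeighborsClassifier().fit(X_reduced, y)
```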

28
Q

Given a set of points, what is the maximum likelihood estimate for the mean of a Gaussian centered around that dataset?

A

It’s just the mean of the data, for the same reasons we described in K-Means: it can be shown using calculus that the mean is the best way of minimizing error in the least-squares sense.
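A one-line version of that calculus argument (standard result, shown here for reference): maximizing the Gaussian likelihood is the same as minimizing the squared error, and setting its derivative to zero yields the sample mean.

```latex
\frac{\partial}{\partial \mu} \sum_{i=1}^{n} (x_i - \mu)^2
  = -2 \sum_{i=1}^{n} (x_i - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i
```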

29
Q

Unlike K-Means, EM can never get stuck in local optima? (T/F)

A

False, and for the same reason as K-Means. We’re still relying on random initialization of clusters, so if we get a bad initial assignment, it is still possible for EM to get stuck.

30
Q

The reward we get for being in some state is the same thing as the utility for that state? (True/False)

A

False. Reward is all about IMMEDIATE gratification. But the utility considers “how good is it to be in that state”, which is a function of the (discounted) sum of the future rewards that I will get (in expectation) for starting in that state and following some policy pi. So utility is about LONG TERM REWARD.
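In symbols (standard definition, included for reference):

```latex
U^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s,\ \pi \right]
```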

31
Q

What is the Bayes Optimal Classifier?

A

It’s the idea of “what is the best I could do” on a given learning problem. It’s not any particular algorithm; rather, it is the best I could achieve given ALL possible algorithms.

32
Q

Policy iteration is guaranteed to converge to the optimal policy? (True/False)

A

True. (Dr. Isbell mentions that the argument for why this is the case is similar to the argument for why K-Means works.)

33
Q

Which random optimization algorithm is like K-Means? Why is that?

A

Hill climbing. This is because of the iterative process that maximizes the score (i.e. minimizes the error) as the centroids of the partitions are moved around.

34
Q

Describe wrapping style feature selection?

A

Iterative process of searching for features (typically using some sort of randomized optimization algorithm) and then passing that feature subset to the learner to score it, and then repeating the process.

35
Q

What are some ways of performing a “wrapping” style feature selection algorithm?

A

  1. Hill Climbing
  2. Randomized Optimization
  3. Forward selection
36
Q

What is one way of dealing with the problem of getting stuck in local optima with K-Means or EM?

A

Random restarts (just like hill climbing)

37
Q

How difficult is the feature selection problem?

A

Given N features, it is 2^N

38
Q

What are some of the important properties of the EM algorithm?

A
  1. It is monotonically non-decreasing in terms of likelihood
  2. Not guaranteed to converge (although it usually does in practice)
  3. Will not diverge
  4. Can also get stuck (just like K-Means)
  5. Works with any probability distribution (if E, M are solvable)
39
Q

How does Single Linkage Clustering work?

A
  1. Consider each object (i.e. data point) as its own cluster (so we start with n clusters)
  2. Define the intercluster distance as the distance between the CLOSEST two points in the two clusters. (So starting out, the intercluster distance is just the interobject distance.)
  3. Merge the two closest clusters
  4. Repeat n-k times to end up with k clusters (see the sketch below).
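A naive sketch of these steps, assuming numpy (illustration only; real implementations such as scipy's `linkage` are much more efficient):

```python
import numpy as np

def single_linkage(X, k):
    # 1. Each object starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # 4. Merge n - k times, leaving k clusters
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # 2. Intercluster distance = distance between the CLOSEST two points
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        # 3. Merge the two closest clusters
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```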
40
Q

How many iterations are possible in K-Means?

A

Finite (exponential) iterations: O(K^n)

41
Q

Describe the Soft Clustering algorithm?

A

**Assume the data were generated by normal distributions for this example**

  1. Select one of K gaussians (with a fixed known variance) uniformly
  2. Sample x_i from that Gaussian
  3. Repeat n times

Task: Find a hypothesis h = {mu_1, ..., mu_K} (the K means) that maximizes the probability of the data (in terms of maximum likelihood)
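A minimal EM sketch for this setup, assuming numpy, 1-D data, K Gaussians with a fixed known variance, and a uniform prior over the Gaussians (illustration only):

```python
import numpy as np

def em_soft_clustering(x, k, sigma=1.0, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # initial guesses for the K means
    for _ in range(n_iters):
        # E-step: soft assignment P(z_j | x_i) of each point to each Gaussian.
        # The uniform prior over the K Gaussians cancels in the normalization.
        lik = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as a responsibility-weighted average.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```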

42
Q

What is one of the main differences between PCA and ICA?

A

PCA ~= Correlation
ICA ~= Independence

43
Q

Why is the prior not included in the Expectation Maximization algorithm?

A

Because EM uses the maximum likelihood estimate, i.e. we assume a uniform prior, so it can just be dropped from the calculation in the EM update.

44
Q

What are 3 reasons that feature selection is important?

A
  1. Knowledge discovery
  2. Interpretability and insight
  3. Curse of dimensionality (amount of data we need grows exponentially as we increase the features)
45
Q

What is the idea of “consistency” that Dr. Littman talks about in regards to clustering?

A

It’s the idea that if we make the points within a cluster more similar to each other, or make the points in two different clusters more dissimilar, we shouldn’t get a different grouping.

46
Q

What is the Bellman Equation?

A
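The standard form: U(s) = R(s) + gamma * max_a SUM_s' T(s, a, s') * U(s'), where T(s, a, s') is the transition probability and gamma is the discount factor. In words: the utility of a state is its immediate reward plus the discounted utility obtained by acting optimally from there on. (Policy iteration, in the earlier card, is what lets us drop the max.)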
47
Q

If a PCA component has an eigenvalue of 0, what should we do with that component? Why?

A

Throw it away because it contains no useful information.

48
Q

Single link clustering terminates fast?

A

True (mentioned in summary at end of lectures).

49
Q

In terms of feature SELECTION, what is one feature TRANSFORMATION algorithm that is similar to filtering? Why?

A

PCA. It’s similar because once we find the new set of basis vectors aligned in the directions of maximal variance, we can throw away (i.e. filter out) the components that don’t contribute to the variance.

50
Q

If you have a strongly relevant feature, and you make a copy of it, then both features will be strongly relevant? (T/F)

A

False. By making a copy of it, you’ve made the feature redundant, so now the features are only weakly relevant.

51
Q

What are some properties of PCA?

A
  1. Maximizes variance
  2. Orthogonal
  3. Global algorithm (i.e. each of the PCs are orthogonal to one another)
  4. Best reconstruction (minimizes L2 error moving from N to M dimensions)
  5. Well studied, so lots of fast algorithms exist for it.
52
Q

How does Forward Feature Selection work?

A

Start with each single feature, score it, and keep whichever feature is best. Then try adding each remaining feature alongside it, score again, and keep the best combination. Keep going until you meet some stopping criterion (e.g. the score no longer improves by some minimum amount). A sketch follows below.
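A hedged sketch of forward selection, assuming scikit-learn, cross-validated accuracy as the score, and "stop when the score no longer improves" as the stopping criterion (all arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
learner = LogisticRegression(max_iter=5000)

selected, best_score = [], -np.inf
remaining = list(range(X.shape[1]))
while remaining:
    # Score every candidate set = (features kept so far) + one more feature
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:   # stop once adding a feature no longer helps
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print(selected, best_score)
```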

53
Q

K-Means does not struggle with getting stuck in local optima? (T/F)

A

False. It’s definitely possible to get stuck if the initial random cluster assignment is bad.

54
Q

Give an example of a PCA component that would have an eigenvalue of zero. What should we do with that component?

A

A PCA component with an eigenvalue of zero is irrelevant; it contains no useful information. You could think of points lying exactly on the line y=1: there is zero variance (and zero entropy) along that direction, hence it is not relevant (although it could possibly be useful to something like a NN classifier in terms of prediction error). Therefore, if the eigenvalue is zero, we should just get rid of the component.

55
Q

A feature x_i that is neither strongly relevant nor weakly relevant it is ________.

A

Irrelevant

56
Q

At a high level, what is PCA doing?

A

It’s lining up a new set of orthogonal basis vectors that are aligned with the direction of the maximal variance of the data.

Another way of looking at it is that it is trying to find all the orthogonal gaussians in a dataset (because a Gaussian distribution is the distribution that maximizes variance!)
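A minimal numpy sketch of that view, computing the components as eigenvectors of the covariance matrix sorted by variance (illustration only; library implementations typically use the SVD instead):

```python
import numpy as np

def pca(X, m):
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by variance, largest first
    components = eigvecs[:, order[:m]]         # the new orthogonal basis vectors
    return Xc @ components, eigvals[order]     # projected data, variances
```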

57
Q

The error in K-Means can go up?

A

False. It is monotonically non-increasing in error. Things can only be reassigned to a new cluster if the error goes DOWN. This is because the average is the best way of describing the cluster in a least squares sense.

Note that this does require that ties are broken consistently, but as long as that is true, THE ERROR NEVER INCREASES FOR K-MEANS.

58
Q

What are some of the advantages and disadvantages of the Wrapping style of feature selection algorithms?

A

Major Pro: Takes into account the MODEL BIAS
Major Pro: Actually considers the LEARNING PROBLEM
Major Con: VERY Slow

59
Q

What’s one way of connecting RL to some of the earlier lectures in the course?

A

The idea of rewards taking the place of teachers. Basically the reward acts as the “teaching signal” in place of an explicit teacher telling you what is good or bad.

You can also think of the rewards as taking the place of domain knowledge.

60
Q

How many configurations are possible in K-Means?

A

K^n

61
Q

Single linkage clustering is a non-deterministic algorithm? (T/F). Justify your reason.

A

False. The distance between objects doesn’t change, so it’s deterministic.

62
Q

What are three properties that are desirable for clustering algorithms?

A
  1. Richness: Ability to assign any inputs to any clustering
  2. Scale Invariance: scaling distances by a positive value does not change the clustering
  3. Consistency: shrinking INTRA-cluster distances and expanding INTER-cluster distances does not change the clustering.
63
Q

What is the computational complexity of choosing the best M features from a set of N features?

A

O(2^N). It is known to be an NP-HARD problem.

64
Q

What is “Relevance” (in the context of feature selection) roughly equivalent to?

A

Information. This is because it’s all about the Bayes Optimal Classifier. Relevant things are THINGS THAT GIVE US INFORMATION.

65
Q

The number of configurations in K-Means is infinite? (T/F)

A

False.

66
Q

Reward and utility are equivalent? (T/F)

A

False. Reward is all about immediate gratification, whereas utility is all about long-term (i.e. delayed rewards).

67
Q

What is the Filtering style of algorithm for feature selection?

A

With filtering, the scoring of the subset of features occurs INSIDE the feature selection algorithm, and then once the best subset is found it is passed on to the learner.

68
Q

What is a STRONGLY RELEVANT feature?

A

x_i is strongly relevant if removing it degrades Bayes Optimal Classifier (BOC), i.e. if without that feature you couldn’t achieve a Bayes Optimal Classifier

69
Q

What is the difference between Relevance and Usefulness?

A
  1. RELEVANCE measures effect on BAYES OPTIMAL CLASSIFIER
  2. USEFULNESS measures effect on a PARTICULAR predictor
70
Q

What is one way of thinking about what unsupervised learning is?

A

Compact description

71
Q

The “center” used in K-Means is always one of the datapoints?

A

False. The centroid need not actually be in the set of data points. It’s just describing the central tendency of the cluster (i.e. the “Means” part of K-Means).

72
Q

What are the two different fundamental epistemological assumptions that PCA and ICA make?

A
  1. PCA: That the world is made up of a bunch of Gaussians, so finding the most important components is just a matter of finding the ones that are uncorrelated with one another.
  2. ICA: That the world consists of sources that are highly NON-Gaussian, but when added together as part of a linear combination they become Gaussian in the limit via the Central Limit Theorem.
73
Q

What is one of the key assumptions of the sources of ICA?

A

That they are INDEPENDENT of one another (remember the microphone example)

74
Q

What analogy does Dr. Littman use that is a good summary of Forward/Backward Feature Selection algorithms?

A

Imagine a dodgeball team:

  1. Using FFS, you’d start with the best player, and then continue adding best players until you reach some criterion.
  2. Using BFS, you’d start by eliminating the worst player, and then continue eliminating until you’ve reached some stopping criterion (e.g. a minimum number of players/features).
75
Q

Would PCA be better at finding global or local features?

A

Global, because of its Gaussian/maximal-variance nature.

76
Q

What are examples of metric and non-metric statistics?

A

The median is a non-metric statistic; the mean IS a metric statistic. The reason this is important is that for metric statistics like the mean, the actual values of the data matter a lot (not just their relative ordering).

77
Q

What algorithm is Expectation Maximization similar to?

A

K-Means, because it’s an iterative process of assigning X to having been generated by some set of latent variables Z, and then moving the means based on those latent assignments, and then beginning the process all over.

CLUSTERING -> MAXIMIZATION
then repeat until convergence

78
Q

What is mutual information?

A

The amount of information that one variable X contains about another variable Y.

79
Q

The Expectation Maximization algorithm is close to, but can never quite be the same as K-Means? (T/F)

A

False. It ends up being the same thing if the cluster assignments use the argmax. (Have to watch the EM lecture videos to see the math for this)

80
Q

When doing feature selection, even if a feature doesn’t seem to be useful for learning, we might as well leave it in there. It isn’t harming anything, right? (T/F)

A

False. Ceteris paribus, leaving a feature in hurts in at least two ways:

  1. Time (more features means more computation)
  2. Sample efficiency. This is because of the curse of dimensionality, which says that we need exponentially more data as we increase the number of features.
81
Q

What is the Wrapping style of algorithm for feature selection?

A

With wrapping, the feature selection and the learner are embedded in the same algorithm and iterate back and forth: the feature selection algorithm outputs a subset of features, the learner takes that subset as input, scores it, and passes the score back to the feature selection algorithm, and the process starts all over again.

82
Q

In Q-Learning, what would happen if you set the learning rate to 0? To 0.5? To 1.0?

A

At 0, no learning would happen. At 1.0, you would basically just be discarding the old value and moving straight to the new estimate. 0.5 would then essentially just be the average between your new and old estimates.
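For reference, this is the update that the learning rate alpha is blending (the standard Q-learning rule, written in the deck's plain-text style):

Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * [r + gamma * max_a' Q(s', a')]

With alpha = 0 the old estimate is kept unchanged; with alpha = 1 it is replaced outright by the new sample; alpha = 0.5 averages the two.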

83
Q

What requirement is made on the learning rate in Q-Learning?

A

The sum of the learning rates must be infinite, and the sum of the squares of the learning rates must be finite (i.e. sum alpha_t = infinity, sum alpha_t^2 < infinity).

84
Q

Q-Learning does not converge to the optimal Q function? (T/F)

A

False. It does, but ONLY assuming all s, a pairs are visited infinitely often.

85
Q

What is a “Folk Theorem”?

A

Describes the set of payoffs that can result from Nash Strategies in REPEATED games

86
Q

In repeated games, the possibility of retaliation opens the door for _____?

A

Cooperation

87
Q

What is subgame perfect?

A

A strategy is subgame perfect if it is always a best response, independent of history.