Final Exam Flashcards

Question 1

Q

What is a WEAKLY RELEVANT feature?

Answer

A

x_i is not strongly relevant
There exists a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier (BOC)

Question 2

Q

Why is feature selection difficult?

Answer

A

Because, from a theoretical perspective, we would have to score every possible subset. Given N features that we want to reduce to some subset of M features, this requires up to 2^N runs. This means the feature selection problem is EXPONENTIAL, and it is known to be NP-HARD.
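
To make the exponential count concrete, here is a minimal sketch that enumerates every subset of a small feature set. The feature names are made up for illustration; a real search would also have to score each subset.

```python
from itertools import combinations

def all_subsets(features):
    """Return every subset of `features`, including the empty set."""
    subsets = []
    for size in range(len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets

# Hypothetical feature names, used only for illustration.
features = ["age", "height", "weight", "income"]
subsets = all_subsets(features)
print(len(subsets))  # one subset per on/off choice of each feature
```

With 4 features there are 16 subsets; with 10 there are already 1024, which is why exhaustive scoring stops being practical very quickly.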

Question 3

Q

EM always converges? (T/F)

Answer

A

False. In practice it usually does, but since the configuration space is infinite the only guarantee we have is that it will not DIVERGE.

Question 4

Q

What is “Usefulness” (in the context of feature selection) roughly equivalent to?

Answer

A

Error GIVEN the model/learner (for example, consider a neural net; some feature x_i might be useful in the sense that performing an operation like y=w.T*x enables the model to achieve better predictions, i.e. lower error predictions, on average, but might still not be “relevant” to a Bayes Optimal Classifier)

Question 5

Q

PCA will never find independent components? (T/F)

Answer

A

False. Because of the mutual orthogonality and maximal variance constraints, there ARE cases where it will find independent components by virtue of trying to find components that are uncorrelated. However, uncorrelated does not necessarily mean statistically independent.
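
A small sketch of why uncorrelated is weaker than independent. The variables are an assumed toy example (X uniform on {-1, 0, 1} and Y = X^2), not something from the lectures.

```python
# X is uniform on {-1, 0, 1} and Y = X**2, so Y is fully determined by X.
xs = [-1, 0, 1]
ys = [x * x for x in xs]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0) / len(xs)
p_x0 = xs.count(0) / len(xs)
p_y0 = ys.count(0) / len(ys)

print(cov)                   # covariance of X and Y
print(p_joint, p_x0 * p_y0)  # joint probability vs product of marginals
```

The covariance is zero, yet the joint probability does not factor into the marginals, so X and Y are uncorrelated but dependent.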

Question 6

Q

Where does domain knowledge get encoded in Filtering style feature selection algorithms?

Answer

A

In the criterion used for the search algorithm, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.

Question 7

Q

What is the clustering impossibility theorem?

Answer

A

It’s the result that no clustering algorithm can have all three of the following properties at once (they’re mutually incompatible, so you can at best get 2 of the 3):

Richness
Scale Invariance
Consistency

This was proved by Jon Kleinberg.

Question 8

Q

What are the components of an MDP?

Answer

A

SART; i.e. states, actions, rewards, transitions. (Dr. Isbell notes that some people include the discount factor gamma, which he disagrees with since he thinks it’s part of the definition of the problem.)

Question 9

Q

Policy iteration eliminates the “max” term in the Bellman equation? (True/False)

Answer

A

True. Since in policy iteration we’re already following the policy, we no longer have to compute over all the actions, so that gets rid of the nonlinear function.

The reason this is useful is that once we get rid of the nonlinear operation, the Bellman equation becomes a system of n linear equations in n unknowns, which we can solve directly with things like matrix inversion. Each iteration is MORE COMPUTATIONALLY EXPENSIVE (roughly n^3 for inversions, also is a function of dimensionality) than an iteration of something like value iteration, but in theory it should converge to the optimal policy in FEWER ITERATIONS.

Another thing that follows from this is that when performing policy iteration, the jumps in utility will be bigger than in value iteration because we’re iterating policy space instead of value space.
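
As a rough illustration of the linear system that appears once the policy is fixed, here is a sketch for an assumed 2-state MDP. The rewards, transition probabilities, and gamma are made-up numbers, and the 2x2 system is solved by hand with Cramer’s rule rather than a general matrix inversion.

```python
# Hypothetical 2-state MDP under a FIXED policy (all numbers are made up).
gamma = 0.9
R = [1.0, 0.0]        # reward in each state
P = [[0.5, 0.5],      # P[s][t] = probability of moving s -> t under the policy
     [0.2, 0.8]]

# With the policy fixed there is no max, so V = R + gamma * P V is linear:
# (I - gamma * P) V = R. Solve the 2x2 system with Cramer's rule.
a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
det = a * d - b * c
V = [(R[0] * d - b * R[1]) / det, (a * R[1] - c * R[0]) / det]

# Sanity check: V should satisfy the fixed-policy Bellman equation.
residual = [V[s] - (R[s] + gamma * sum(P[s][t] * V[t] for t in range(2)))
            for s in range(2)]
print(V, residual)
```

The residuals come out at (numerically) zero, which is the sense in which policy evaluation is solved exactly in one shot instead of by repeated backups.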

Question 10

Q

What is one case where PCA would return the same components as ICA?

Answer

A

If the data are Gaussian. PCA finds uncorrelated components, and for jointly Gaussian data uncorrelated implies independent, so the components PCA finds are also independent (remember that PCA is trying to find the basis vectors that maximize variance!)

Question 11

Q

The configuration space for both K-Means and EM is finite? (T/F)

Answer

A

False. K-Means is finite because the cluster assignments are HARD. There are only so many ways we can assign n points to K clusters (K^n in fact), hence finite.

EM on the other hand works in probability space, so there’s an infinite number of possible configurations.

The key takeaway from this is that while K-MEANS IS GUARANTEED TO CONVERGE, THE ONLY THING GUARANTEED BY EM IS THAT IT WON’T DIVERGE!

Question 12

Q

What is the intercluster distance of two objects that are in the same cluster?

Answer

A

Really a trick question. Answer is zero

Question 13

Q

What is the time complexity of SLC?

Question 14

Q

What is the time complexity of each iteration of K-Means?

Answer

A

O(kn), or O(knd) if we’re considering multiple dimensions d

Question 15

Q

Describe filtering style feature selection?

Answer

A

Use some criterion for filtering, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.

Important thing to consider is that THE LEARNER IS NOT INCLUDED.

Question 16

Q

How does the K-Means clustering algorithm work?

Answer

A

Pick K “centers” (at random)
Each center “claims” its closest points
Recompute the centers by averaging the clustered points
Repeat until convergence
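
The four steps above can be sketched in plain Python as follows. This is a minimal, unoptimized version; the seeded random generator, the handling of empty clusters, and the sample points are assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iters=100):
    """Basic K-Means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick K centers at random
    for _ in range(max_iters):
        # Each center "claims" its closest points.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each center as the average of its points.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(c)        # assumption: keep an empty cluster's center
        if new_centers == centers:           # converged: no center moved
            break
        centers = new_centers
    return centers, clusters

# Made-up data: two well-separated blobs of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On this toy data the two recovered centers are the averages of the two blobs. On harder data a bad random start can still leave the algorithm in a local optimum.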

Question 17

Q

Which part of the EM algorithm is usually the most computationally expensive?

Answer

A

The E (expectation) step. This is because we’re doing some sort of inference stuff there, e.g. Bayes nets, etc, whereas the M (maximization) step typically just involves counting things.

The lectures mention this isn’t ALWAYS true, but in general it’s a good heuristic.

Question 18

Q

What are some of the advantages and disadvantages of the Filtering style of feature selection algorithms?

Answer

A

Pro:
Speed. Doesn’t involve the learner, so FS is fast

Con:
Isolated features - maybe a feature on its own doesn’t seem important, but when combined with another one it is. Since filtering doesn’t include the learner, there is no way of knowing this.
More generally, the major con is that it ignores the actual learning problem.

Question 19

Q

What is the intuition behind K-Means?

Answer

A

It’s an iterative process of randomly partitioning a dataset, and then finding the centroids of each of those partitions, then going back to step one and repartitioning based on the new centroid locations.

This is important because it’s what allows us to think about K-Means in terms of OPTIMIZATION (specifically hill climbing).

Question 20

Q

How is PCA in some sense a method for determining correlation?

Answer

A

Because it is maximizing variance in the data.

Question 21

Q

Would ICA be better at finding global or local features?

Answer

A

Local. The lecture videos bring up an interesting point: on natural scenes, ICA actually recovers EDGES.

Question 22

Q

What are the two broad categories of Feature Selection algorithms?

Answer

A

Filtering and Wrapping

Question 23

Q

What are some other types of hierarchical agglomerative cluster models besides SLC?

Answer

A

max, average, median

Question 24

Q

What is the main advantage of Random Component Analysis?

Answer

A

It’s very fast.

Question 25

Q

Using entropy as a criterion for feature selection is difficult because you need to know the labels of the data? (T/F)

Answer

A

False. This is one of the big advantages of a criterion like entropy over things like information gain, which does require labels.

Question 26

Q

What are some filtering criteria you could use for feature selection?

Answer

A

Information gain
Variance, entropy
“Useful” features (e.g. maybe train a neural net and prune away features where the weights are zero)
Independent/non-redundant

Question 27

Q

What learning algorithm could you use as a Filtering style feature selector? Describe how it could be used.

Answer

A

One example would be a decision tree. You could use the feature importances to select a subset of the most important features, and then pass that on to another learner. The advantage of doing this is that different learning algorithms suffer from different types of inductive biases, so there may be some advantage to doing feature selection using one algorithm but a different algo for the actual learning (e.g. maybe you need a learner that’s more robust to noisy data, one that has faster inference times, one that fits within some memory/cpu requirements, etc.).

Question 28

Q

Given a set of points, what is the maximum likelihood estimate for the mean of a Gaussian centered around that dataset?

Answer

A

It’s just the mean of the data, for the same reasons we described in K-Means: it can be proved using calculus that the mean is the best way of minimizing error in the least squares sense.
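
A quick numerical check of the least-squares claim, using a made-up sample. It is a spot check against a few nearby candidates, not a proof.

```python
def squared_error(data, center):
    """Sum of squared distances from each point to `center`."""
    return sum((x - center) ** 2 for x in data)

data = [2.0, 4.0, 4.0, 6.0, 9.0]   # made-up sample
mean = sum(data) / len(data)

# Any other candidate center should do worse than the mean.
candidates = [mean + delta for delta in (-1.0, -0.1, 0.1, 1.0)]
errors = [squared_error(data, c) for c in candidates]
print(squared_error(data, mean), min(errors))
```

Every shifted candidate has a larger squared error than the mean itself.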

Question 29

Q

Unlike K-Means, EM can never get stuck in local optima? (T/F)

Answer

A

False, and for the same reason as K-Means. We’re still relying on random initialization of clusters, so if we get a bad initial assignment, it is still possible for EM to get stuck.

Question 30

Q

The reward we get for being in some state is the same thing as the utility for that state? (True/False)

Answer

A

False. Reward is all about IMMEDIATE gratification. But the utility considers “how good is it to be in that state”, which is a function of the (discounted) sum of the future rewards that I will get (in expectation) for starting in that state and following some policy pi. So utility is about LONG TERM REWARD.
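
A small sketch of the difference, assuming a single made-up reward sequence and gamma = 0.9. True utility is an expectation over trajectories under the policy; this only computes the discounted sum along one trajectory.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Made-up trajectory: nothing now, a large payoff three steps later.
rewards = [0.0, 0.0, 0.0, 10.0]
immediate = rewards[0]
long_term = discounted_return(rewards, 0.9)
print(immediate, long_term)
```

The immediate reward of the start state is zero, but its long-term value is not, which is exactly the gap between reward and utility.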

Question 31

Q

What is the Bayes Optimal Classifier?

Answer

A

It’s the idea of “what is the best I could do” on a given learning problem. It’s not any particular algorithm; it’s the best performance achievable over ALL algorithms.

Question 32

Q

Policy iteration is guaranteed to converge to the optimal policy? (True/False)

Answer

A

True. (Dr. Isbell mentions that the argument for why this is the case is similar to the argument for why K-Means works.) Policy iteration also tends to converge in fewer iterations than value iteration. As a result, it is often reported to finish faster than value iteration, even though each iteration costs more.

Question 33

Q

Which random optimization algorithm is like K-Means? Why is that?

Answer

A

Hill climbing. This is because of the iterative process that maximizes the score (i.e. minimizes the error) as the centroids of the partitions are moved around.

Question 34

Q

Describe wrapping style feature selection?

Answer

A

Iterative process of searching for features (typically using some sort of randomized optimization algorithm) and then passing that feature subset to the learner to score it, and then repeating the process.

Question 35

Q

What are some ways of performing a “wrapping” style feature selection algorithm?

Answer

A

Hill climbing
Randomized optimization
Forward selection

Question 36

Q

What is one way of dealing with the problem of getting stuck in local optima with K-Means or EM?

Answer

A

Random restarts (just like hill climbing)

Question 37

Q

How difficult is the feature selection problem?

Answer

A

Given N features, there are 2^N possible subsets to score, so it is exponential in N.

Question 38

Q

What are some of the important properties of the EM algorithm?

Answer

A

It is monotonically non-decreasing in terms of likelihood
Not guaranteed to converge (although it usually does in practice)
Will not diverge
Can also get stuck (just like K-Means)
Works with any probability distribution (if E, M are solvable)

Question 39

Q

How does Single Linkage Clustering work?

Answer

A

Consider each object (i.e. data point) as its own cluster (i.e. n objects)
Define intercluster distance as the distance between the CLOSEST two points in the two clusters. (So starting out, the intercluster distance is just the interobject distance)
Merge two closest clusters
Repeat n-k times to make k clusters.
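
The steps above can be sketched as a brute-force implementation. It recomputes all pairwise cluster distances on every merge, so it is meant for reading rather than speed; the Euclidean distance and the sample points are assumptions made for the example.

```python
def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        # Single linkage: distance between the CLOSEST pair of points.
        return min(dist(p, q) for p in c1 for q in c2)

    clusters = [[p] for p in points]      # every point starts as its own cluster
    while len(clusters) > k:              # runs n - k times
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

# Made-up points: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = single_linkage(points, 3)
print(result)
```

Ties are broken by taking the first closest pair found, which keeps the procedure deterministic for a fixed input order.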

Question 40

Q

How many iterations are possible in K-Means?

Answer

A

Finite (exponential) iterations: O(K^n)

Question 41

Q

Describe the Soft Clustering algorithm?

Answer

A

**Assume data was generated by normal distribution for this example**

Select one of K gaussians (with a fixed known variance) uniformly
Sample x_i from that Gaussian
Repeat n times

Task: Find a hypothesis h= that maximizes the probability of the data (in terms of maximum likelihood)

Question 42

Q

What is one of the main differences between PCA and ICA?

Answer

A

PCA ~= Correlation
ICA ~= Independence

Question 43

Q

Why is the prior not included in the Expectation Maximization algorithm?

Answer

A

Because EM uses the maximum likelihood estimate, i.e. we assume a uniform prior, so it can just be dropped from the calculation in the EM update.

Question 44

Q

What are 3 reasons that feature selection is important?

Answer

A

Knowledge discovery
Interpretability and insight
Curse of dimensionality (amount of data we need grows exponentially as we increase the features)

Question 45

Q

What is the idea of “consistency” that Dr. Littman talks about in regards to clustering?

Answer

A

It’s the idea that if we make a bunch of points within a cluster more similar, or make the points in two clusters more dissimilar, we shouldn’t get different groupings.

Question 46

Q

What is the Bellman Equation?

Question 47

Q

If a PCA component has an eigenvalue of 0, what should we do with that component? Why?

Answer

A

Throw it away because it contains no useful information.

Question 48

Q

Single link clustering terminates fast?

Answer

A

True (mentioned in summary at end of lectures).

Question 49

Q

In terms of feature SELECTION, what is one feature TRANSFORMATION algorithm that is similar to filtering? Why?

Answer

A

PCA. It’s similar because once we have a new set of basis vectors aligned with the directions of maximal variance, we can throw away (i.e. filter) the components that don’t contribute to the variance.

Question 50

Q

If you have a strongly relevant feature, and you make a copy of it, then both features will be strongly relevant? (T/F)

Answer

A

False. By making a copy of it, you’ve made the feature redundant, so now the features are only weakly relevant.

Question 51

Q

What are some properties of PCA?

Answer

A

Maximizes variance
Orthogonal
Global algorithm (i.e. each of the PCs are orthogonal to one another)
Best reconstruction (minimizes L2 error moving from N to M dimensions)
Well studied, so lots of fast algorithms exist for it.

Question 52

Q

How does Forward Feature Selection work?

Answer

A

Score each feature on its own and keep whichever is best. Then try adding each remaining feature alongside it, score again, and keep the best addition. Keep going until you achieve some criterion, e.g. a minimum score, etc.
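
A minimal sketch of greedy forward selection. The `score` function is a hypothetical stand-in for “train the learner on this subset and return its score”, and the weights and stopping rule (stop when nothing improves or a size cap is reached) are assumptions made for the example.

```python
def forward_selection(features, score, max_features):
    """Greedy forward selection; `score` maps a feature list to a number."""
    selected = []
    best_score = score(selected)
    while len(selected) < max_features:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        # Try adding each remaining feature and keep the best one.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:   # stop when nothing improves
            break
        selected.append(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical per-feature contributions, used only for illustration.
weights = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.1}

def toy_score(subset):
    return sum(weights[f] for f in subset)

chosen, chosen_score = forward_selection(list(weights), toy_score, max_features=3)
print(chosen, chosen_score)
```

Because the toy score is additive, the greedy order simply follows the largest weights; with a real learner, features can interact and the greedy choice is not guaranteed to be optimal.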

Question 53

Q

K-Means does not struggle with getting stuck in local optima? (T/F)

Answer

A

False. It’s definitely possible to get stuck if the initial random cluster assignment is bad.

Question 54

Q

Give an example of a PCA component that would have an eigenvalue of zero. What should we do with that component?

Answer

A

A PCA component with an eigenvalue of zero is irrelevant; it contains no useful information. You could think of points lying exactly on the line y=1. There would be zero entropy in the y values of the points, hence that component would not be relevant (although it could possibly be useful to something like a NN classifier in terms of prediction error). Therefore, if the eigenvalue is zero, we should just get rid of the component.

Question 55

Q

A feature x_i that is neither strongly relevant nor weakly relevant is ________.

Answer

A

Irrelevant

Question 56

Q

At a high level, what is PCA doing?

Answer

A

It’s lining up a new set of orthogonal basis vectors that are aligned with the direction of the maximal variance of the data.

Another way of looking at it is that it is trying to find all the orthogonal gaussians in a dataset (because, for a given variance, a Gaussian is the distribution that maximizes entropy).

Question 57

Q

The error in K-Means can go up?

Answer

A

False. It is monotonically non-increasing in error. Things can only be reassigned to a new cluster if the error goes DOWN. This is because the average is the best way of describing the cluster in a least squares sense.

Note that this does require that ties are broken consistently, but as long as that is true, ERROR NEVER INCREASES FOR K-MEANS.

Question 58

Q

What are some of the advantages and disadvantages of the Wrapping style of feature selection algorithms?

Answer

A

Major Pro: Takes into account the MODEL BIAS
Major Pro: Actually considers the LEARNING PROBLEM
Major Con: VERY Slow

Question 59

Q

What’s one way of connecting RL to some of the earlier lectures in the course?

Answer

A

The idea of rewards taking the place of teachers. Basically the reward acts as the “teaching signal” in place of an explicit teacher telling you what is good or bad.

You can also think of the rewards as taking the place of domain knowledge.

Question 60

Q

How many configurations are possible in K-Means?

Question 61

Q

Single linkage clustering is a non-deterministic algorithm? (T/F). Justify your reason.

Answer

A

False. The distance between objects doesn’t change, so it’s deterministic.

Question 62

Q

What are three properties that are desirable for clustering algorithms?

Answer

A

Richness: Ability to assign any inputs to any clustering
Scale Invariance: scaling distances by a positive value does not change the clustering
Consistency: shrinking INTRA-cluster distances and expanding INTER-cluster distances does not change the clustering.

Question 63

Q

What is the computational complexity of choosing the best M features from a set of N features?

Answer

A

O(2^N). It is known to be an NP-HARD problem.

Question 64

Q

What is “Relevance” (in the context of feature selection) roughly equivalent to?

Answer

A

Information. This is because it’s all about the Bayes Optimal Classifier. Relevant things are THINGS THAT GIVE US INFORMATION.

Answer 62

A

False. Reward is all about immediate gratification, whereas utility is all about long-term (i.e. delayed rewards).

Answer 63

A

With filtering, the scoring of the subset of features occurs INSIDE the feature selection algorithm, and then once the best subset is found it is passed on to the learner.

Answer 64

A

x_i is strongly relevant if removing it degrades Bayes Optimal Classifier (BOC), i.e. if without that feature you couldn’t achieve a Bayes Optimal Classifier

Answer 65

A

RELEVANCE measures effect on BAYES OPTIMAL CLASSIFIER
USEFULNESS measures effect on a PARTICULAR predictor

Answer 66

A

Compact description

Answer 67

A

False. The centroid need not actually be in the set of data points. It’s just describing the central tendency of the cluster (i.e. the “Means” part of “K-Means”).

Answer 68

A

PCA: The world is made up of a bunch of Gaussians, so finding the most important ones is just a matter of finding the ones that are uncorrelated with one another.
ICA: The world consists of sources that are highly NON-Gaussian, but when added together as part of a linear combination they become Gaussian in the limit via the Central Limit Theorem.

Answer 69

A

That they are INDEPENDENT of one another (remember the microphone example)

Answer 70

A

Imagine a dodgeball team:

Using FFS, you’d start with the best player, and then continue adding best players until you reach some criterion.
Using BFS, you’d start by eliminating the worst player, and then continue eliminating until you’ve reached some criterion of minimum number of features.

Answer 71

A

Global, because of its Gaussian/maximal variance nature.

Answer 72

A

Median is a non-metric statistic, Mean IS metric. The reason this is important is that for metric statistics like the mean, the values of the data themselves matter a lot.

Answer 73

A

K-Means, because it’s an iterative process of assigning X to having been generated by some set of latent variables Z, and then moving the means based on those latent assignments, and then beginning the process all over.

CLUSTERING –> MAXIMIZATION
then repeat until convergence

Answer 74

A

The amount of information some variable X contains about another variable Y.

Answer 75

A

False. It ends up being the same thing if the cluster assignments use the argmax. (Have to watch the EM lecture videos to see the math for this)

Answer 76

A

False. Ceteris paribus, leaving a feature in does at least two things that hurt performance:

Time
Sample efficiency. This is because of the curse of dimensionality, which says that we need exponentially more data as we increase the number of features.

Answer 77

A

With wrapping the feature selection and learner are embedded into the same algorithm, and iteratively passed back and forth: the feature selection algo outputting a subset of features, the learner then takes that as input, scores it, and passes the score back to the feature selection algo and the process starts all over again.

Answer 78

A

At 0, no learning would happen. At 1.0, you would basically just be discarding the old value and moving straight to the new estimate. 0.5 would then essentially just be the average between your new and old estimates.
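
The three cases above can be checked with a sketch of the usual incremental update. The update form (1 - alpha) * old + alpha * target is the standard one but is an assumption here, since these notes do not spell it out; the numbers are made up.

```python
def update(old, target, alpha):
    """Move the old estimate toward the new target by a fraction alpha."""
    return (1 - alpha) * old + alpha * target

old, target = 2.0, 10.0   # made-up values
results = {alpha: update(old, target, alpha) for alpha in (0.0, 0.5, 1.0)}
print(results)
```

With alpha = 0 the estimate stays at the old value, with alpha = 1 it jumps to the new one, and with alpha = 0.5 it lands on their average.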

Answer 79

A

Sum of LR must be infinite, sum of squares of LR must be finite.

Answer 80

A

False. It does, but ONLY assuming all s, a pairs are visited infinitely often.

Answer 81

A

Describes the set of payoffs that can result from Nash Strategies in REPEATED games

Answer 82

A

Cooperation

Answer 83

A

Always best response independent of history.