Final Exam Flashcards

1
Q

EM always converges? (T/F)

A

False. In practice it usually does, but since the configuration space is infinite the only guarantee we have is that it will not DIVERGE.

2
Q

What is a WEAKLY RELEVANT feature?

A
  1. Not strongly relevant
  2. There exists a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier (BOC)
3
Q

Why is feature selection difficult?

A

Because from a theoretical perspective, we would have to score all the possible subsets. Given N features from which we want to select a smaller subset of M features, this requires evaluating 2^N subsets. This means the feature selection problem is EXPONENTIAL, and is known to be NP-HARD.

4
Q

What is “Usefulness” (in the context of feature selection) roughly equivalent to?

A

Error GIVEN the model/learner (for example, consider a neural net; some feature x_i might be useful in the sense that performing an operation like y = w.T * x enables the model to achieve better predictions, i.e. lower-error predictions, on average, but might still not be “relevant” to the Bayes Optimal Classifier)

5
Q

PCA will never find independent components? (T/F)

A

False. Because of the mutual orthogonality and maximal variance constraints, there ARE cases where it will find independent components, by virtue of trying to find components that are uncorrelated. However, uncorrelated does not necessarily mean statistically independent.

6
Q

Where does domain knowledge get encoded in Filtering style feature selection algorithms?

A

In the criterion used for the search algorithm, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.

7
Q

What is the clustering impossibility theorem?

A

It’s the result that no clustering algorithm can have all three of the following properties at once (as a set they are mutually exclusive, so you can at best get 2 of the 3):

  1. Richness
  2. Scale Invariance
  3. Consistency

This was proved by Jon Kleinberg.

8
Q

What are the components of an MDP?

A

SART; i.e. states, actions, rewards, transitions. (Dr. Isbell notes that some people include the discount factor gamma, which he disagrees with since he thinks it’s part of the definition of the problem.)

9
Q

Policy iteration eliminates the “max” term in the Bellman equation? (True/False)

A

True. Since in policy iteration we’re already following the policy, we no longer have to compute over all the actions, so that gets rid of the nonlinear max operation.

The reason this is useful is that once the nonlinearity is gone, the Bellman equation becomes a system of n linear equations in n unknowns, which we can solve directly with things like matrix inversion. Each iteration is MORE COMPUTATIONALLY EXPENSIVE (roughly n^3 for the inversion, and also a function of dimensionality) than something like value iteration, but in theory it should converge to the optimal policy in FEWER ITERATIONS.

Another thing that follows from this is that when performing policy iteration, the jumps in utility will be bigger than in value iteration, because we’re iterating in policy space instead of value space.
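For intuition, here is a minimal sketch (assuming numpy and a small, made-up 3-state MDP with the policy already fixed) of how dropping the max turns policy evaluation into an n-by-n linear solve:

```python
import numpy as np

# Hypothetical 3-state MDP with a FIXED policy baked in:
# P_pi[s, s'] = transition probability under the policy, R_pi[s] = expected reward.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
R_pi = np.array([0.0, 1.0, 10.0])
gamma = 0.95

# Without the max, the Bellman equation is linear: V = R_pi + gamma * P_pi @ V.
# Rearranging gives (I - gamma * P_pi) V = R_pi, i.e. n equations in n unknowns.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
print(V)
```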

10
Q

What is one case where PCA would return the same components as ICA?

A

If the data are Gaussian. This is because the distribution that maximizes variance IS the normal distribution (remember that PCA is trying to find the basis vectors that maximize variance!)

11
Q

The configuration space for both K-Means and EM is finite? (T/F)

A

False. K-Means is finite because the cluster assignments are HARD. There are only so many ways we can assign n points to K clusters (K^n in fact), hence finite.

EM on the other hand works in probability space, so there’s an infinite number of possible configurations.

The key takeaway from this is that while K-MEANS IS GUARANTEED TO CONVERGE, THE ONLY THING GUARANTEED BY EM IS THAT IT WON'T DIVERGE!

12
Q

What is the intercluster distance of two objects that are in the same cluster?

A

Really a trick question. Answer is zero

13
Q

What is the time complexity of SLC?

A

O(n^3)

14
Q

What is the time complexity of each iteration of K-Means?

A

O(kn), or O(knd) if we’re considering multiple dimensions d

15
Q

Describe filtering style feature selection?

A

Use some criterion for filtering, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.

Important thing to consider is that THE LEARNER IS NOT INCLUDED.

16
Q

How does the K-Means clustering algorithm work?

A
  1. Pick K “centers” (at random)
  2. Each center “claims” its closest points
  3. Recompute the centers by averaging the clustered points
  4. Repeat until convergence (a minimal sketch follows below)
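A minimal sketch of those four steps, assuming numpy and Euclidean distance (toy code for illustration, not an optimized implementation; it also assumes no cluster ever loses all of its points):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick K "centers" at random (here: K distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Each center "claims" its closest points
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # 3. Recompute the centers by averaging the clustered points
        new_centers = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until convergence (i.e. the centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments
```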
17
Q

Which part of the EM algorithm is usually the most computationally expensive?

A

The E (expectation) step. This is because we’re doing some sort of inference stuff there, e.g. Bayes nets, etc, whereas the M (maximization) step typically just involves counting things.

The lectures mention this isn’t ALWAYS true, but in general it’s a good heuristic.

18
Q

What are some of the advantages and disadvantages of the Filtering style of feature selection algorithms?

A
  1. Pro - Speed: filtering doesn’t involve the learner, so feature selection is fast.
  2. Con - Isolated features: maybe a feature on its own doesn’t seem important, but when combined with another one it is. Since filtering doesn’t include the learner, there’s no way of knowing this.
  3. Con - More generally, the major drawback is that filtering ignores the actual learning problem.
19
Q

What is the intuition behind K-Means?

A

It’s an iterative process of randomly partitioning a dataset, then finding the centroids of each of those partitions, then going back to step one and repartitioning based on the new centroid locations.

This is important because it’s what allows us to think about K-Means in terms of OPTIMIZATION (specifically hill climbing).

20
Q

How is PCA in some sense a method for determining correlation?

A

Because it is maximizing variance in the data.

21
Q

Would ICA be better at finding global or local features?

A

Local. The lecture videos bring up an interesting point how in natural scenes ICA actually recovers EDGES.

22
Q

What are the two broad categories of Feature Selection algorithms?

A

Filtering and Wrapping

23
Q

What are some other types of hierarchical agglomerative cluster models besides SLC?

A

Max (complete link), average, and median linkage.

24
Q

What is the main advantage of Random Component Analysis?

A

It’s very fast.

25
Q

Using entropy as a criterion for feature selection is difficult because you need to know the labels of the data? (T/F)

A

False. This is one of the big advantages of criteria like entropy over things like information gain, which does require labels.

26
Q

What are some filtering criteria you could use for feature selection?

A
  1. Information gain
  2. Variance, entropy
  3. “Useful” features (e.g. maybe train a neural net and prune away features where the weights are zero)
  4. Independent/non-redundant
27
Q

What learning algorithm could you use as a Filtering style feature selector? Describe how it could be used.

A

One example would be a decision tree. You could use its feature importances to select a subset of the most important features, and then pass that subset on to another learner. The advantage of doing this is that different learning algorithms have different inductive biases, so there may be some advantage to doing feature selection with one algorithm but the actual learning with a different one (e.g. maybe you need a learner that’s more robust to noisy data, one that has faster inference times, one that fits within some memory/CPU requirements, etc.).
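A hedged sketch of that idea with scikit-learn (the dataset, the importance threshold, and the downstream k-NN learner are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter step: fit a decision tree and keep only the more important features.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
selector = SelectFromModel(tree, threshold="median", prefit=True)
X_reduced = selector.transform(X)

# Pass the reduced feature set to a DIFFERENT learner with its own inductive bias.
knn = KNeighborsClassifier().fit(X_reduced, y)
```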

28
Q

Given a set of points, what is the maximum likelihood estimate for the mean of a Gaussian centered around that dataset?

A

It’s just the mean of the data, for the same reasons we described in K-Means: it can be shown using calculus that the mean is the best way of minimizing error in the least-squares sense.
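A one-line version of that calculus argument (standard result, shown here for reference): maximizing the Gaussian likelihood is the same as minimizing the squared error, and setting its derivative to zero yields the sample mean.

```latex
\frac{\partial}{\partial \mu} \sum_{i=1}^{n} (x_i - \mu)^2
  = -2 \sum_{i=1}^{n} (x_i - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i
```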

29
Q

Unlike K-Means, EM can never get stuck in local optima? (T/F)

A

False, and for the same reason as K-Means. We’re still relying on random initialization of clusters, so if we get a bad initial assignment, it is still possible for EM to get stuck.

30
Q

The reward we get for being in some state is the same thing as the utility for that state? (True/False)

A

False. Reward is all about IMMEDIATE gratification. But the utility considers “how good is it to be in that state”, which is a function of the (discounted) sum of the future rewards that I will get (in expectation) for starting in that state and following some policy pi. So utility is about LONG TERM REWARD.
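In symbols (standard definition, included for reference):

```latex
U^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s,\ \pi \right]
```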

31
Q

What is the Bayes Optimal Classifier?

A

It’s the idea of “what is the best I could do” on a given learning problem. It’s not any particular algorithm; rather, it is the best I could achieve given ALL possible algorithms.

32
Q

Policy iteration is guaranteed to converge to the optimal policy? (True/False)

A

True. (Dr. Isbell mentions that the argument for why this is the case is similar to the argument for why K-Means works.)

33
Q

Which random optimization algorithm is like K-Means? Why is that?

A

Hill climbing. This is because of the iterative process that maximizes the score (i.e. minimizes the error) as the centroids of the partitions are moved around.

34
Q

Describe wrapping style feature selection?

A

Iterative process of searching for features (typically using some sort of randomized optimization algorithm) and then passing that feature subset to the learner to score it, and then repeating the process.

35
Q

What are some ways of performing a “wrapping” style feature selection algorithm?

A

  1. Hill Climbing
  2. Randomized Optimization
  3. Forward selection
36
Q

What is one way of dealing with the problem of getting stuck in local optima with K-Means or EM?

A

Random restarts (just like hill climbing)

37
Q

How difficult is the feature selection problem?

A

Given N features, it is 2^N

38
Q

What are some of the important properties of the EM algorithm?

A
  1. It is monotonically non-decreasing in terms of likelihood
  2. Not guaranteed to converge (although it usually does in practice)
  3. Will not diverge
  4. Can also get stuck (just like K-Means)
  5. Works with any probability distribution (if E, M are solvable)
39
Q

How does Single Linkage Clustering work?

A
  1. Consider each object (i.e. data point) as its own cluster (so we start with n clusters)
  2. Define the intercluster distance as the distance between the CLOSEST two points in the two clusters. (So starting out, the intercluster distance is just the interobject distance.)
  3. Merge the two closest clusters
  4. Repeat n-k times to end up with k clusters (see the sketch below).
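A naive sketch of these steps, assuming numpy (illustration only; real implementations such as scipy's `linkage` are much more efficient):

```python
import numpy as np

def single_linkage(X, k):
    # 1. Each object starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # 4. Merge n - k times, leaving k clusters
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # 2. Intercluster distance = distance between the CLOSEST two points
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        # 3. Merge the two closest clusters
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```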
40
Q

How many iterations are possible in K-Means?

A

Finite (exponential) iterations: O(K^n)

41
Q

Describe the Soft Clustering algorithm?

A

**Assume the data were generated by normal distributions for this example**

  1. Select one of K gaussians (with a fixed known variance) uniformly
  2. Sample x_i from that Gaussian
  3. Repeat n times

Task: Find a hypothesis h = {mu_1, ..., mu_K} (the K means) that maximizes the probability of the data (in terms of maximum likelihood)
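A minimal EM sketch for this setup, assuming numpy, 1-D data, K Gaussians with a fixed known variance, and a uniform prior over the Gaussians (illustration only):

```python
import numpy as np

def em_soft_clustering(x, k, sigma=1.0, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # initial guesses for the K means
    for _ in range(n_iters):
        # E-step: soft assignment P(z_j | x_i) of each point to each Gaussian.
        # The uniform prior over the K Gaussians cancels in the normalization.
        lik = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as a responsibility-weighted average.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```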

42
Q

What is one of the main differences between PCA and ICA?

A

PCA ~= Correlation
ICA ~= Independence

43
Q

Why is the prior not included in the Expectation Maximization algorithm?

A

Because EM uses the maximum likelihood estimate, i.e. we assume a uniform prior, so it can just be dropped from the calculation in the EM update.

44
Q

What are 3 reasons that feature selection is important?

A
  1. Knowledge discovery
  2. Interpretability and insight
  3. Curse of dimensionality (amount of data we need grows exponentially as we increase the features)
45
Q

What is the idea of “consistency” that Dr. Littman talks about in regards to clustering?

A

It’s the idea that if we make the points within a cluster more similar to each other, or make the points in two different clusters more dissimilar, we shouldn’t get a different grouping.

46
Q

What is the Bellman Equation?

A
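The standard form: U(s) = R(s) + gamma * max_a SUM_s' T(s, a, s') * U(s'), where T(s, a, s') is the transition probability and gamma is the discount factor. In words: the utility of a state is its immediate reward plus the discounted utility obtained by acting optimally from there on. (Policy iteration, in the earlier card, is what lets us drop the max.)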
47
Q

If a PCA component has an eigenvalue of 0, what should we do with that component? Why?

A

Throw it away because it contains no useful information.

48
Q

Single link clustering terminates fast?

A

True (mentioned in summary at end of lectures).

49
Q

In terms of feature SELECTION, what is one feature TRANSFORMATION algorithm that is similar to filtering? Why?

A

PCA. It’s similar because once we find the new set of basis vectors aligned in the directions of maximal variance, we can throw away (i.e. filter out) the components that don’t contribute to the variance.

50
Q

If you have a strongly relevant feature, and you make a copy of it, then both features will be strongly relevant? (T/F)

A

False. By making a copy of it, you’ve made the feature redundant, so now the features are only weakly relevant.

51
Q

What are some properties of PCA?

A
  1. Maximizes variance
  2. Orthogonal
  3. Global algorithm (i.e. each of the PCs are orthogonal to one another)
  4. Best reconstruction (minimizes L2 error moving from N to M dimensions)
  5. Well studied, so lots of fast algorithms exist for it.
52
Q

How does Forward Feature Selection work?

A

Start with each single feature, score it, and keep whichever feature is best. Then try adding each remaining feature alongside it, score again, and keep the best combination. Keep going until you meet some stopping criterion (e.g. the score no longer improves by some minimum amount). A sketch follows below.
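A hedged sketch of forward selection, assuming scikit-learn, cross-validated accuracy as the score, and "stop when the score no longer improves" as the stopping criterion (all arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
learner = LogisticRegression(max_iter=5000)

selected, best_score = [], -np.inf
remaining = list(range(X.shape[1]))
while remaining:
    # Score every candidate set = (features kept so far) + one more feature
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:   # stop once adding a feature no longer helps
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print(selected, best_score)
```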

53
Q

K-Means does not struggle with getting stuck in local optima? (T/F)

A

False. It’s definitely possible to get stuck if the initial random cluster assignment is bad.

54
Q

Give an example of a PCA component that would have an eigenvalue of zero. What should we do with that component?

A

A PCA component with an eigenvalue of zero is irrelevant; it contains no useful information. You could think of points lying exactly on the line y=1: there is zero variance (and zero entropy) along that direction, hence it is not relevant (although it could possibly be useful to something like a NN classifier in terms of prediction error). Therefore, if the eigenvalue is zero, we should just get rid of the component.

55
Q

A feature x_i that is neither strongly relevant nor weakly relevant it is ________.

A

Irrelevant

56
Q

At a high level, what is PCA doing?

A

It’s lining up a new set of orthogonal basis vectors that are aligned with the direction of the maximal variance of the data.

Another way of looking at it is that it is trying to find all the orthogonal gaussians in a dataset (because a Gaussian distribution is the distribution that maximizes variance!)
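A minimal numpy sketch of that view, computing the components as eigenvectors of the covariance matrix sorted by variance (illustration only; library implementations typically use the SVD instead):

```python
import numpy as np

def pca(X, m):
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by variance, largest first
    components = eigvecs[:, order[:m]]         # the new orthogonal basis vectors
    return Xc @ components, eigvals[order]     # projected data, variances
```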

57
Q

The error in K-Means can go up?

A

False. It is monotonically non-increasing in error. Things can only be reassigned to a new cluster if the error goes DOWN. This is because the average is the best way of describing the cluster in a least squares sense.

Note that this does require that ties are broken consistently, but as long as that is true, THE ERROR NEVER INCREASES FOR K-MEANS.

58
Q

What are some of the advantages and disadvantages of the Wrapping style of feature selection algorithms?

A

Major Pro: Takes into account the MODEL BIAS
Major Pro: Actually considers the LEARNING PROBLEM
Major Con: VERY Slow

59
Q

What’s one way of connecting RL to some of the earlier lectures in the course?

A

The idea of rewards taking the place of teachers. Basically the reward acts as the “teaching signal” in place of an explicit teacher telling you what is good or bad.

You can also think of the rewards as taking the place of domain knowledge.

60
Q

How many configurations are possible in K-Means?

A

K^n

61
Q

Single linkage clustering is a non-deterministic algorithm? (T/F). Justify your reason.

A

False. The distance between objects doesn’t change, so it’s deterministic.

62
Q

What are three properties that are desirable for clustering algorithms?

A
  1. Richness: Ability to assign any inputs to any clustering
  2. Scale Invariance: scaling distances by a positive value does not change the clustering
  3. Consistency: shrinking INTRA-cluster distances and expanding INTER-cluster distances does not change the clustering.
63
Q

What is the computational complexity of choosing the best M features from a set of N features?

A

O(2^N). It is known to be an NP-HARD problem.

64
Q

What is “Relevance” (in the context of feature selection) roughly equivalent to?

A

Information. This is because it’s all about the Bayes Optimal Classifier. Relevant things are THINGS THAT GIVE US INFORMATION.

65
Q

The number of configurations in K-Means is infinite? (T/F)

A

False.

66
Q

Reward and utility are equivalent? (T/F)

A

False. Reward is all about immediate gratification, whereas utility is all about long-term (i.e. delayed rewards).

67
Q

What is the Filtering style of algorithm for feature selection?

A

With filtering, the scoring of the subset of features occurs INSIDE the feature selection algorithm, and then once the best subset is found it is passed on to the learner.

68
Q

What is a STRONGLY RELEVANT feature?

A

x_i is strongly relevant if removing it degrades Bayes Optimal Classifier (BOC), i.e. if without that feature you couldn’t achieve a Bayes Optimal Classifier

69
Q

What is the difference between Relevance and Usefulness?

A
  1. RELEVANCE measures effect on BAYES OPTIMAL CLASSIFIER
  2. USEFULNESS measures effect on a PARTICULAR predictor
70
Q

What is one way of thinking about what unsupervised learning is?

A

Compact description

71
Q

The “center” used in K-Means is always one of the datapoints?

A

False. The centroid need not actually be in the set of data points. It’s just describing the central tendency of the cluster (i.e. the “Means” part of K-Means).

72
Q

What are the two different fundamental epistemological assumptions that PCA and ICA make?

A
  1. PCA: That the world is made up of a bunch of Gaussians, so finding the most important components is just a matter of finding the ones that are uncorrelated with one another.
  2. ICA: That the world consists of sources that are highly NON-Gaussian, but when added together as part of a linear combination they become Gaussian in the limit via the Central Limit Theorem.
73
Q

What is one of the key assumptions of the sources of ICA?

A

That they are INDEPENDENT of one another (remember the microphone example)

74
Q

What analogy does Dr. Littman use that is a good summary of Forward/Backward Feature Selection algorithms?

A

Imagine a dodgeball team:

  1. Using FFS, you’d start with the best player, and then continue adding best players until you reach some criterion.
  2. Using BFS, you’d start by eliminating the worst player, and then continue eliminating until you’ve reached some stopping criterion (e.g. a minimum number of players/features).
75
Q

Would PCA be better at finding global or local features?

A

Global, because of its Gaussian/maximal-variance nature.

76
Q

What are examples of metric and non-metric statistics?

A

The median is a non-metric statistic; the mean IS a metric statistic. The reason this is important is that for metric statistics like the mean, the actual values of the data matter a lot (not just their relative ordering).

77
Q

What algorithm is Expectation Maximization similar to?

A

K-Means, because it’s an iterative process of assigning X to having been generated by some set of latent variables Z, and then moving the means based on those latent assignments, and then beginning the process all over.

CLUSTERING -> MAXIMIZATION
then repeat until convergence

78
Q

What is mutual information?

A

The amount of information that one variable X contains about another variable Y.

79
Q

The Expectation Maximization algorithm is close to, but can never quite be the same as K-Means? (T/F)

A

False. It ends up being the same thing if the cluster assignments use the argmax. (Have to watch the EM lecture videos to see the math for this)

80
Q

When doing feature selection, even if a feature doesn’t seem to be useful for learning, we might as well leave it in there. It isn’t harming anything, right? (T/F)

A

False. Ceteris paribus, leaving a feature in hurts in at least two ways:

  1. Time (more features means more computation)
  2. Sample efficiency. This is because of the curse of dimensionality, which says that we need exponentially more data as we increase the number of features.
81
Q

What is the Wrapping style of algorithm for feature selection?

A

With wrapping, the feature selection and the learner are embedded in the same algorithm and iterate back and forth: the feature selection algorithm outputs a subset of features, the learner takes that subset as input, scores it, and passes the score back to the feature selection algorithm, and the process starts all over again.

82
Q

In Q-Learning, what would happen if you set the learning rate to 0? To 0.5? To 1.0?

A

At 0, no learning would happen. At 1.0, you would basically just be discarding the old value and moving straight to the new estimate. 0.5 would then essentially just be the average between your new and old estimates.
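For reference, this is the update that the learning rate alpha is blending (the standard Q-learning rule, written in the deck's plain-text style):

Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * [r + gamma * max_a' Q(s', a')]

With alpha = 0 the old estimate is kept unchanged; with alpha = 1 it is replaced outright by the new sample; alpha = 0.5 averages the two.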

83
Q

What requirement is made on the learning rate in Q-Learning?

A

The sum of the learning rates must be infinite, and the sum of the squares of the learning rates must be finite (i.e. sum alpha_t = infinity, sum alpha_t^2 < infinity).

84
Q

Q-Learning does not converge to the optimal Q function? (T/F)

A

False. It does, but ONLY assuming all s, a pairs are visited infinitely often.

85
Q

What is a “Folk Theorem”?

A

Describes the set of payoffs that can result from Nash Strategies in REPEATED games

86
Q

In repeated games, the possibility of retaliation opens the door for _____?

A

Cooperation

87
Q

What is subgame perfect?

A

A strategy is subgame perfect if it is always a best response, independent of history.