Final Exam Flashcards
What is a WEAKLY RELEVANT feature
- Not strongly relevant
- Exists a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier (BOC)
Why is feature selection difficult?
Because from a theoretical perspective, we would have to score all the possible subsets. Given N features where we desire to downsize to M features, this requires 2^n runs. This means the feature selection problem is EXPONENTIAL, and is known to be NP-HARD.
EM always converges? (T/F)
False. In practice it usually does, but since the configuration space is infinite the only guarantee we have is that it will not DIVERGE.
What is “Usefulness” (in the context of feature selection) roughly equivalent to?
Error GIVEN the model/learner (for example, consider a neural net; some feature x_i might be useful in the sense that performing an operating like y=w.T*x enables the model to achieve better predictions, i.e. lower error predictions, on average, but might still not be “relevant” to a Bayesian Optimal Classifier)
PCA will never find independent components? (T/F)
False. Because of the mutual orthogonality and maximal variance constraints, there ARE cases where it will find independent components by virtue of trying to find components that are uncorrelated. However, uncorrelated != statistical independence necessarily.
Where does domain knowledge get encoded in Filtering style feature selection algorithms?
In the criterion used for the search algorithm, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.
What is the clustering impossibility theorem?
It’s the reality that we can have a clustering algorithm that all have (because they’re mutually exclusive, you can at best get 2/3 of them):
- Richness
- Scale Invariance
- Consistency
This was proved by John Kleinberg
What are the components of an MDP?
SART; i.e. states, actions, rewards, transitions. (Dr. Isbell notes that some people include the discount factor gamma, which he disagrees with since he thinks it’s part of the definition of the problem.)
Policy iteration eliminates the “max” term in the Bellman equation? (True/False)
True. Since in policy iteration we’re already following the policy, we no longer have to compute over all the actions, so that get’s rid of the nonlinear function.
The reason this is useful is because once we get rid of the nonlinear operation, we can use things like matrix inversion, etc. to solve the n equations and n unknowns in the Bellman equation in linear time. This is MORE COMPUTATIONALLY EXPENSIVE (roughly n^3 for inversions, also is a function of dimensionality) than something like say value iteration, but in theory it should converge to the optimal policy in FEWER ITERATIONS.
Another thing that follows from this is that when performing policy iteration, the jumps in utility will be bigger than in value iteration because we’re iterating policy space instead of value space.
What is one case where PCA would return the same components as ICA?
If the data are Gaussian. This is because the distribution that maximizes variance IS the normal distribution (remember that PCA is trying to find the basis vectors that maximize variance!)
The configuration space for both K-Means and EM is finite? (T/F)
False. K-Means is finite because the cluster assignments are HARD. There are only so many ways we can assign n points to K clusters (K^n in fact), hence finite.
EM on the other hand works in probability space, so there’s an infinite number of possible configurations.
The key takeaway from this is that while K-MEANS IS GUARANTEED TO CONVERGE, THE ONLY THING GUARANTEED BY EM IS THAT IT WONT DIVERGE!
What is the intercluster distance of two objects that are in the same cluster?
Really a trick question. Answer is zero
What is the time complexity of SLC?
O(n^3)
What is the time complexity each iteration of K-Means?
O(kn), or O(knd) if we’re considering multiple dimensions d
Describe filtering style feature selection?
Use some criterion for filtering, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.
Important thing to consider is that THE LEARNER IS NOT INCLUDED.
How does K-Means clustering algo work
- Pick K “centers” (at random)
- Each center “claims” its closest points
- Recompute the centers by averaging the clustered points
- Repeat until convergence
Which part of the EM algorithm is usually the most computationally expensive?
The E (expectation) step. This is because we’re doing some sort of inference stuff there, e.g. Bayes nets, etc, whereas the M (maximization) step typically just involves counting things.
The lectures mention this isn’t ALWAYS true, but in general it’s a good heuristic.
What are some of the advantages and disadvantages of the Filtering style of feature selection algorithms?
Pro:
Speed. Doesn’t involve the learner, so FS is fast
Con:
Isolated features - maybe a feature on its own doesn’t seem important, but when combined with another one it is. Since filtering doesn’t include the learner, there is no way of knowing this.
More generally, the major con is that it ignores the actual learning problem.
What is the intuition behind K-Means?
It’s an interative process of randomly partitioning a dataset, and then finding the centroids of each of those partitions, then going back to step one and repartitioning based on the new centroid locations.
This is important because it’s what allows us to think about K-Means in terms of OPTIMIZATION (specifically hill climbing).
How is PCA in some sense a method for determining correlation?
Because it is maximizing variance in the data.
Would ICA be better at finding global or local features?
Local. The lecture videos bring up an interesting point how in natural scenes ICA actually recovers EDGES.
What are the two broad categories of Feature Selection algorithms?
Filtering and Wrapping
What are some other types of hierarchical agglomerative cluster models besides SLC?
max, average, median
What is the main advantage of Random Component Analysis?
It’s very fast.
Using entropy as a criterion for feature selection is difficult because you need to know the labels of the data? (T/F)
False. This is one of the big advantages of criterion like entropy over things like information gain, which does require labels.
What are some filtering criteria you could use for feature selection?
- Information gain
- Variance, entropy
- “Useful” features (e.g. maybe train a neural net and prune away features where the weights are zero)
- Independent/non-redundant
What learning algorithm could you use as a Filtering style feature selector? Describe how it could be used.
One example would be a decision tree. You could use the feature importances to select a subset of the most important features, and then pass that onto another learner. The advantage of doing this is different learning algorithms suffer from different types of inductive biases, so there may be some advantage to doing feature selection using one algorithm but a different algo for that actual learning (e.g maybe you need a learner that’s more robust to noise data, one that has faster inference times, one that fits within some memory/cpu requirements, etc.).
Given a set of points, what is the maximum likelihood estimate for the mean of a Gaussian centered around that dataset?
It’s just the mean of the data, for same reasons we described in K-Means, proved using calculus that mean is best way of minimizing error in least squares sense.
Unlike K-Means, EM can never get stuck in local optima? (T/F)
False, and for the same reason as K-Means. We’re still relying on random initialization of clusters, so if we get a bad initial assignment, it is still possible for EM to get stuck.
The reward we get for being in some state is the same thing as the utility for that state? (True/False)
False. Reward is all about IMMEDIATE gratification. But the utility considers “how good is it to be in that state”, which is a function of the (discounted) sum of the future rewards that I will get (in expectation) for starting in that state and following some policy pi. So utility is about LONG TERM REWARD.
What is the Bayes Optimal Classifier
It’s the idea of “what is the best I could do” on a given learning problem. It’s not any particular algorithm, it’s given ALL algorithms, what is the best I could achieve.
Policy iteration is guaranteed to converge to the optimal policy? (True/False)
True. (Dr. Isbell mentions that the argument for why this is the case is similar to the argument for why K-Means works.). The policy iteration algorithm converges within fewer iterations. As a result, the policy iteration is reported to conclude faster than the value iteration algorithm.
Which random optimization algorithm is like K-Means? Why is that?
Hill climbing. This is because of the iterative process that maximizes the score (i.e. minimizes the error) as the centroids of the partitions are moved around.
Describe wrapping style feature selection?
Iterative process of searching for features (typically using some sort of randomized optimization algorithm) and then passing that feature subset to the learner to score it, and then repeating the process.