Final Exam Flashcards
EM always converges? (T/F)
False. In practice it usually does, but since the configuration space is infinite the only guarantee we have is that it will not DIVERGE.
What is a WEAKLY RELEVANT feature?
- Not strongly relevant
- There exists a subset of features S such that adding x_i to S improves the Bayes Optimal Classifier (BOC)
Why is feature selection difficult?
Because from a theoretical perspective, we would have to score all the possible subsets. Given N features from which we want to choose the best subset of size M, there are on the order of 2^N subsets to evaluate. This means the feature selection problem is EXPONENTIAL, and it is known to be NP-HARD.
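A minimal sketch (not from the lectures) of why exhaustive search blows up; `score` is a hypothetical callable that evaluates one candidate subset, e.g. by training a learner or computing a filter criterion:

```python
from itertools import combinations

def exhaustive_feature_selection(features, score):
    """Score every non-empty subset of `features`: 2^N - 1 evaluations total.

    `score` is a hypothetical callable that evaluates a candidate subset.
    """
    best_subset, best_score = None, float("-inf")
    n = len(features)
    for k in range(1, n + 1):
        for subset in combinations(features, k):  # C(n, k) subsets of size k
            s = score(subset)                     # one evaluation per subset
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score                # total work ~ 2^n evaluations
```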
What is “Usefulness” (in the context of feature selection) roughly equivalent to?
Error GIVEN the model/learner. For example, consider a neural net: some feature x_i might be useful in the sense that an operation like y = w^T x lets the model make better (lower-error) predictions on average, yet x_i might still not be “relevant” to a Bayes Optimal Classifier.
PCA will never find independent components? (T/F)
False. Because of the mutual orthogonality and maximal variance constraints, there ARE cases where it will find independent components by virtue of trying to find components that are uncorrelated. However, uncorrelated does not necessarily mean statistically independent.
Where does domain knowledge get encoded in Filtering style feature selection algorithms?
In the criterion used for the search algorithm, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.
What is the clustering impossibility theorem?
It’s the result that no clustering algorithm can have all three of the following properties (they’re mutually incompatible, so you can get at most two of the three):
- Richness
- Scale Invariance
- Consistency
This was proved by Jon Kleinberg.
What are the components of an MDP?
SART; i.e. states, actions, rewards, transitions. (Dr. Isbell notes that some people include the discount factor gamma, which he disagrees with since he thinks it’s part of the definition of the problem.)
Policy iteration eliminates the “max” term in the Bellman equation? (True/False)
True. Since in policy iteration we’re already following a fixed policy, we no longer have to compute a max over all the actions, which gets rid of the nonlinear operation.
The reason this is useful is that once the max is gone, the Bellman equation becomes a system of n LINEAR equations in n unknowns, which we can solve directly with linear algebra (e.g. matrix inversion). Each such solve is MORE COMPUTATIONALLY EXPENSIVE (roughly n^3 for the inversion, and also a function of dimensionality) than a sweep of something like value iteration, but in theory policy iteration should converge to the optimal policy in FEWER ITERATIONS.
Another thing that follows from this is that when performing policy iteration, the jumps in utility will be bigger than in value iteration, because we’re iterating in policy space instead of value space.
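A minimal sketch of the policy-evaluation step under a fixed policy, assuming the transition matrix P_pi and reward vector R_pi induced by that policy are known; the dense linear solve is the roughly n^3 operation mentioned above:

```python
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma=0.9):
    """Exact policy evaluation for a fixed policy.

    P_pi : (n, n) transition matrix under the policy (row s -> next-state probs)
    R_pi : (n,)   expected immediate reward in each state under the policy
    Solves the linear Bellman system V = R_pi + gamma * P_pi @ V,
    i.e. (I - gamma * P_pi) V = R_pi, via a dense solve (~O(n^3)).
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```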
What is one case where PCA would return the same components as ICA?
If the data are Gaussian. For Gaussian data, uncorrelated already implies statistically independent, so the uncorrelated, maximal-variance components PCA finds (remember that PCA is trying to find the basis vectors that maximize variance!) are also the independent components ICA is looking for.
The configuration space for both K-Means and EM is finite? (T/F)
False. K-Means is finite because the cluster assignments are HARD. There are only so many ways we can assign n points to K clusters (K^n in fact), hence finite.
EM on the other hand works in probability space, so there’s an infinite number of possible configurations.
The key takeaway from this is that while K-MEANS IS GUARANTEED TO CONVERGE, THE ONLY THING GUARANTEED BY EM IS THAT IT WON'T DIVERGE!
What is the intercluster distance of two objects that are in the same cluster?
Really a trick question; the answer is zero (by definition, the intercluster distance between two points in the same cluster is taken to be 0).
What is the time complexity of SLC?
O(n^3)
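A naive sketch (assumed, not the lecture's implementation) of single linkage clustering that makes the roughly O(n^3) cost visible: each of the ~n merge steps scans every pair of clusters and the point pairs between them.

```python
import numpy as np

def single_linkage(X, k):
    """Naive single-linkage clustering down to k clusters (roughly O(n^3)).

    Repeatedly merges the two clusters whose closest pair of points is
    nearest (the "single link" / minimum inter-cluster distance).
    """
    clusters = [[i] for i in range(len(X))]
    # pairwise point distances, computed once
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()  # single link
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters
```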
What is the time complexity of each iteration of K-Means?
O(kn), or O(knd) if we’re considering multiple dimensions d
Describe filtering style feature selection?
Use some criterion for filtering, e.g. information gain, variance, entropy, “usefulness”, non-redundancy, etc.
Important thing to consider is that THE LEARNER IS NOT INCLUDED.
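A minimal sketch of the filtering idea using per-feature variance as the criterion (information gain, entropy, etc. would slot in the same way); note that no learner appears anywhere:

```python
import numpy as np

def filter_by_variance(X, m):
    """Filtering-style feature selection: rank features by a criterion
    computed without any learner (here, plain variance) and keep the top m.

    X : (n_samples, n_features) data matrix
    Returns the indices of the m highest-variance features.
    """
    scores = X.var(axis=0)               # criterion: per-feature variance
    return np.argsort(scores)[::-1][:m]  # top-m features by score
```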
How does the K-Means clustering algorithm work?
- Pick K “centers” (at random)
- Each center “claims” its closest points
- Recompute the centers by averaging the clustered points
- Repeat until convergence
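A minimal numpy sketch of those four steps (the distance computation in the assignment step is what gives the O(knd) per-iteration cost from the earlier card):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Basic K-Means: random centers, assign, recompute, repeat.

    X : (n, d) data matrix; returns (centers, labels).
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # pick K "centers" at random
    for _ in range(n_iters):
        # each center "claims" its closest points: O(knd) distance computation
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute the centers by averaging the clustered points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # repeat until convergence
            break
        centers = new_centers
    return centers, labels
```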
Which part of the EM algorithm is usually the most computationally expensive?
The E (expectation) step. This is because we’re doing some sort of inference stuff there, e.g. Bayes nets, etc, whereas the M (maximization) step typically just involves counting things.
The lectures mention this isn’t ALWAYS true, but in general it’s a good heuristic.
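A small sketch (a 1-D Gaussian mixture, chosen for brevity) of one EM iteration, just to show where the work lives: the E step does the probabilistic inference (responsibilities), while the M step mostly re-counts and re-averages.

```python
import numpy as np

def em_step(x, means, vars_, weights):
    """One EM iteration for a 1-D Gaussian mixture.

    x : (n,) data; means, vars_, weights : (k,) current parameters.
    """
    # E step: r[i, j] proportional to weight_j * N(x_i | mean_j, var_j)
    dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
    r = weights * dens
    r /= r.sum(axis=1, keepdims=True)       # responsibilities (soft assignments)
    # M step: soft counts, weighted means and variances
    Nk = r.sum(axis=0)
    weights = Nk / len(x)
    means = (r * x[:, None]).sum(axis=0) / Nk
    vars_ = (r * (x[:, None] - means) ** 2).sum(axis=0) / Nk
    return means, vars_, weights
```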
What are some of the advantages and disadvantages of the Filtering style of feature selection algorithms?
- Pro: Speed. Filtering doesn’t involve the learner, so feature selection is fast.
- Con: Isolated features. Maybe a feature on its own doesn’t seem important, but combined with another one it is; since filtering doesn’t include the learner, there’s no way of knowing this.
- More generally, the major con is that it ignores the actual learning problem.
What is the intuition behind K-Means?
It’s an iterative process of randomly partitioning a dataset, finding the centroids of each of those partitions, then going back to step one and repartitioning based on the new centroid locations.
This is important because it’s what allows us to think about K-Means in terms of OPTIMIZATION (specifically hill climbing).
How is PCA in some sense a method for determining correlation?
Because it maximizes variance in the data along mutually orthogonal directions; concretely, the principal components are the eigenvectors of the covariance matrix, so PCA is built directly on the data’s correlation structure.
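A short sketch of PCA as an eigendecomposition of the covariance matrix, which is one way to see the correlation connection:

```python
import numpy as np

def pca(X, m):
    """PCA as an eigendecomposition of the covariance matrix.

    Returns the top-m principal directions (maximal-variance, mutually
    orthogonal) and the data projected onto them.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance / correlation structure
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1][:m]   # largest variance first
    components = eigvecs[:, order]
    return components, Xc @ components
```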
Would ICA be better at finding global or local features?
Local. The lecture videos bring up an interesting point about how, on natural scenes, ICA actually recovers EDGES.
What are the two broad categories of Feature Selection algorithms?
Filtering and Wrapping
What are some other types of hierarchical agglomerative cluster models besides SLC?
Maximum (complete) linkage, average linkage, and median linkage.
What is the main advantage of Random Component Analysis?
It’s very fast.
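A minimal sketch of Random Component Analysis as a random projection; there is nothing to optimize or decompose, hence the speed:

```python
import numpy as np

def random_projection(X, m, seed=0):
    """Random Component Analysis / random projection: project the data
    onto m random directions. No eigendecomposition, so it's very fast.
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((X.shape[1], m))  # random projection matrix
    return X @ P
```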
