Final Prep Flashcards
Clustering Onwards
How does single linkage clustering work? (13.4)
- Consider each object (sample) a cluster (n objects)
- define intercluster distance as distance b/t the closest two points in clusters with different labels
- merge two closest clusters
- repeat n-k times to make k clusters
If two points are in the same cluster, the intercluster distance is 0.
You end up creating a tree structure with the root being everything in a single cluster.
How you define distance is one way of inserting domain knowledge.
This typically leads to chain like structures.
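A minimal Python sketch of that merge loop (the function and variable names are my own, and points is assumed to be an (n, d) numpy array):

```python
import numpy as np

def single_linkage(points, k):
    """Naive single linkage clustering: start with n singleton clusters and
    repeatedly merge the two clusters whose closest pair of points is nearest,
    until only k clusters remain. Returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]       # every point starts as its own cluster
    while len(clusters) > k:                           # n - k merges in total
        best_i, best_j, best_d = None, None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # intercluster distance = distance between the closest two points
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        clusters[best_i] += clusters[best_j]           # merge the two closest clusters
        del clusters[best_j]
    return clusters
```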
When might median be a good distance metric in single linkage clustering? (13.4)
When the value of the distance itself doesn’t really matter, but the order is what is most important.
What is a metric vs non-metric statistic? (13.4)
Metric: the specific values matter a lot (average, the details of the number matter)
Non-Metric: the specific values don’t matter, but rank order does (eg median)
What’s the running time of Single Linkage Clustering? (13.5)
O(n^3) because:
you repeat the merge step roughly n times (n - k merges), and each merge requires finding the closest pair of points in different clusters, which is an O(n^2) scan over all pairs.
How does K-Means clustering work? (13.7)
- Pick K rand points to be the centers of the k clusters
- Each center “claims” its closest points
- Recompute centers by averaging clustered points (new center doesn’t have to be a point!)
- Repeat until clusters no longer change
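A minimal numpy sketch of those four steps (X is assumed to be an (n, d) array; the names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """Pick k random points as centers, let each center claim its closest
    points, recompute centers as cluster means, repeat until the assignment
    stops changing."""
    centers = X[rng.choice(len(X), size=k, replace=False)]    # k random points as initial centers
    assignment = None
    while True:
        # each center "claims" its closest points
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):         # clusters no longer change
            return centers, new_assignment
        assignment = new_assignment
        # recompute each center as the average of the points it claimed
        # (the new center doesn't have to be one of the data points)
        centers = np.array([X[assignment == i].mean(axis=0) if np.any(assignment == i)
                            else centers[i]                    # keep a center that claimed nothing
                            for i in range(k)])
```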
Why does K-means clustering converge? (13.11)
The error is monotonically non-increasing.
When updating the partition:
•P(x) = argmin_i || x - center_i ||^2
• You’re assigning x to the cluster whose center is closest to it
• You only move a point from one cluster to another if that distance (error) gets smaller, so the error can only stay the same or decrease
When updating the center of each partition (without changing the partition):
• center_i = (1/|C_i|) * sum_{y in C_i} y
•Average minimizes the squared error (take derivative of E and set it equal to 0)
For this to work you must have some way of breaking ties that is deterministic and consistent.
You will never revisit a configuration you’ve already looked at.
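The "average minimizes the squared error" step written out in the card's notation:

```latex
E = \sum_{y \in C_i} \lVert y - \text{center}_i \rVert^2,
\qquad
\frac{\partial E}{\partial\, \text{center}_i}
  = -2 \sum_{y \in C_i} \left( y - \text{center}_i \right) = 0
\;\Rightarrow\;
\text{center}_i = \frac{1}{|C_i|} \sum_{y \in C_i} y
```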
What randomized optimization technique is most like K-means clustering? (13.10/11)
Hill Climbing - each time you set P(x) for all points and then readjust the center, you are always either decreasing the error or the error doesn’t change. So always improvement or no improvement, but never de-provement.
What are the properties of k-means clustering? (13.12)
• Each iteration is polynomial: O(kn) to compute distances between the k centers and the n points (with an extra factor of d for d-dimensional points)
• Finite (exponential) iterations: O(k^n)
• Error decreases if ties are broken consistently
• The one exception: if an iteration doesn’t improve the error, nothing has changed, which is exactly how you know it has converged
Can k-means clustering get stuck and if so how could you avoid it? (13.12)
Yes it can get stuck!
a (b)    (c) (d)
e  f
If the points in parentheses are the chosen cluster centers, then (b) is going to claim a, b, e, and f, while (c) and (d) will just stay where they are.
Can be avoided via random restarts, or by selecting initial centers on the fringe, or by making sure the cluster centers are as far away from each other as possible.
What is soft clustering and how does it work? (13.14)
This is similar to bayesian learning where we are trying to find a hypothesis that maximizes the probability of the data.
Assume data was generated as follows:
• Select one of k Gaussians [with fixed variance] uniformly
• Sample x_i from that gaussian
• Repeat n times
Goal: Find a hypothesis h = ⟨μ_1, …, μ_k⟩ (the k means) that maximizes the probability of the data (maximum likelihood)
This means that each data point could probabilistically belong to more than one cluster.
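Written as a formula (with σ² the shared fixed variance and the Gaussian chosen uniformly), the likelihood a hypothesis h = ⟨μ_1, …, μ_k⟩ assigns to a point is the mixture below, and maximum likelihood picks the means that maximize the product over all n points:

```latex
P(x \mid h) = \frac{1}{k} \sum_{j=1}^{k} \mathcal{N}(x;\, \mu_j, \sigma^2),
\qquad
h_{ML} = \arg\max_{\mu_1, \dots, \mu_k} \prod_{i=1}^{n} P(x_i \mid h)
```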
What is the ML mean of the gaussian if there is just one K? (13.15)
If you know all the data is coming from the same gaussian, then the ML mean is just the mean of the data.
What is expectation maximization (EM)? (13.16/17)
Moving back and forth between soft clustering and computing the means for the soft clustering.
This is conceptually much like k-means, but you’re assigning the probability of being in each cluster to each given datapoint.
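A minimal sketch of EM in this setting (1-D data, k Gaussians with the same fixed variance, uniform priors; all names are my own):

```python
import numpy as np

def em_gaussians(x, k, sigma=1.0, iters=100, rng=np.random.default_rng(0)):
    """EM for k fixed-variance 1-D Gaussians with uniform mixing weights.
    E step: softly assign each point to each cluster.
    M step: move each mean to the expectation-weighted average of the data."""
    mu = rng.choice(x, size=k, replace=False)              # initialize means at random data points
    for _ in range(iters):
        # E step: responsibility of cluster j for point i (proportional to the Gaussian density)
        dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        z = dens / dens.sum(axis=1, keepdims=True)          # soft cluster memberships
        # M step: recompute each mean from the soft assignments
        mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
    return mu, z
```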
What are the properties of EM? (13.18)
• Each iteration of EM produces a monotonically non-decreasing likelihood (it keeps finding higher and higher likelihoods)
• Does not have to converge, but it will not diverge either
• Can get stuck (happens ALL THE TIME in practice; many local optima. Solution: random restarts)
• Works with any distribution (if E, M are both solvable)
• Usually the E step is the harder part and the M step is just basic math
What are the desirable properties of clustering algorithms? (13.19)
• Richness: For any assignment of objects to clusters, there is some distance matrix D such that P_D returns that clustering. In other words, the algorithm can express any possible clustering, and you can’t force it to produce a specific number of clusters. (13.20)
• Scale-Invariance: Scaling distances by a positive value does not change the clustering (going from miles to kilometers may change the values, but it doesn’t change the clustering)
• Consistency: Shrinking intra-cluster distances (dist b/t samples in the same cluster) and expanding inter-cluster distances (dist b/t the clusters) does not change the clustering: P_D = P_D’
Which clustering scheme can achieve all the desirable properties of clustering? (13.21)
NONE! Impossibility theorem (Kleinberg) says no clustering scheme can achieve all three:
• richness
• scale invariance
• consistency
Why do we bother with feature selection? (14.2)
• Curse of dimensionality
• Allows you to focus on the features that matter and ignore those that don’t
What is filtering vs wrapping in feature selection? (14.4)
Filtering:
N Features -> Search Algorithm -> Fewer Features
• The criterion of how you know if you’re doing well is buried within the search algorithm
• Can use the labels in your filter
Wrapping:
N Features -> [ Search <-> Learning Alg. ] -> Fewer Features
• In this case the search passes a candidate feature subset to the learning algorithm, the learner reports back how well it does with those features, and this goes back and forth until you converge on a solution
What are the pros and cons of filtering and wrapping? (14.5)
Filtering:
+ Speed!
- Speed is also the downside: you’re looking at isolated features and aren’t considering feature dependence
- ignores the learning problem
Wrapping:
+ take into account model bias and learning!
- But it’s really really slow
What are some ways in which you could do filtering? (14.6)
• Information gain
• variance, entropy
• ‘useful’ features
• independent/non-redundant features
The goal is to make the learning problem easier by reducing the number of features (which reduces the number of samples necessary), up until the point where it makes the learning problem harder because you just don’t have enough differentiating features.
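For instance, a filter could rank features by information gain against the labels and keep the top m. A sketch assuming discrete features and labels (names are my own):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def filter_features(X, y, m):
    """Return the indices of the m features with the highest information gain."""
    scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:m]
```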
What are some ways in which you could do wrapping? (14.6/7)
•Any randomized optimization algorithm
• Forward Search (a form of hill climbing)
• Backward Search (a form of hill climbing)
How does forward search work? (14.7)
• Go through each feature one at a time and pass it to the learning algorithm and get the score for each
• Keep the feature with the highest score
• Now combine that one feature with every other again, passing it to the learner and getting back a score
• Again keep the best combination
• Repeat this until adding a feature doesn’t produce significant gains to the learner
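A sketch of that loop, assuming a hypothetical score(features) helper that trains the learner on just those features and returns something like cross-validation accuracy:

```python
def forward_search(all_features, score, min_gain=1e-3):
    """Greedy forward selection: repeatedly add the single feature that most
    improves the learner's score, stopping when the gain becomes negligible."""
    selected, best_score = [], float("-inf")
    remaining = list(all_features)
    while remaining:
        # try adding each remaining feature to the current set and score it
        candidates = [(score(selected + [f]), f) for f in remaining]
        new_score, best_f = max(candidates, key=lambda c: c[0])
        if new_score - best_score < min_gain:       # adding a feature no longer helps enough
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = new_score
    return selected
```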
How does backward search work? (14.7)
Say you start with 5 (N) features: 1, 2, 3, 4, 5. In this case you’re trying to eliminate one feature at a time.
• Pass all combinations of N-1 features to the learner and get a score for each:
• 1, 2, 3, 4
• 1, 2, 3, 5
• 1, 2, 4, 5
• 1, 3, 4, 5
• 2, 3, 4, 5
• Keep the best combination and then repeat until your score peaks and then goes down
This is like eliminating people during tryouts.
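The symmetric sketch for backward search, with the same hypothetical score helper:

```python
def backward_search(all_features, score, min_loss=1e-3):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    hurts the learner least, stopping once the score starts to fall off."""
    selected = list(all_features)
    best_score = score(selected)
    while len(selected) > 1:
        # score every subset that leaves exactly one feature out
        candidates = [(score([g for g in selected if g != f]), f) for f in selected]
        new_score, dropped = max(candidates, key=lambda c: c[0])
        if best_score - new_score > min_loss:       # the score has peaked; stop eliminating
            break
        selected.remove(dropped)
        best_score = new_score
    return selected
```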
When is a feature strongly relevant? (14.9)
If removing that feature degrades the Bayes Optimal Classifier, then it is strongly relevant.
That is to say, if you are considering ALL features and you remove this one feature and the BOC degrades, it is strongly relevant.
You can make a strongly relevant feature not strongly relevant just by adding a copy of it to the feature set. (14.11)
When is a feature weakly relevant? (14.9)
If it is NOT strongly relevant AND adding it to some other subset of features would improve the Bayes Optimal Classifier.
So if you have a subset of features and adding this feature improves the BOC, it is weakly relevant.
If a feature is not strongly relevant and it is not weakly relevant then it is irrelevant.
What does feature relevance do? (14.10)
It measures the effect on the Bayes Optimal Classifier (increase or decrease). Relevance is about how much information a feature provides.
Example:
If a binary feature C is 1 for all samples, it has 0 entropy and is irrelevant.
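Spelled out for that example, where p(C = 1) = 1:

```latex
H(C) = -\sum_{v} p(C = v)\, \log_2 p(C = v) = -\left( 1 \cdot \log_2 1 \right) = 0
```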