CS 7641 Final Flashcards
Supervised Learning
Use labeled training data to generalize labels to new instances
(function approximation)
Unsupervised Learning
making sense out of unlabeled data
data description
Single Linkage Clustering
- Consider each object a cluster (n objects)
- Define intercluster distance as the distance between the two closest points in the two clusters
- Merge two closest clusters
- Repeat n-k times to make k clusters
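The steps above can be sketched in Python. This is a minimal, brute-force sketch (the name `single_linkage` is illustrative, not from the lectures); it merges the two closest clusters n-k times, using closest-pair distance as the intercluster distance:

```python
import numpy as np

def single_linkage(points, k):
    """Single Linkage Clustering sketch: start with n singleton clusters,
    repeatedly merge the two clusters whose closest points are nearest,
    until only k clusters remain (n - k merges)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = (None, None, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # intercluster distance = distance between the 2 closest points
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters
```

Note the nested loops scan all pairwise distances each merge, which is where the cubic running time on the next cards comes from.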
Intercluster Distance in Single Linkage Clustering
the distance between the 2 closest points in the 2 clusters
Inter cluster distance represents
domain knowledge
T/F Single Link Clustering is Deterministic
True. Unless there are ties, it will give us an answer.
In SLC, if distance can be represented by edge link of a graph, what algorithm is it the same as?
Minimum Spanning Tree
What is the running time of Single Link Clustering (SLC) in terms of n points and k clusters
O(n^3). O(n^2) to look at all distances to find closest pairs and have to do this n times
Issues with SLC
Could end up with weird clusters that surround other clusters
k-means clustering algorithm
- pick k centers (at random)
- each center “claims” its closest points
- recompute the centers by averaging the clustered points
- repeat until convergence
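The four steps above can be sketched as follows. This is a hedged sketch, not a production implementation: it seeds centers from random data points, does not handle a center losing all of its points, and assumes Euclidean distance.

```python
import numpy as np

def kmeans(points, k, seed=0):
    """k-means sketch: pick k centers at random, let each center claim
    its closest points, average to get new centers, repeat until the
    centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    while True:
        # each center "claims" its closest points
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute the centers by averaging the clustered points
        # (a cluster that claims no points would give NaN here -- not handled)
        new_centers = np.array([points[labels == c].mean(axis=0)
                                for c in range(k)])
        # converged when recomputing the centers does not move them
        if np.allclose(new_centers, centers):
            return centers, labels
        centers = new_centers
```

Note that, as the later cards say, the centers returned are averages and need not be points in the original collection.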
When does k-means converge
When recomputing the centers does not move them
T/F the centers have to be a point in the collection of objects
False
k-means in euclidean space process
cycle between partition assignment and recalculating the centers
Which optimization algorithm is most like k-means? HC, GA, or SA
Hill Climbing. You take a step toward a configuration that is better than the one you had before
K-means runs out of configurations and converges in finite time when
finite configurations, break ties consistently, and never go into a configuration with a higher error
Big O for Each iteration of k-means clustering is
polynomial (pretty fast). O(kn): look at the distance between each of the k centers and the n points, then run through all the points to redefine clusters. Could also pick up an extra factor of d for d dimensions
Big O for Finite iterations of k-means
exponential O(k^n): the first of n objects can be in any of k clusters, the 2nd can be in any of k clusters, etc. In practice, it is much faster
If ties are broken consistently in k-means then error
decreases [with one exception]
T/F k-means cannot get stuck
False (local optima can still happen)
How to avoid k-means getting stuck
random restart OR do initial analysis and pick centers in good spots
Soft Clustering Overview
Points can belong probabilistically to one cluster or another
Soft Clustering Idea
Assume the data was generated by
1. select one of k gaussians uniformly [fixed known variance]
2. sample x from Gaussian
3. repeat n times
Task: find a hypothesis that maximizes the prob of the data (ML)
What is the maximum likelihood gaussian (soft clustering)
The ML mean of the Gaussian is the mean of the data
Expectation Maximization
Has two phases: expectation (which does the soft clustering) and maximization
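The generative story and the two EM phases can be sketched for 1-D data. This is a simplified sketch under the fixed-known-variance assumption from the cards above; the deterministic initialization (means spread evenly over the data range) is my own simplification, since random initialization plus restarts is more typical.

```python
import numpy as np

def em_gaussians(x, k, iters=50):
    """EM sketch for a mixture of k 1-D Gaussians with fixed, known variance.
    E-step: soft cluster assignments (responsibilities).
    M-step: ML mean of each Gaussian = responsibility-weighted mean of the data."""
    var = 1.0  # fixed known variance, as in the generative story
    mu = np.linspace(x.min(), x.max(), k)  # simple deterministic init
    for _ in range(iters):
        # E-step: probability each Gaussian generated each point (soft clustering)
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * var))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize likelihood by moving each mean to its weighted data mean
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```

With hard (0/1) responsibilities instead of soft ones, the M-step reduces to the k-means center update, which is why the two algorithms look so similar.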