Exam 2 Flashcards
Unsupervised Learning
- Data Description
Single Linkage Cluster - SLC
- Consider each object a cluster (n objects)
- Define inter-cluster distance as the distance between the closest points in the two clusters
- Merge the two closest clusters
- Repeat n-k times to make k clusters
- How you define the inter-cluster distance is where domain knowledge comes in (a minimal sketch of the whole procedure follows below)
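A minimal sketch of the steps above, assuming a precomputed n x n distance matrix D (the function name and inputs are illustrative, not from the cards):

```python
def single_linkage(D, k):
    """SLC sketch: D is an (n, n) distance matrix, k the desired number of clusters."""
    clusters = [{i} for i in range(len(D))]   # consider each object its own cluster
    while len(clusters) > k:                  # merge n - k times to get k clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster distance = distance between the closest pair of points
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]            # merge the two closest clusters
        clusters.pop(b)
    return clusters
```

The naive search over all cluster pairs and point pairs is what makes the run time cubic, as the next card notes.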
Single Linkage Cluster Run Time
O(n^3)
Issues with SLC
Might end up with unintuitive ("wrong") clusters, since merges depend only on the single closest pair of points; the result is sensitive to how the distance is defined
k-means clustering
- Pick k random center points
- Each center claims its closest points
- Recompute the centers by averaging the clustered points
- Repeat until convergence (see the sketch below)
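A minimal sketch of those four steps, assuming X is an (n, d) NumPy array; the centers are initialized to k randomly chosen data points, and the function name and defaults are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick k random points as centers
    for _ in range(n_iter):
        # each center claims its closest points
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # recompute the centers by averaging the clustered points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                 # repeat until convergence
            break
        centers = new_centers
    return centers, labels
```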
Configurations (k-means as an optimization problem)
The centers and the partition P (which center each point is assigned to)
Scores (k-means as an optimization problem)
How much error is introduced by representing the points by their centers (written out as a formula below)
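In symbols (not written on the card, but the standard k-means error consistent with that description), the score of a configuration is the sum of squared distances from each point to its assigned center, and k-means tries to minimize it:

$$E(P, \text{center}) = \sum_{x} \left\lVert \text{center}_{P(x)} - x \right\rVert^{2}$$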
Neighborhood (k-means as an optimization problem)
The neighborhood of a configuration (a partition paired with centers) is the set of configurations reachable by either keeping the centers the same and changing the partition, or keeping the partition the same and moving the centers
Properties of k-means clustering
- Each iteration is polynomial O(kn)
- Finite (exponential) iterations O(k^n)
- Error decreases (if ties are broken consistently)
- Can get stuck if it starts with a poor choice of centers. This is analogous to converging to a local optimum. One solution is to use random restarts.
Soft Clustering
Attaches a probability of belonging to each cluster to every point, instead of a hard assignment to a single cluster
Task is to find a hypothesis that maximizes the probability of the data (Maximum likelihood)
Maximum Likelihood Gaussian
The maximum likelihood estimate of the Gaussian mean μ is the mean of the data (see the worked equation below)
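As a short worked equation (a standard derivation, not spelled out on the card): because the normalizing constant does not depend on μ, maximizing the Gaussian likelihood of x_1, ..., x_n is the same as minimizing the sum of squared deviations, which the sample mean minimizes:

$$\hat{\mu}_{ML} = \arg\max_{\mu} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-\frac{(x_i - \mu)^{2}}{2\sigma^{2}}} = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^{2} = \frac{1}{n} \sum_{i=1}^{n} x_i$$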
Expectation Maximization
- Assigns a cluster probability to each point rather than a hard assignment
- Expectation: E[Z_ij] is the probability that element i was produced by cluster j, computed from the current means μ_j
- Maximization: μ_j is the mean of cluster j, recomputed as the E[Z_ij]-weighted average of the points
- Alternate: use the new μ_j to recompute E[Z_ij], and repeat (see the sketch below)
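A minimal sketch of the two alternating steps for k spherical Gaussians, under the simplifying assumptions of fixed unit variance and uniform priors (so only the means μ_j are learned; names and defaults are illustrative):

```python
import numpy as np

def em_gaussian_means(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]          # initial means
    for _ in range(n_iter):
        # E-step: E[Z_ij] = P(cluster j | point i) under the current means
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        resp = np.exp(-0.5 * (sq_dist - sq_dist.min(axis=1, keepdims=True)))  # shifted for stability
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_j = E[Z_ij]-weighted mean of the points
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mu, resp
```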
EM properties
- Monotonically non-decreasing likelihood: It’s not getting worse
- Does not have to converge in theory (though in practice it usually does)
- Will not diverge
- Will most likely get stuck in a local optimum; use random restarts
- Works with any distribution (as long as the expectation and maximization steps are solvable)
Richness (Clustering Properties)
For any assignment C of objects to clusters, there's some distance matrix D such that the clustering scheme P_D returns that clustering C
Scale-invariance (Clustering Properties)
Scaling distances by a positive value doesn’t change the clustering
Consistency (Clustering Properties)
Shrinking intra-cluster distances (moving points towards each other) and expanding inter-cluster (moving clusters away from each other) distances doesn’t change the clustering
Impossibility Theorem (Clustering Properties)
No clustering algorithm can achieve all three properties (richness, scale-invariance, and consistency) at once
Time to select the most relevant subset of M features out of N, where M <= N
2^N
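As a quick check of that count (the arithmetic is implied rather than written on the card): each of the N features is either kept or dropped, so the number of candidate subsets is

$$\sum_{M=0}^{N} \binom{N}{M} = 2^{N},$$

which is why exhaustive feature selection is exponential.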
Approaches to Feature Selection
Filtering and Wrapping
Filtering (Feature Selection)
- Apply a search algorithm to produce fewer features to be passed to the learning algorithm
- Faster than wrapping
- Doesn’t account for the relationships between features
- Ignores the learning process (no feedback)
- We can use an algorithm (e.g., a decision tree) that selects the best features, then use another algorithm, with a different inductive bias, for learning
- We can use different criteria to evaluate the usefulness of a subset of features: information gain, variance, entropy, or eliminating dependent features (see the sketch below)
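A minimal sketch of the decision-tree-as-filter idea, assuming NumPy arrays and scikit-learn; the function name, the choice of k, and KNN as the second learner are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def tree_filter(X, y, k):
    """Rank features by a decision tree's (information-gain-based) importances
    and return the indices of the top-k features."""
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    return np.argsort(tree.feature_importances_)[::-1][:k]

# Filter once, then learn with a different inductive bias (no feedback loop):
# keep = tree_filter(X_train, y_train, k=10)
# knn = KNeighborsClassifier().fit(X_train[:, keep], y_train)
```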
Wrapping
- Given a set of features, apply a search algorithm to produce fewer features, pass them to the learning algorithm, then use the output to update the selected set of features
- Takes into account the learning model’s bias
- Very slow
We can use different techniques to search:
- Randomized Optimization algorithms
- Forward selection: select a feature and evaluate the effect of creating different combinations of this feature with other features. Stop once you stagnate. This is similar to hill climbing
- Backward Elimination: start with all the features and evaluate the effect of eliminating each feature
Forward selection
select a feature and evaluate the effect of creating different combinations of this feature with other features. Stop once you stagnate; this is similar to hill climbing (see the sketch after the next card)
Backward Elimination
start with all the features and evaluate the effect of eliminating each feature
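A minimal sketch of forward selection as a wrapping approach, assuming scikit-learn; the learner (KNN), the cross-validated score, and the stagnation tolerance are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, learner=None, tol=1e-3):
    learner = learner if learner is not None else KNeighborsClassifier()
    selected, remaining, best_score = [], list(range(X.shape[1])), -np.inf
    while remaining:
        # try adding each remaining feature and keep the one that helps most
        score, f = max((cross_val_score(learner, X[:, selected + [f]], y).mean(), f)
                       for f in remaining)
        if score <= best_score + tol:      # stagnated: stop (hill-climbing style)
            return selected
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected
```

Backward elimination is the mirror image: start from all the features and greedily remove the one whose removal hurts the score least.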
x_i is <> relevant if removing it degrades the Bayes Optimal Classifier
strongly