Lecture 3 - Unsupervised Machine Learning Flashcards
What are some examples of unsupervised machine learning?
- outlier detection
- similarity search
- association rules
- data visualization
- clustering
Describe Clustering. What inputs does it take? What is the output?
Clustering is a way of grouping data into a number of clusters without having labels present
Input: Set of objects described by features xi
Output: An assignment of objects into “groups”
Unlike classification, we are not given the “groups”. The algorithm must figure these groups out
Can you give some examples of use cases for clustering?
- define market segments by clustering customers
- study social networks by recognizing communities
- recommendation systems (Amazon recommending products, Netflix recommending shows)
How do you normalize/scale data?
You can either:
- Scale the data to the range 0-1 (min-max scaling)
- Normalise using the Z-score (x′ = (x − μ)/σ): transform the data so each value is expressed as the number of standard deviations σ from the mean
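Both scalings above can be sketched in a few lines of numpy (a minimal sketch; the sample array is made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: map values into the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalisation: express each value as standard
# deviations from the mean, giving mean 0 and std 1
x_z = (x - x.mean()) / x.std()

print(x_minmax)
print(x_z)
```

Note that min-max scaling is sensitive to outliers (a single extreme value compresses everything else), while the z-score is less so.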
What is K-Means Clustering? What is the input of the algorithm? What are the assumptions? Describe the 4 steps in the algorithm.
K-Means clustering is one of the most popular clustering methods
Input:
- The number of clusters ‘k’ (hyperparameter)
Assumptions:
- The center of each cluster is the mean of all samples belonging to that cluster
- Each sample is closer to the center of its own cluster than to the center of other clusters
The four steps are:
1. Make an initial guess of the center (the “mean”) of each cluster
2. Assign each xi to its closest mean
3. Update the means based on the cluster assignments
4. Repeat steps 2-3 until convergence
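The steps above can be sketched as a minimal Lloyd's-algorithm implementation in numpy (the two-blob toy data and random initialisation are illustrative assumptions, not from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    # Step 1: initial guess of the centers: pick k random samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each x_i to its closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update the means from the cluster assignments
        # (keep the old center if a cluster ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans(X, k=2)
print(labels)
```

On this toy data the two blobs end up in separate clusters; which blob gets label 0 depends on the random initialisation.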
What are the assumptions of K-Means clustering?
The center of each cluster is the mean of all samples belonging to that cluster
Each sample is closer to the center of its own cluster than to centers of other clusters
How can you relate K-Means clustering to set theory?
We can interpret K-Means steps as trying to minimize an objective:
Given a set of observations (x1,x2,…,xn) the algorithm’s goal is to partition the n observations into k sets S={S1,S2,…,Sk} so as to minimize the within-cluster sum of squares:
{See the rest of the math in Notion}
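For reference, the standard textbook form of this objective (which may differ in notation from the Notion notes) is:

```latex
\operatorname*{arg\,min}_{S} \; \sum_{i=1}^{k} \sum_{x \in S_i} \left\lVert x - \mu_i \right\rVert^2
```

where μi is the mean of the points in Si. Steps 2-3 of the algorithm each monotonically decrease this quantity, which is why K-Means converges (though possibly to a local optimum).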
How can you determine how many K’s in K-Means clustering?
You can determine how many clusters using:
- Elbow Method
- Silhouette analysis
What is the Elbow Method?
Elbow Method:
- Run K-means for several k
- Distortion: Sum of distances of each point to the center of the closest cluster
- Look for k where the curve stops decreasing rapidly
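A sketch of the elbow method, using a minimal K-Means and distortion as defined above (sum of distances of each point to its closest center); the three-blob data is an illustrative assumption:

```python
import numpy as np

def distortion(X, k, n_iters=50, seed=0):
    """Run a minimal K-Means, then return the distortion:
    the sum of distances of each point to its closest center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(d.min(axis=1).sum())

# Three well-separated blobs: distortion drops sharply up to k = 3,
# then flattens out -- the "elbow" suggests k = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
for k in range(1, 6):
    print(k, distortion(X, k))
```

In practice you would plot distortion against k and look for the bend in the curve rather than reading raw numbers.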
What is silhouette analysis?
Thickness of the plot shows the size of the cluster (how many datapoints are assigned to the cluster)
The groups in the graph should have roughly similar silhouette coefficients, none of them should fall below the mean silhouette coefficient, and ideally they should also be of roughly the same thickness (unless the clusters genuinely differ in size, in which case the difference should be clearly visible)
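The silhouette coefficient behind these plots can be computed directly; a minimal sketch (the helper name `silhouette_samples` and the toy data are assumptions for illustration):

```python
import numpy as np

def silhouette_samples(X, labels):
    """s(i) = (b - a) / max(a, b) for each point, where
    a = mean distance to the other points in the same cluster and
    b = mean distance to the points of the nearest other cluster."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from a
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated clusters -> coefficients close to 1
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
s = silhouette_samples(X, labels)
print(s.mean())
```

Coefficients lie in [-1, 1]: values near 1 mean a point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.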
What are some issues with K-Means clustering?
Final cluster assignment depends on initialization of centers
- Cluster assignments may vary on different runs
- May not achieve global optimum
Assumes you know the number of clusters ‘k’
- Lots of heuristic approaches to picking ‘k’
Each object is assigned to one (and only one) cluster:
- No possibility for overlapping clusters or leaving objects unassigned
- Fuzzy clustering/soft k-means allows assigning to many
Sensitive to scale
When is a set convex?
A set is convex if the line segment between any two points in the set stays entirely within the set (see images on Notion)
Can K-Means cluster into non-convex sets?
No, K-Means cannot. Each point is assigned to its nearest center, so every K-Means cluster is a convex region of the space
What is Density-based Clustering?
- Clusters are defined by “dense” regions
- It’s deterministic, meaning that it always gives the same clusters
- No fixed number of clusters ‘k’, determines them by itself
- Objects in non-dense regions don’t get clustered, i.e. it is not trying to “partition” the space
- Clusters can be non-convex, i.e. you can find clusters of any shape
What is DBSCAN? Which hyperparameters does it have?
DBSCAN is a density based clustering algorithm.
It has two hyperparameters:
- Epsilon (ε): Distance we use to decide if another point is a “neighbour”
- MinNeighbours: Number of neighbours needed to say a region is “dense”
If a point has at least MinNeighbours “neighbours”, it is called a “core point”
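A minimal DBSCAN sketch showing both hyperparameters and the core-point rule (here a point's neighbourhood includes the point itself, one common convention; the toy data is illustrative):

```python
import numpy as np

def dbscan(X, eps, min_neighbours):
    """Minimal DBSCAN sketch. Label -1 means noise (unclustered)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighbours = points within distance eps; a point with at least
    # min_neighbours of them is a "core point"
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_neighbours for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unvisited core point
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster   # reachable point joins the cluster
                if core[j]:           # only core points keep expanding it
                    frontier.extend(neighbours[j])
        cluster += 1
    return labels

# Two dense blobs plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])
labels = dbscan(X, eps=0.3, min_neighbours=3)
print(labels)  # two clusters, plus -1 for the isolated noise point
```

Note how the number of clusters is discovered by the algorithm rather than given as input, and the isolated point is simply left unassigned.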