Lecture 3 - Unsupervised Machine Learning Flashcards
What are some examples of unsupervised machine learning?
- outlier detection
- similarity search
- association rules
- data visualization
- clustering
Describe Clustering. What inputs does it take? What is the output?
Clustering is a way of grouping data into a number of clusters without having labels present
Input: Set of objects described by features xi
Output: An assignment of objects into “groups”
Unlike classification, we are not given the “groups”. The algorithm must figure these groups out
Can you give some examples of use cases for clustering?
- define market segments by clustering customers
- study social networks by recognizing communities
- recommendation systems (Amazon recommending products, Netflix recommending shows)
How do you normalize/scale data?
You can either
- Scale data from 0-1
- Normalize using the Z-score, x' = (x − μ)/σ: transform the data so that it is expressed in standard deviations (σ) from the mean
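A minimal sketch of both approaches using scikit-learn (the small feature matrix X is just illustrative example data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example feature matrix (rows = samples, columns = features)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: rescales each feature to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score standardization: x' = (x - mean) / std for each feature
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_zscore)
```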
What is K-Means Clustering? What is the input of the algorithm? What are the assumptions? Describe the 4 steps in the algorithm.
K Means clustering is one of the most popular clustering methods
Input:
- The number of clusters ‘k’ (hyperparameter)
Assumptions:
- The center of each cluster is the mean of all samples belonging to that cluster
- Each sample is closer to the center of its own cluster than to the center of other clusters
The four steps are like so:
1. Make an initial guess of the center (the “mean”) of each cluster
2. Assign each xi to its closest mean
3. Update the means based on the cluster assignments
4. Repeat steps 2-3 until convergence
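A minimal NumPy sketch of these four steps (the variable names and the random-sample initialization are my own assumptions, not taken from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initial guess of the centers - pick k random samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each x_i to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each center to the mean of its assigned samples
        # (keep the old center if a cluster happens to end up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Tiny usage example with two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```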
What are the assumptions of K-Means clustering?
The center of each cluster is the mean of all samples belonging to that cluster
Each sample is closer to the center of its own cluster than to centers of other clusters
How can you relate K-Means clustering to set theory?
We can interpret K-Means steps as trying to minimize an objective:
Given a set of observations (x1,x2,…,xn) the algorithm’s goal is to partition the n observations into k sets S={S1,S2,…,Sk} so as to minimize the within-cluster sum of squares:
{See the rest of the math in Notion}
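The formula referenced above is presumably the standard within-cluster sum-of-squares objective, written out here for completeness:

```latex
\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\quad \text{where } \mu_i \text{ is the mean of the points in } S_i
```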
How can you determine how many K’s in K-Means clustering?
You can determine the number of clusters using:
- Elbow Method
- Silhouette analysis
What is the Elbow Method?
Elbow Method:
- Run K-means for several k
- Distortion: Sum of distances of each point to the center of the closest cluster
- Look for k where the curve stops decreasing rapidly
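A minimal sketch of the elbow method using scikit-learn (the synthetic blob data is illustrative; note that KMeans' inertia_ attribute is the sum of *squared* distances to the closest center, used here as the distortion):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs (just for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Run K-Means for several k and record the distortion
distortions = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)

# Look for the k where the curve stops decreasing rapidly (the "elbow")
for k, d in zip(range(1, 9), distortions):
    print(k, round(d, 1))
```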
What is silhouette analysis?
The thickness of each group in the silhouette plot shows the size of the cluster (how many datapoints are assigned to it)
The clusters should have roughly similar silhouette coefficients, none of them should fall below the mean silhouette coefficient, and ideally they should also be of roughly the same thickness (unless you can clearly see that the clusters genuinely differ in size)
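A minimal sketch of computing silhouette scores with scikit-learn (the data and the candidate values of k are illustrative assumptions; silhouette_samples would give the per-point values used to draw the plot described above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 true clusters (for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Compare the mean silhouette coefficient for several candidate k
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```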
What are some issues with K-Means clustering?
Final cluster assignment depends on initialization of centers
- Cluster assignments may vary on different runs
- May not achieve global optimum
Assumes you know the number of clusters ‘k’
- Lots of heuristic approaches to picking ‘k’
Each object is assigned to one (and only one) cluster:
- No possibility for overlapping clusters or leaving objects unassigned
- Fuzzy clustering / soft k-means allows assigning a point to multiple clusters
Sensitive to scale
When is a set convex?
A set is convex if the line segment between any two points in the set stays within the set (see images on Notion)
Can K-Means cluster into non-convex sets?
No, K-Means cannot; the clusters it produces are always convex
What is Density based Clustering
- Clusters are defined by “dense” regions
- It’s deterministic, meaning that it always gives the same clusters
- No fixed number of clusters ‘k’, determines them by itself
- Objects in non-dense regions don’t get clustered, i.e. it is not trying to “partition” the space
- Clusters can be non-convex, i.e. you can find clusters of any shape
What is DBSCAN? Which hyperparameters does it have?
DBSCAN is a density based clustering algorithm.
It has two hyperparameters:
- Epsilon (ε): The distance we use to decide if another point is a “neighbour”.
- MinNeighbours: Number of neighbours needed to say a region is “dense”
If you have at least minNeighbours “neighbours”, you are called a “core point”
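A minimal scikit-learn sketch (note that sklearn’s DBSCAN exposes these hyperparameters as eps and min_samples, where min_samples counts the point itself, a small difference from the lecture’s minNeighbours):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (illustrative data)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [9.0, 0.0]])

# eps = epsilon neighbourhood radius, min_samples ~ minNeighbours (including the point itself)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are not assigned to any cluster (outliers)
```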
Describe the algorithm of density-based clustering (the process)
For each example xi:
- If xi is already assigned to a cluster, do nothing
- Test whether xi is a ‘core’ point (≥ minNeighbours examples within ‘ε’)
- If xi is not a core point, do nothing (this could be an outlier).
- if xi is a core point, “expand” cluster
“Expand” cluster function:
- Assign all xj within distance ‘ε’ of core point xi to cluster.
- For each newly-assigned neighbour xj that is a core point, “expand” cluster
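A minimal from-scratch sketch of this process (the variable names and the queue-based expansion are my own assumptions; a real implementation such as scikit-learn’s DBSCAN is far more memory- and time-efficient):

```python
import numpy as np

def dbscan(X, eps, min_neighbours):
    n = len(X)
    labels = np.full(n, -1)          # -1 means "not assigned to any cluster"
    # Precompute neighbourhoods: indices within distance eps of each point (excluding itself)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where((dists[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1:                      # already assigned to a cluster: do nothing
            continue
        if len(neighbours[i]) < min_neighbours:  # not a core point: leave it (could be an outlier)
            continue
        # Core point: start a new cluster and "expand" it
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                # Newly assigned neighbour that is itself a core point: keep expanding
                if len(neighbours[j]) >= min_neighbours:
                    queue.extend(neighbours[j])
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [9.0, 0.0]])
print(dbscan(X, eps=0.5, min_neighbours=2))   # last point stays unclustered (-1)
```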
What are some of the issues with density-based clustering?
Some points are not assigned to a cluster
- Good/bad depending on the application
Ambiguity of “non-core” (boundary) points between clusters
Consumes a lot of memory with large datasets
Sensitive to the choice of ε and minNeighbours
- Otherwise, not sensitive to initialization (except for boundary points)
What are the two ways of doing hierarchical clustering?
Hierarchical clustering can be split into the following two types of clustering:
- Divisive Clustering
  - Top-down hierarchical clustering where all observations start in one cluster and are then divided into smaller and smaller clusters
- Agglomerative Clustering
  - Hierarchical clustering using a bottom-up approach where each observation starts in its own cluster
In general, Agglomerative clustering works much better in practice
In Agglomerative clustering, clusters are successively merged…
- Using some linkage criterion
- and based on a distance metric
until all samples belong to one cluster
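A minimal scikit-learn sketch of agglomerative clustering with a chosen linkage criterion (the data and parameter values are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative data with three blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Merge clusters bottom-up using average linkage and Euclidean distance
# (the distance metric can be changed via the `metric` parameter in recent
#  scikit-learn versions, called `affinity` in older ones)
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)
print(labels[:10])
```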
True or False? If uncertain whether scaling is required, I should scale my data
True, if you’re not sure whether scaling is needed, scale it.
Hierarchical clustering is often visually inspected using…
A dendrogram
Which is a tree diagram that shows the hierarchy and how the data is split into clusters
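A minimal sketch of building and plotting a dendrogram with SciPy (requires matplotlib; the data is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data: two small groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)),
               rng.normal(3, 0.3, size=(5, 2))])

# Agglomerative clustering with Ward linkage; Z records the merge hierarchy
Z = linkage(X, method="ward")

# The dendrogram shows which samples/clusters are merged and at what distance
dendrogram(Z)
plt.show()
```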
Which distance metrics are typically used in Agglomerative clustering?
Euclidean Distance
Manhattan (block) distance
Which different linkages (for hierarchical clustering) are there?
- Centroid
- Single (“nearest neighbour”)
- Complete (“farthest neighbour”)
- Average
- Ward
What is a centroid linkage?
Centroid: The distance between the centroids of each cluster
What is a Single (“Nearest neighbour”) linkage?
Single (“nearest neighbour”): The shortest distance between any two points, one from each cluster
What is a Complete (“Farthest neighbour”) linkage?
Complete (“farthest neighbour”): The longest distance between any two points, one from each cluster
What is an Average Linkage?
Average: The average distance over all pairs of points, one from each cluster
What is a Ward Linkage?
Ward: The sum of the squared distances from each point to the mean of the merged clusters
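For reference, the single, complete and average linkages can be written as distances between clusters A and B (standard textbook definitions, not taken verbatim from the lecture):

```latex
d_{\text{single}}(A, B)   = \min_{a \in A,\, b \in B} d(a, b) \\
d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b) \\
d_{\text{average}}(A, B)  = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)
```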
What are the issues with hierarchical clustering?
Infeasible with very large datasets
Influenced by order of datapoints
Sensitive to outliers
It is impossible to undo a step in hierarchical clustering (i.e. revert to the previous step)
What is the purpose of unsupervised learning?
Since we do not have data labels, the purpose is to group similar datapoints together and to find patterns in the data
What are some common scaling/normalization methods?
Rescaling (min-max normalization), Mean normalization, Standardization (Z-score Normalization)
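The standard formulas for these three methods, written out as a hedged reference (the lecture itself only names them):

```latex
x' = \frac{x - \min(x)}{\max(x) - \min(x)} \quad \text{(rescaling / min-max normalization)} \\
x' = \frac{x - \mu}{\max(x) - \min(x)} \quad \text{(mean normalization)} \\
x' = \frac{x - \mu}{\sigma} \quad \text{(standardization / Z-score)}
```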
What is the objective of K-means clustering? (can also be used as a definition of K-means)
Given a set of observations, the algorithm’s goal is to partition the n observations into K sets so as to minimize within-cluster sum of squares
Some extra info on silhouette analysis:
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
True or False? DBSCAN is sensitive to hyperparameter setting of epsilon and MinNeighbours, and also to the initialization, as it first guesses the mean of the clusters.
False. DBSCAN is sensitive to the hyperparameters epsilon and MinNeighbours, but unlike K-Means it does not start from an initial guess of the cluster means, so it is not sensitive to initialization