Clustering Flashcards
What are some applications of clustering?
Image segmentation
Social network analysis
Bioinformatics
What is clustering?
The process of grouping a set of instances into classes of similar instances
What are the two types of clustering algorithms?
Partitional algorithms
Hierarchical algorithms
What are some considerations we need to make for clustering?
How to measure similarity between two things
What approach to take
How many clusters
What is a centroid?
A point that is considered to be the centre of a cluster
What are the steps for the k-means algorithm?
- Compute dist from all data to all centroids
- For each data point, assign it to whichever centroid is closer
- for each centroid, compute the mean of all points assigned to it
- Replace the centroids with the new averages
Can the final result of k-means clustering be affected by the initial centroid choice?
Yes - some can result in a poor convergence rate, or convergence to sub optimal clusterings
What is the time complexity of k-means clustering?
O(iknd)
i - iterations
k - num of centroids
n - num of data points
d - number of features of the data points
What are the limitations of k-means clustering?
- Must choose param k in advance
- Data must be numerical and must be able to be compared via a suitable measure
- Algorithm sensitive to outliers that do not belong in a cluster
- works best on data with spherical clusters
How do we measure cluster validity?
- High inter (between) cluster distances
- Low intra (within) cluster distances
What are the two kinds of hierarchical clustering algorithms?
- Agglomerative (bottom up)
- Divisive (top down)
How does HAC work?
- Start with each point being a cluster
- Merge the closest points
- Eventually all points belong to the same cluster
How does HDC work?
- Start with all points belonging to one cluster
- Split furthest points
- Eventually each node is an individual cluster
What are the steps for HAC?
Compute the distance matrix (= distance between any 2 data points)
Let each data point be a cluster
Repeat:
Merge the two (or more) closest clusters
Update the distance matrix
Until only a single cluster remains
What are the different ways to define closest clusters?
- Single link : dist of closest
- Complete link : dist of furthest
- Centroid : dist of cog
- Average link : average dist of data points