Section 7 Clustering K-means Flashcards
Define clustering
refers to a broad collection of unsupervised learning methods for finding groups of units in the data
Units belonging to the same cluster are more similar to each other than units belonging to different clusters.
Lots of approaches to clustering - we will focus on K-means
What are the parameters of k-means
The k-means clustering algorithm depends on two parameters:
K – The number of clusters.
μk – Cluster centroids
What could a cluster represent?
Cluster could be as a result of anything, a colour, a species separation, a behavioural separation, but with unsupervised learning we don’t know what this cause is.
In our data we will be using we will often have the output
In reality you don’t know this.
Explain how k means works in practice
Number of clusters K is an input
K means algorithm has two steps allocation and update.
Algorithm is initialized with randomly chosen cluster centroids.
Allocation: assign each observation to closest cluster centroid
Update: update centroids computing the mean of the points assigned to each cluster.
Converges until output is a vector of cluster allocations
What does convergence of k means algorithm mean
This repeats until the algorithm converges and reaches a stable configuration.
Basically, process is repeated until points are no longer moved between groups
Explain Euclidean distance and its relevance to k means
The Euclidean distance between a pair of points (xi, xh) is the straight line distance between the points
K-means involves thinking about the dissimilarity between points in terms of distance
Why do we used the squared Euclidean distance
Were examining the relative distance between points.
The squared Euclidean distance is often used in optimization problems as square roots are hard to deal with.
What is Z
Zi is the vector of length k where in the position of the cluster the observation is allocated to there is a one, everywhere else there is a zero.
What is the objective function fo the k means algorithm
The objective function uses the squared Euclidean distance, and it corresponds to the total within-cluster sum of squares, W .
The objective is to find the clustering that minimizes this sum of squares ie the objective function.
How is k means optimisation similar to ANOVA - whats the difference
This is similar to ANOVA concept except here we don’t know the grouping in advance, with ANOVA you do.
What is the allocation step of the iterative procedure of k means
Find the allocation values z(t)ik that minimise the objective function, holding the centroid values fixed, μk(t−1)
The value of k that gives the smallest value of the square function dictates which Z value is equal to 1.
What is the updating step of the iterative procedure of k means
Find the centroid values μk(t) that minimise the objective function, holding the allocation values fixed, zik(t)
What is the centroid
Centroid is the mean of observations in a given cluster.
What is one assumption heavily relied upon for k means and how is this rectified
Algorithm is started from random values for the centroids.
Different starting points could lead to different sub-optimal solutions.
Better run the algorithm several times with different starting points as results heavily rely upon this assumption.
Keep the solution with the minimum total within the sum of squares, as the best overall.
Explain internal and external validation
Internal validation – Measures of internal cluster consistency: can be useful to select the number of clusters
External validation – Comparing clustering to external reference clustering: useful to assess the quality of a given clustering
What two points should k value in k means algorithm highlight
Data points belonging to the same cluster should be similar to each other and close to the centroid, points belonging to different clusters should be dissimilar.
Name 4 methods of internal validation to quantify the characteristics and suitability of K
Elbow method - not very accurate
Calinski-Harabasz index.
Silhouette.
Gap statistic.
What is Wk
The W(K) or total within sum of squares, measures the variability “within”, that is the variability between the data points xi assigned to cluster k and the corresponding centroid μk.
What is Bk
The B(K), or the between sum of squares, measures the variability “between”, that is the variability between the clusters accounting for the centre of the data.
What is total sum of squares
Sum over all clusters of Wk+Bk