lecture 11: K-means clustering Flashcards
recap: what type of learning is k-means clustering?
unsupervised learning, which uses unlabelled data
what is the goal of an unsupervised learning algorithm?
to take a feature vector x as input and transform it into another vector or value that can be used to solve a practical problem
the absence of labels on the data means
there is no solid reference point (ground truth) against which to judge the quality of the model
what are the main approaches for unsupervised learning
clustering, density estimation, component analysis, neural networks
what does the k in k-means clustering represent
it is the number of clusters we want to identify within the data. it is also the number of centroids: in the initial step, k distinct data points are selected at random to serve as the first centroids
what is the next step after randomly selecting the k initial data points or centroids?
for every other point in the data set, measure the distance (usually euclidean, though another metric can be used) between that point and each of the centroids, then assign the point to the cluster with the closest centroid
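the assignment step above can be sketched in a few lines of python. this is only an illustration, assuming euclidean distance and 2-d points; the function name assign_to_clusters and the sample data are made up for the example.

```python
import math

def assign_to_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    assignments = []
    for p in points:
        # Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        # the cluster of the closest centroid wins
        assignments.append(distances.index(min(distances)))
    return assignments

# hypothetical data: two points near (1, 1) and two near (9, 9)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.5)]
print(assign_to_clusters(points, centroids))  # [0, 0, 1, 1]
```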
what is the next step after the initial assignment to the clusters?
for each centroid, we calculate the average feature vector of all the examples assigned to its cluster, then that average vector becomes the new centroid.
we recompute the distances for each example to the new centroids and modify the assignments if necessary
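the centroid update described above might look like this. a minimal sketch: update_centroids is a hypothetical name, and the code assumes every cluster has at least one point assigned to it (an empty cluster would need separate handling).

```python
def update_centroids(points, assignments, k):
    """Return the average feature vector of each of the k clusters."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, a in zip(points, assignments):
        counts[a] += 1
        for d in range(dims):
            sums[a][d] += p[d]
    # assumption: counts[j] > 0 for every cluster j
    return [tuple(s / counts[j] for s in sums[j]) for j in range(k)]

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
assignments = [0, 0, 1, 1]
print(update_centroids(points, assignments, 2))  # [(1.25, 1.5), (8.5, 8.75)]
```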
how do we decide that the clusters are final
when the assignments no longer change after the centroids have been recomputed
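putting the steps together, the stopping rule above (assignments stop changing) can drive the whole loop. this is a sketch under assumptions, not a reference implementation: it uses euclidean distance, picks the initial centroids as k random data points, keeps the old centroid if a cluster ends up empty, and adds a max_iter cap as a safety net. the names k_means, seed and max_iter are illustrative.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # initial centroids: k data points chosen at random
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # assignment step: nearest centroid for every point
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # stop when no assignment changed since the last pass
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: keep the old centroid if the cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# hypothetical data: two loose groups of points
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

which cluster ends up labelled 0 or 1 depends on the random initial choice, so only the grouping is meaningful, not the label values.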
how do we decide the value of k in the first place
use an elbow plot: plot the within-cluster variance against k and pick the k at the "elbow", the point after which increasing k no longer gives a dramatic reduction in variance
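the quantity usually plotted on the y-axis of an elbow plot is the within-cluster sum of squared distances. the sketch below only computes that quantity for two hand-written clusterings of the same made-up data (k = 1 and k = 2) to show the kind of drop the plot reveals; in practice the centroids and assignments would come from running k-means for each k. the function name wcss is an assumption for this example.

```python
def wcss(points, centroids, assignments):
    """Within-cluster sum of squared Euclidean distances."""
    total = 0.0
    for p, a in zip(points, assignments):
        total += sum((x - c) ** 2 for x, c in zip(p, centroids[a]))
    return total

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]

# k = 1: a single cluster whose centroid is the mean of all points
k1 = wcss(points, [(4.875, 5.125)], [0, 0, 0, 0])

# k = 2: one centroid per visible group of points
k2 = wcss(points, [(1.25, 1.5), (8.5, 8.75)], [0, 0, 1, 1])

# a large drop from k = 1 to k = 2 suggests k = 2 is worth the extra cluster
print(k1, k2)
```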
what are the 2 methods for the initialisation step of a k-means clustering algorithm?
random partition - randomly assign each point to a cluster, then compute the mean of each cluster and use those means as the first centroids
forgy method - randomly choose k points from the data set to be the initial centroids
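the two initialisation methods above can be compared side by side. a minimal sketch: the function names are made up, and the fallback for a cluster that receives no points in the random partition is an assumption of this example, not part of the method as stated.

```python
import random

def forgy_init(points, k, rng):
    """Forgy: pick k data points at random as the initial centroids."""
    return rng.sample(points, k)

def random_partition_init(points, k, rng):
    """Random partition: random cluster per point, then take each cluster's mean."""
    assignments = [rng.randrange(k) for _ in points]
    centroids = []
    for j in range(k):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            # assumption: fall back to a random data point for an empty cluster
            members = [rng.choice(points)]
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids

rng = random.Random(0)
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
print(forgy_init(points, 2, rng))
print(random_partition_init(points, 2, rng))
```

forgy centroids are always actual data points, so they tend to start spread out; random-partition centroids are averages of random subsets, so they tend to start near the centre of the data.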
what is the difference between hard and soft clustering
hard - each data point can only belong to one cluster
soft (fuzzy) - each data point can belong to more than one cluster, usually with a degree of membership in each
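the difference can be seen by computing both kinds of assignment for one point. this is an illustration only: the soft memberships use the weighting from fuzzy c-means with fuzzifier m (an assumption, since the cards do not name a specific soft algorithm), and the function names are hypothetical.

```python
import math

def hard_assignment(point, centroids):
    """Hard clustering: the point belongs to exactly one cluster."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_memberships(point, centroids, m=2.0):
    """Soft clustering: one membership weight per cluster, summing to 1."""
    distances = [math.dist(point, c) for c in centroids]
    zeros = [d == 0.0 for d in distances]
    if any(zeros):
        # the point sits exactly on a centroid: full membership there
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    # fuzzy c-means style weights: closer centroids get larger membership
    weights = [d ** (-2.0 / (m - 1.0)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(1.0, 1.0), (9.0, 9.0)]
point = (4.0, 4.0)
print(hard_assignment(point, centroids))   # index of the single closest cluster
print(soft_memberships(point, centroids))  # a weight for every cluster
```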