CHAP 8 : Unsupervised learning + k means clustering Flashcards
What are the types of unsupervised learning tasks? [4]
- Clustering – grouping of similar customer profiles
- Dimensionality reduction – finding key features from data
- Anormaly detection – detecting fraud transactions
** 4. Association rule mining – analyze data for patterns, or co-occurrences [e.g. people who buy a new home are most likely going to buy new furniture]
What is the aim of clustering?
To identify similar groups of data based on notion of similar patterns (shape, colour, size etc)
What measure is used to group data with similarities?
Similarity measures, such as distance metrics like euclidean distance (most commonly used)
What similarity measure is used in k means?
Squared Euclidean Distance
What are the 2 main steps in k-means algorithm?
- Cluster assignment step
- Move centroid step
What is K-Means algorithm?
it is a [popular] centroid-based clustering algorithm
List the steps of k means clustering in detail.
- Randomly initialise k-cluster centroids. (k = no of clusters)
- Cluster assignment step : assign data points to the closest cluster centre. (calculate
- Move the cluster centre (centroid) to the average of assigned points
- Stop when the algorithm convereges.
How do we know that the kmean algorithm converges and that we can stop running the algorithm? [2]
- There is no change in the point’s assignment.
- There is minimal change is the loss / objective function
What is the optimisation objective function in k means clustering? What is it called?
To minimise the sum of average squared Euclidean distances between each data point and its assigned cluster centroid. [aka within-cluster sum of squares (WCSS) or the sum of squared errors (SSE).]
Is objective function same as error function / cost function?
YES.
The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.
What does an optimal k means clustering look like?
- Have clusters of around equal sizes, that are spherical and are non-overlapping
- For each cluster, n > k, where k = no of cluster centres and n is the no of data points in the assigned cluster
What are the 2 methods we can use to choose k, the number of clusters?
- Elbow method, by plotting a graph of loss/ error function against the k, the number of clusters. [When we see an elbow shape in the graph, we pick the K-value where the elbow gets created, also known as the elbow point]
- Choose manually, k under situations where allowed. For example, shirt size only has 4 distinct sizes : S,M,L,XL. Thus k = 4