CHAP 8 : Unsupervised learning + k means clustering Flashcards

Question 1

Q

What are the types of unsupervised learning tasks? [4]

Answer

A

Clustering – grouping of similar customer profiles
Dimensionality reduction – finding key features from data
Anormaly detection – detecting fraud transactions

** 4. Association rule mining – analyze data for patterns, or co-occurrences [e.g. people who buy a new home are most likely going to buy new furniture]

Question 2

Q

What is the aim of clustering?

Answer

A

To identify similar groups of data based on notion of similar patterns (shape, colour, size etc)

Question 3

Q

What measure is used to group data with similarities?

Answer

A

Similarity measures, such as distance metrics like euclidean distance (most commonly used)

Question 4

Q

What similarity measure is used in k means?

Answer

A

Squared Euclidean Distance

Question 5

Q

What are the 2 main steps in k-means algorithm?

Answer

A

Cluster assignment step
Move centroid step

Question 6

Q

What is K-Means algorithm?

Answer

A

it is a [popular] centroid-based clustering algorithm

Question 7

Q

List the steps of k means clustering in detail.

Answer

A

Randomly initialise k-cluster centroids. (k = no of clusters)
Cluster assignment step : assign data points to the closest cluster centre. (calculate
Move the cluster centre (centroid) to the average of assigned points
Stop when the algorithm convereges.

Question 8

Q

How do we know that the kmean algorithm converges and that we can stop running the algorithm? [2]

Answer

A

There is no change in the point’s assignment.
There is minimal change is the loss / objective function

Question 9

Q

What is the optimisation objective function in k means clustering? What is it called?

Answer

A

To minimise the sum of average squared Euclidean distances between each data point and its assigned cluster centroid. [aka within-cluster sum of squares (WCSS) or the sum of squared errors (SSE).]

Question 10

Q

Is objective function same as error function / cost function?

Answer

A

YES.
The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.

Question 11

Q

What does an optimal k means clustering look like?

Answer

A

Have clusters of around equal sizes, that are spherical and are non-overlapping
For each cluster, n > k, where k = no of cluster centres and n is the no of data points in the assigned cluster

Question 12

Q

What are the 2 methods we can use to choose k, the number of clusters?

Answer

A

Elbow method, by plotting a graph of loss/ error function against the k, the number of clusters. [When we see an elbow shape in the graph, we pick the K-value where the elbow gets created, also known as the elbow point]
Choose manually, k under situations where allowed. For example, shirt size only has 4 distinct sizes : S,M,L,XL. Thus k = 4