Unsupervised Learning (Cluster Analysis) Flashcards

Question 1

Q

What is the one-line definition of Unsupervised Learning?

Answer

A

A type of algorithm that learns hidden patterns from unlabelled data

Question 2

Q

What are two primary aspects of Unsupervised Learning?

Answer

A

Cluster Analysis - Divide data into meaningful groups
Dimensionality Reduction - PCA, Autoencoders, Generative Models

Question 3

Q

What are the 4 key aspects of Cluster Analysis?

Answer

A

Features that describe the subjects
Similarity Functions
Basic Clustering Algorithms
Cluster Validation

Question 4

Q

Which 4 Similarity Functions are commonly used in Unsupervised Learning?

Answer

A

Euclidean Distance
Cosine Distance
Manhattan Distance
Jaccard Distance

Question 5

Q

What is the equation for Euclidean Distance?

Answer

A

Euclidean Distance = Square Root of ((A - B)^T * (A - B))
Where:
A, B: Column Vectors that contain features of two data samples
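
The formula above can be sketched in plain Python as follows. The function name and the sample values are illustrative assumptions, not part of the lecture notes.

```python
import math

def euclidean_distance(a, b):
    """Square root of (a - b)^T (a - b) for two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two samples with two features each (illustrative values).
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # prints 5.0
```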

Question 6

Q

What is the equation for Cosine Distance?

Answer

A

Cosine Similarity: cos(theta) = (A^T * B) / (||A|| * ||B||)
Cosine Distance = 1 - cos(theta)
Where:
A, B: Column Vectors that contain features of two data samples
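
A minimal sketch of this measure, assuming the common convention that cos(theta) is the similarity and the distance is 1 minus that value. The function names are illustrative, not from the lecture notes.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

# Perpendicular vectors are maximally dissimilar among non-negative features.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Vectors that point in the same direction, such as [1, 2] and [2, 4], have a distance close to 0 regardless of their lengths.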

Question 7

Q

What is the equation for the Manhattan Distance?

Answer

A

Manhattan Distance = Sum over i = 1..n of |A_i - B_i|
Where:
A, B: Column Vectors that contain features of two data samples
A_i, B_i: The i-th feature of each sample
n: Number of features
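
A short sketch, assuming the standard definition in which the Manhattan distance is the sum of absolute feature differences. Dividing that sum by n would give the mean absolute difference instead. The function name is illustrative.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
print(manhattan_distance([1, 2, 3], [4, 0, 3]))  # prints 5
```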

Question 8

Q

What is the equation for the Jaccard Distance?

Answer

A

Jaccard Similarity = |A intersection B| / |A union B|
Jaccard Distance = 1 - Jaccard Similarity
Where:
A, B: Sets of features present in two data samples (e.g. the non-zero entries of binary feature vectors)
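
A minimal sketch, assuming each sample is represented as a set of the features it contains. The ratio |A intersection B| / |A union B| is the Jaccard similarity, and the distance is commonly taken as 1 minus that ratio. The function names are illustrative.

```python
def jaccard_similarity(a, b):
    """|A intersection B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)

# Shared items {2, 3} out of all items {1, 2, 3, 4}.
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # prints 0.5
```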

Question 9

Q

What is the step-by-step method of K-Means Algorithm?

Answer

A

Randomly select K points as the initial centroids (centres), one for each of the K groups
Repeat this process:
a - Assign each point to its closest centroid (centre)
b - Re-compute the centroid (centre) of each cluster
Until the centroids (centres) no longer change
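
The steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the function name, the `seed` and `max_iter` parameters, and the choice to keep an empty cluster's old centroid are all assumptions.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Randomly select K points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # a - Assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # b - Re-compute the centroid of each cluster
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

Changing `seed` changes the initial centroids, which is a simple way to observe how the solution can depend on the initialisation.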

Question 10

Q

What are the advantages of K-Means Algorithm?

Answer

A

Simple
Efficient

Question 11

Q

What are the disadvantages of K-Means Algorithm?

Answer

A

Solution dependent on the initialisation

Need to specify number of clusters

Sensitive to Outliers

Question 12

Q

What is the step-by-step method of Agglomerative Hierarchical Clustering?

Answer

A

Treat each data point as its own cluster, and compute the similarity matrix between each pair of data points
Repeat this process:
a - Merge the closest two clusters
b - Update the similarity matrix
Until only one cluster remains
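
A minimal sketch of this procedure in plain Python. For brevity it recomputes the distance between clusters on every round instead of maintaining and updating a similarity matrix, so it illustrates the merge order rather than an efficient implementation. The function name and the `linkage` parameter are assumptions: passing `min` gives single linkage and `max` gives complete linkage.

```python
import math

def agglomerative(points, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    Clusters are lists of indices into `points`.
    """
    # Treat each data point as its own cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        return linkage(math.dist(points[i], points[j]) for i in c1 for j in c2)

    merges = []
    while len(clusters) > 1:
        # Find the closest two clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Merge them and continue with the reduced set of clusters.
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

for left, right, dist in agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)]):
    print(left, right, round(dist, 3))
```

Cutting the merge history at any point yields a clustering with a different number of clusters, which is why the number of clusters does not need to be fixed in advance.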

Question 13

Q

What are the advantages of Agglomerative Hierarchical Clustering?

Answer

A

Flexible with number of clusters

Can capture hierarchical relationships

Question 14

Q

What are the disadvantages of Agglomerative Hierarchical Clustering?

Answer

A

Solution is a local optimum and depends on the choice of linkage (inter-cluster similarity) function, e.g. minimum, maximum, group average, etc.
Requires more memory and longer computation time

Question 15

Q

What is the step-by-step method for Density-Based Spatial Clustering of Applications with Noise (DBSCAN)?

Answer

A

Find the neighbourhood of every point, i.e. all points within distance E of it
Identify the core points: those whose neighbourhood contains at least the minimum number of points
Connect the core points if they are within distance E of each other
Make each group of connected core points into a separate cluster
Assign each non-core point to a nearby cluster if it is within distance E of one of that cluster's core points; these are called border points
Any points left unassigned are noise points
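
The steps above can be sketched in plain Python as follows. The names `eps` (for distance E) and `min_pts` are assumptions, as is the convention that a point counts itself among its neighbours and is a core point when it has at least `min_pts` neighbours.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Neighbourhood of every point (each point counts itself).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have enough neighbours.
    core = [len(nb) >= min_pts for nb in neighbours]

    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a cluster from this core point: connected core points are
        # expanded further, border points are labelled but not expanded.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if core[q]:
                        stack.append(q)
        cluster_id += 1
    # Anything still labelled -1 is a noise point.
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The number of clusters is never passed in; it falls out of `eps` and `min_pts`, and the isolated point at (50, 50) is left as noise.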

Question 16

Q

What are the advantages of DBSCAN?

Answer

A

Robust to outliers

Can learn irregular (arbitrarily shaped) density patterns

Automatically determine the number of clusters

Question 17

Q

What are the disadvantages of DBSCAN?

Answer

A

Struggles with clusters of varying density (because a single E is used)

Computationally expensive

Sensitive to parameter settings

Question 18

Q

What is the step-by-step method for Expectation Maximisation?

Answer

A

Assume a distribution model that describes the data e.g. Gaussian
Initialise the model parameters (mean and variance)
Repeat this process:
a - E step: Using the current model parameters, calculate the expected value of the log likelihood function, i.e. the probability that each data point belongs to each class
b - M step: Re-estimate the model parameters so that they maximise this expected log likelihood
Until Convergence
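
A minimal sketch of these steps for a one-dimensional Gaussian mixture. Everything specific here is an assumption made for illustration: the function names, the equal initial mixing weights, the small variance floor, and the convergence test on the change in log likelihood. It has no safeguards against numerical underflow for points far from every component.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gaussian_mixture(data, means, variances, max_iter=100, tol=1e-9):
    """Fit a 1-D Gaussian mixture, starting from the given means and variances."""
    means, variances = list(means), list(variances)
    k = len(means)
    weights = [1.0 / k] * k  # assumed: equal mixing weights to start
    prev_log_lik = -math.inf
    resp = []
    for _ in range(max_iter):
        # E step: probability that each data point belongs to each component.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
            log_lik += math.log(total)
        # M step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var, 1e-6)  # assumed floor to avoid zero variance
            weights[j] = n_j / len(data)
        # Until convergence: stop when the log likelihood stops improving.
        if abs(log_lik - prev_log_lik) < tol:
            break
        prev_log_lik = log_lik
    return means, variances, weights, resp

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights, resp = em_gaussian_mixture(data, [0.0, 6.0], [1.0, 1.0])
print([round(m, 3) for m in means])
```

The returned `resp` holds the soft assignments from the final E step: each row gives the probability of one data point belonging to each component, rather than a single hard label.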

Question 19

Q

What are the advantages of Expectation Maximisation?

Answer

A

Soft clustering

Question 20

Q

What are the disadvantages of Expectation Maximisation?

Answer

A

Restricted by the distribution model

Sensitive to initialisation

Need to specify the number of clusters

Question 21

Q

What is a sign of a good Clustering Analysis?

Answer

A

If the clustering algorithm groups similar data samples together and keeps dissimilar data samples apart, then it has performed well.
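
One simple way to check this numerically is to compare the average distance between samples in the same cluster with the average distance between samples in different clusters. The sketch below is an illustrative assumption, not a named validation index from the lecture notes.

```python
import math
from itertools import combinations

def mean_intra_inter_distance(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        (intra if label_p == label_q else inter).append(math.dist(p, q))
    if not intra or not inter:
        raise ValueError("need at least one same-cluster pair and one cross-cluster pair")
    return sum(intra) / len(intra), sum(inter) / len(inter)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter_distance(points, labels)
print(intra, inter)
```

A within-cluster average that is much smaller than the between-cluster average suggests that similar samples were grouped together and dissimilar ones kept apart.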

Notes from the Unsupervised Learning lecture that may help with the exam. (21 cards)