Unsupervised Learning (Cluster Analysis) Flashcards
Notes from the Unsupervised Learning lecture that may help with the exam.
What is the one line definition of Unsupervised Learning?
A type of algorithm that learns hidden patterns from unlabelled data
What are two primary aspects of Unsupervised Learning?
Cluster Analysis - Divide data into meaningful groups
Dimensionality Reduction - PCA, Autoencoders, Generative Models
What are the 4 key aspects of Cluster Analysis?
- Features that describe the subjects
- Similarity Functions
- Basic Clustering Algorithms
- Cluster Validation
What are 4 common Similarity Functions used in Unsupervised Learning?
Euclidean Distance
Cosine Distance
Manhattan Distance
Jaccard Distance
What is the equation for Euclidean Distance?
Euclidean Distance = Square Root of ((A - B)^T * (A - B))
Where:
A, B: Column Vectors that contain features of two data samples
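Example (a minimal sketch; the function name euclidean_distance is illustrative and not from the lecture). It computes the square root of the summed squared feature differences, which equals (A - B)^T * (A - B) under the root:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```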
What is the equation for Cosine Distance?
Cosine Similarity: cos(theta) = (A^T * B) / (||A|| * ||B||)
Cosine Distance = 1 - cos(theta)
Where:
A, B: Column Vectors that contain features of two data samples
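Example (a minimal sketch, assuming the common convention that cosine distance is 1 minus cosine similarity; the function names are illustrative and not from the lecture):

```python
import math

def cosine_similarity(a, b):
    """cos(theta): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

# Orthogonal vectors have similarity 0, so their distance is 1.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Note that this depends only on the angle between the vectors, so scaling A or B leaves it (approximately, up to floating-point error) unchanged.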
What is the equation for the Manhattan Distance?
Manhattan Distance = Sum from i = 1 to n of |A_i - B_i|
Where:
A, B: Column Vectors that contain features of two data samples
n: Number of features in each vector
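Example (a minimal sketch using the standard definition, the plain sum of absolute feature differences with no averaging; the function name is illustrative and not from the lecture):

```python
def manhattan_distance(a, b):
    """Sum of absolute differences between corresponding features."""
    return sum(abs(x - y) for x, y in zip(a, b))

# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0
print(manhattan_distance([1, 2, 3], [4, 0, 3]))  # 5
```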
What is the equation for the Jaccard Distance?
Jaccard Similarity = |A intersection B| / |A union B|
Jaccard Distance = 1 - Jaccard Similarity
Where:
A, B: Sets of features of two data samples (e.g. the features present in each sample)
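Example (a minimal sketch, assuming each sample is treated as a set of features and that Jaccard distance is 1 minus the intersection-over-union ratio; the function name and the empty-set handling are illustrative assumptions):

```python
def jaccard_distance(a, b):
    """1 - |A intersection B| / |A union B| for two collections of features."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

# Intersection {b, c} has 2 items, union {a, b, c, d} has 4: 1 - 2/4
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```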
What is the step-by-step method of K-Means Algorithm?
- Randomly select K points as the initial centroids (centres) for each of the K groups
- Repeat this process:
a - Assign each point to its closest centroid (centre)
b - Re-compute the centroid (centre) of each cluster
- Until the centroids (centres) do not change
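Example (a minimal pure-Python sketch of the steps above, not a production implementation; the function name k_means, the max_iter and seed parameters, and the handling of empty clusters are illustrative assumptions):

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Randomly select K points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step a: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step b: re-compute each centroid as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(data, k=2)
print(sorted(len(c) for c in clusters))
```

Changing seed changes the initial centroids, which is a quick way to see how the solution can depend on the initialisation.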
What are the advantages of K-Means Algorithm?
Simple
Efficient
What are the disadvantages of K-Means Algorithm?
Solution dependent on the initialisation
Need to specify number of clusters
Sensitive to Outliers
What is the step-by-step method of Agglomerative Hierarchical Clustering?
- Treat each data point as its own cluster, and compute the similarity matrix between every pair of data points
- Repeat this process:
a - Merge the closest two clusters
b - Update the similarity matrix
- Until only one cluster remains
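Example (a minimal pure-Python sketch of the steps above; the function name agglomerative and the linkage parameter are illustrative assumptions. For brevity it re-computes cluster distances on every pass instead of storing and updating a similarity matrix):

```python
import math

def agglomerative(points, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    linkage=min is the minimum (single) linkage, linkage=max the maximum
    (complete) linkage. Clusters are lists of indices into points.
    """
    # Treat each data point as its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(c1, c2):
        return linkage(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        # Step a: find and merge the closest two clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Step b: the cluster list (and so the pairwise distances) is updated.
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for left, right, dist in agglomerative(data):
    print(left, right, round(dist, 2))
```

Reading the merge history from top to bottom gives the hierarchy; stopping after fewer merges yields any desired number of clusters.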
What are the advantages of Agglomerative Hierarchical Clustering?
Flexible with number of clusters
Can capture hierarchical relationship/s
What are the disadvantages of Agglomerative Hierarchical Clustering?
Solution is a local optimum and depends on the function used to measure the distance between clusters, e.g. minimum, maximum, group average, etc.
Requires larger memory and longer computational time
What is the step-by-step method for Density-Based Spatial Clustering of Applications with Noise (DBSCAN)?
- For every point, find its neighbourhood points: all points within distance E
- Identify the core points: those with at least the minimum number of points in their neighbourhood
- Connect the core points if they are within distance E of each other
- Make each group of connected core points into a separate cluster
- Assign each non-core point to a nearby cluster if it is within distance E of one of that cluster's core points; these are called border points
- Unassigned points are noise points
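Example (a minimal pure-Python sketch of the steps above; the function name dbscan and the parameter names eps and min_pts are illustrative. It assumes a point counts as its own neighbour, that a core point needs at least min_pts neighbours, and that a border point joins the first cluster that reaches it):

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Find the neighbourhood of every point (all points within eps).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Identify the core points.
    core = [len(nb) >= min_pts for nb in neighbours]
    labels = [-1] * n  # everything starts as unassigned (noise)
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a cluster from this core point through connected core points.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()  # p is always a core point
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id  # core or border point
                    if core[q]:
                        stack.append(q)  # only core points keep expanding
        cluster_id += 1
    # Points still labelled -1 are noise points.
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```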