Lecture 3 - Unsupervised Machine Learning Flashcards
What are some examples of unsupervised machine learning?
- outlier detection
- similarity search
- association rules
- data visualization
- clustering
Describe Clustering. What inputs does it take? What is the output?
Clustering is a way of grouping data into a number of clusters without having labels present
Input: Set of objects described by features xi
Output: An assignment of objects into “groups”
Unlike classification, we are not given the “groups”. The algorithm must figure these groups out
Can you give some examples of use cases for clustering?
- define market segments by clustering customers
- study social networks by recognizing communities
- recommendation systems (Amazon recommending products, Netflix recommending shows)
How do you normalize/scale data?
You can either
- Scale data from 0-1
- Normalize using the Z-score, x' = (x − μ)/σ: transform the data so that it is expressed in standard deviations (σ) from the mean
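A minimal sketch of both approaches using scikit-learn (the small feature matrix X is just illustrative example data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example feature matrix (rows = samples, columns = features)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: rescales each feature to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score standardization: x' = (x - mean) / std for each feature
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_zscore)
```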
What is K-Means Clustering? What is the input of the algorithm? What are the assumptions? Describe the 4 steps in the algorithm.
K Means clustering is one of the most popular clustering methods
Input:
- The number of clusters ‘k’ (hyperparameter)
Assumptions:
- The center of each cluster is the mean of all samples belonging to that cluster
- Each sample is closer to the center of its own cluster than to the center of other clusters
The four steps are like so:
1. Make an initial guess of the center (the “mean”) of each cluster
2. Assign each xi to its closest mean
3. Update the means based on the cluster assignments
4. Repeat steps 2-3 until convergence
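A minimal NumPy sketch of these four steps (the variable names and the random-sample initialization are my own assumptions, not taken from the lecture):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initial guess of the centers - pick k random samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each x_i to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each center to the mean of its assigned samples
        # (keep the old center if a cluster happens to end up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: repeat until convergence (centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Tiny usage example with two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```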
What are the assumptions of K-Means clustering?
The center of each cluster is the mean of all samples belonging to that cluster
Each sample is closer to the center of its own cluster than to centers of other clusters
How can you relate K-Means clustering to set theory?
We can interpret K-Means steps as trying to minimize an objective:
Given a set of observations (x1,x2,…,xn) the algorithm’s goal is to partition the n observations into k sets S={S1,S2,…,Sk} so as to minimize the within-cluster sum of squares:
{See the rest of the math in Notion}
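The formula referenced above is presumably the standard within-cluster sum-of-squares objective, written out here for completeness:

```latex
\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\quad \text{where } \mu_i \text{ is the mean of the points in } S_i
```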
How can you determine how many K’s in K-Means clustering?
You can determine the number of clusters using:
- Elbow Method
- Silhouette analysis
What is the Elbow Method?
Elbow Method:
- Run K-means for several k
- Distortion: Sum of distances of each point to the center of the closest cluster
- Look for k where the curve stops decreasing rapidly
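A minimal sketch of the elbow method using scikit-learn (the synthetic blob data is illustrative; note that KMeans' inertia_ attribute is the sum of *squared* distances to the closest center, used here as the distortion):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs (just for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Run K-Means for several k and record the distortion
distortions = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)

# Look for the k where the curve stops decreasing rapidly (the "elbow")
for k, d in zip(range(1, 9), distortions):
    print(k, round(d, 1))
```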
What is silhouette analysis?
The thickness of each group in the silhouette plot shows the size of the cluster (how many datapoints are assigned to it)
The clusters should have roughly similar silhouette coefficients, none of them should fall below the mean silhouette coefficient, and ideally they should also be of roughly the same thickness (unless you can clearly see that the clusters genuinely differ in size)
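A minimal sketch of computing silhouette scores with scikit-learn (the data and the candidate values of k are illustrative assumptions; silhouette_samples would give the per-point values used to draw the plot described above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 true clusters (for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Compare the mean silhouette coefficient for several candidate k
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```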
What are some issues with K-Means clustering?
Final cluster assignment depends on initialization of centers
- Cluster assignments may vary on different runs
- May not achieve global optimum
Assumes you know the number of clusters ‘k’
- Lots of heuristic approaches to picking ‘k’
Each object is assigned to one (and only one) cluster:
- No possibility for overlapping clusters or leaving objects unassigned
- Fuzzy clustering / soft k-means allows assigning a point to multiple clusters
Sensitive to scale
When is a set convex?
A set is convex if the line segment between any two points in the set stays within the set (see images on Notion)
Can K-Means cluster into non-convex sets?
No, K-Means cannot; the clusters it produces are always convex
What is Density based Clustering
- Clusters are defined by “dense” regions
- It’s deterministic, meaning that it always gives the same clusters
- No fixed number of clusters ‘k’, determines them by itself
- Objects in non-dense regions don’t get clustered, i.e. it is not trying to “partition” the space
- Clusters can be non-convex, i.e. you can find clusters of any shape
What is DBSCAN? Which hyperparameters does it have?
DBSCAN is a density based clustering algorithm.
It has two hyperparameters:
- Epsilon (ε): The distance we use to decide if another point is a “neighbour”.
- MinNeighbours: Number of neighbours needed to say a region is “dense”
If you have at least minNeighbours “neighbours”, you are called a “core point”
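A minimal scikit-learn sketch (note that sklearn’s DBSCAN exposes these hyperparameters as eps and min_samples, where min_samples counts the point itself, a small difference from the lecture’s minNeighbours):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (illustrative data)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [9.0, 0.0]])

# eps = epsilon neighbourhood radius, min_samples ~ minNeighbours (including the point itself)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are not assigned to any cluster (outliers)
```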
Describe the algorithm of density-based clustering (the process)
For each example xi:
- If xi is already assigned to a cluster, do nothing
- Test whether xi is a ‘core’ point (≥ minNeighbours examples within ‘ε’)
- If xi is not a core point, do nothing (this could be an outlier).
- if xi is a core point, “expand” cluster
“Expand” cluster function:
- Assign all xj within distance ‘ε’ of core point xi to cluster.
- For each newly-assigned neighbour xj that is a core point, “expand” cluster
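A minimal from-scratch sketch of this process (the variable names and the queue-based expansion are my own assumptions; a real implementation such as scikit-learn’s DBSCAN is far more memory- and time-efficient):

```python
import numpy as np

def dbscan(X, eps, min_neighbours):
    n = len(X)
    labels = np.full(n, -1)          # -1 means "not assigned to any cluster"
    # Precompute neighbourhoods: indices within distance eps of each point (excluding itself)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where((dists[i] <= eps) & (np.arange(n) != i))[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1:                      # already assigned to a cluster: do nothing
            continue
        if len(neighbours[i]) < min_neighbours:  # not a core point: leave it (could be an outlier)
            continue
        # Core point: start a new cluster and "expand" it
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                # Newly assigned neighbour that is itself a core point: keep expanding
                if len(neighbours[j]) >= min_neighbours:
                    queue.extend(neighbours[j])
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [9.0, 0.0]])
print(dbscan(X, eps=0.5, min_neighbours=2))   # last point stays unclustered (-1)
```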
What are some of the issues with density-based clustering?
Some points are not assigned to a cluster
- Good/bad depending on the application
Ambiguity of “non-core” (boundary) points between clusters
Consumes a lot of memory with large datasets
Sensitive to the choice of ε and minNeighbours
- Otherwise, not sensitive to initialization (except for boundary points)
What are the two ways of doing hierarchical clustering?
Hierarchical clustering can be split into the following two types of clustering:
- Divisive Clustering
  - Top-down hierarchical clustering where all observations start in one cluster and are then divided into smaller and smaller clusters
- Agglomerative Clustering
  - Hierarchical clustering using a bottom-up approach where each observation starts in its own cluster
In general, Agglomerative clustering works much better in practice
In Agglomerative clustering, clusters are successively merged…
- Using some linkage criterion
- and based on a distance metric
until all samples belong to one cluster
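A minimal scikit-learn sketch of agglomerative clustering with a chosen linkage criterion (the data and parameter values are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative data with three blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Merge clusters bottom-up using average linkage and Euclidean distance
# (the distance metric can be changed via the `metric` parameter in recent
#  scikit-learn versions, called `affinity` in older ones)
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)
print(labels[:10])
```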
True or False? If uncertain whether scaling is required, I should scale my data
True, if you’re not sure whether scaling is needed, scale it.
Hierarchical clustering is often visually inspected using…
A dendrogram
Which is a tree diagram that shows the hierarchy and how the data is split into clusters
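A minimal sketch of building and plotting a dendrogram with SciPy (requires matplotlib; the data is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data: two small groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(5, 2)),
               rng.normal(3, 0.3, size=(5, 2))])

# Agglomerative clustering with Ward linkage; Z records the merge hierarchy
Z = linkage(X, method="ward")

# The dendrogram shows which samples/clusters are merged and at what distance
dendrogram(Z)
plt.show()
```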
Which distance metrics are typically used in Agglomerative clustering?
Euclidean Distance
Manhattan (block) distance
Which different linkages (for hierarchical clustering) are there?
- Centroid
- Single (“nearest neighbour”)
- Complete (“farthest neighbour”)
- Average
- Ward
What is a centroid linkage?
Centroid: The distance between the centroids of each cluster
What is a Single (“Nearest neighbour”) linkage?
Single (“nearest neighbour”): The shortest distance between any two points, one from each cluster
What is a Complete (“Farthest neighbour”) linkage?
Complete (“farthest neighbour”): The longest distance between any two points, one from each cluster
What is an Average Linkage?
Average: The average distance over all pairs of points, one from each cluster
What is a Ward Linkage?
Ward: The sum of the squared distances from each point to the mean of the merged clusters
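For reference, the single, complete and average linkages can be written as distances between clusters A and B (standard textbook definitions, not taken verbatim from the lecture):

```latex
d_{\text{single}}(A, B)   = \min_{a \in A,\, b \in B} d(a, b) \\
d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b) \\
d_{\text{average}}(A, B)  = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)
```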
What are the issues with hierarchical clustering?
Infeasible with very large datasets
Influenced by order of datapoints
Sensitive to outliers
It is impossible to undo a step in hierarchical clustering (i.e. revert to the previous step)
What is the purpose of unsupervised learning?
Since we do not have data labels, the purpose is to group similar datapoints together and to find patterns in the data
What are some common scaling/normalization methods?
Rescaling (min-max normalization), Mean normalization, Standardization (Z-score Normalization)
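The standard formulas for these three methods, written out as a hedged reference (the lecture itself only names them):

```latex
x' = \frac{x - \min(x)}{\max(x) - \min(x)} \quad \text{(rescaling / min-max normalization)} \\
x' = \frac{x - \mu}{\max(x) - \min(x)} \quad \text{(mean normalization)} \\
x' = \frac{x - \mu}{\sigma} \quad \text{(standardization / Z-score)}
```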
What is the objective of K-means clustering? (can also be used as a definition of K-means)
Given a set of observations, the algorithm’s goal is to partition the n observations into K sets so as to minimize within-cluster sum of squares
Some extra info on silhouette analysis:
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
True or False? DBSCAN is sensitive to hyperparameter setting of epsilon and MinNeighbours, and also to the initialization, as it first guesses the mean of the clusters.
False. DBSCAN is sensitive to the hyperparameters epsilon and MinNeighbours, but unlike K-Means it does not start from an initial guess of the cluster means, so it is not sensitive to initialization