Unsupervised Learning Flashcards

1
Q

What is unsupervised learning?

A

Unsupervised learning is a type of machine learning where the training data does not contain any output information (i.e., unlabeled data). The goal is to find patterns and structures in the input data.

2
Q

What is clustering in unsupervised learning?

A

Clustering is the process of grouping similar objects into clusters based on their characteristics. It is used to create a higher-level representation of the data and for tasks such as data reduction and outlier detection.

3
Q

What are some common applications of unsupervised learning?

A

Social network analysis and marketing (e.g., customer segmentation)

Image segmentation

Data annotation (e.g., single-cell transcriptomics)

4
Q

What is the goal of clustering algorithms?

A

Clustering algorithms aim to form groups such that members within a group are similar to each other but different from members of other groups.

5
Q

What are similarity measures in clustering?

A

Similarity measures define how close two instances are to each other. Examples include Euclidean distance, Manhattan distance, and cosine similarity.
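For numeric vectors, all three measures can be computed in a few lines. A minimal NumPy sketch (function names are illustrative, not from the source):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance between two vectors
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1 distance)
    return np.sum(np.abs(a - b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors (1 = same direction)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(euclidean(a, b))          # sqrt(2) ≈ 1.414
print(manhattan(a, b))          # 2.0
print(cosine_similarity(a, b))  # 0.0 (orthogonal vectors)
```

Note that cosine similarity grows with closeness, whereas the two distances shrink, so the two kinds of measure are used with opposite sense.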

6
Q

What is a cluster center?

A

A cluster center is a representative data point of a cluster. For numeric data, it is the “center of mass” (mean), while for nominal data, it is the mode.
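A quick illustration of both cases on toy data (assuming NumPy; the variable names are made up for the example):

```python
import numpy as np
from collections import Counter

# Numeric attributes: the center is the per-attribute mean ("center of mass")
numeric_cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
center = numeric_cluster.mean(axis=0)
print(center)  # [3. 4.]

# Nominal attributes: the center is the most frequent value (mode)
nominal_cluster = ["red", "blue", "red", "green", "red"]
mode = Counter(nominal_cluster).most_common(1)[0][0]
print(mode)  # red
```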

7
Q

What are within-cluster and between-cluster variations?

A

Within-cluster variation (WC): Measures how compact the clusters are.

Between-cluster variation (BC): Measures the distances between different clusters.
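One common way to quantify both is with sums of squared distances; exact formulas vary by textbook, so treat this as an illustrative sketch:

```python
import numpy as np
from itertools import combinations

def within_cluster_variation(clusters):
    # WC: sum of squared distances from each point to its cluster centroid
    # (smaller = more compact clusters)
    return sum(np.sum((pts - pts.mean(axis=0)) ** 2) for pts in clusters)

def between_cluster_variation(clusters):
    # BC: sum of squared distances between every pair of cluster centroids
    # (larger = better-separated clusters)
    centroids = [pts.mean(axis=0) for pts in clusters]
    return sum(np.sum((c1 - c2) ** 2) for c1, c2 in combinations(centroids, 2))

clusters = [np.array([[0.0, 0.0], [0.0, 2.0]]),
            np.array([[10.0, 0.0], [10.0, 2.0]])]
print(within_cluster_variation(clusters))   # 4.0
print(between_cluster_variation(clusters))  # 100.0
```

A good clustering has small WC and large BC, which is why some quality criteria score a clustering by the ratio BC/WC.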

8
Q

What is the k-means algorithm?

A

K-means is a partition-based clustering algorithm that follows these steps:

Define the number of clusters (k).

Choose k initial centroids randomly.

Assign each data object to the nearest centroid.

Compute new centroids as the mean of cluster members.

Repeat the process until cluster membership no longer changes.
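The steps above can be sketched in a few lines of NumPy (an illustrative toy implementation, not a production one — e.g., it does not guard against empty clusters):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 2: choose k initial centroids randomly from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop once cluster membership no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two obvious groups: the algorithm should separate the pairs
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
labels, _ = kmeans(X, k=2)
print(labels)  # first two points share one label, last two the other
```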

9
Q

What are variations of the k-means algorithm?

A

Selection of the initial k means

Different dissimilarity calculations

Various strategies for calculating cluster means

Use of different distance measures

10
Q

What is the elbow method in k-means?

A

The elbow method helps determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS) against k. The ideal k is at the ‘elbow’: the point where WCSS stops dropping sharply and begins to level off.
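A self-contained sketch of the elbow computation on toy data. The toy k-means here uses deterministic first-k initialization purely so the example is reproducible; real code would typically use a library implementation:

```python
import numpy as np

def kmeans_wcss(X, k, max_iter=100):
    # Toy k-means; deterministic init: first k points become the centroids
    centroids = X[:k].copy()
    labels = None
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # WCSS: squared distance of each point to its assigned centroid
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Three tight pairs, ordered so the first 3 points span the 3 groups
X = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0],
              [0.1, 0.0], [5.1, 5.0], [10.1, 0.0]])
for k in range(1, 5):
    print(k, round(kmeans_wcss(X, k), 3))
# WCSS drops steeply up to k=3 and then levels off: the elbow is at k=3
```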

11
Q

What are the strengths of k-means clustering?

A

Simple and easy to implement

Computationally efficient

12
Q

What are the weaknesses of k-means clustering?

A

Requires predefining k

Sensitive to initialization

Sensitive to noise and outliers

Struggles with non-globular cluster shapes

13
Q

What is hierarchical clustering?

A

Hierarchical clustering builds a hierarchy of clusters by either merging (agglomerative) or splitting (divisive) data points based on similarity.

14
Q

What is agglomerative clustering?

A

Agglomerative clustering starts with each data point as its own cluster and merges the closest clusters iteratively until only one cluster remains.

15
Q

What are different distance metrics in agglomerative clustering?

A

Single linkage: Distance between the closest points of two clusters

Complete linkage: Distance between the farthest points of two clusters

Centroid distance: Distance between cluster centroids

Group average: Average of all pairwise distances between points in the two clusters
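Assuming SciPy is available, its `linkage` function implements all four criteria; the method names below are SciPy's (`"average"` is the group average, `"centroid"` the centroid distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                  # (n-1) x 4 merge history
    labels = fcluster(Z, 2, criterion="maxclust")  # cut into 2 flat clusters
    print(method, labels)  # each pair ends up in its own cluster
```

On data this well separated, all four linkage criteria agree; they differ mainly on elongated or noisy clusters (e.g., single linkage "chains" through nearby points, complete linkage favors compact clusters).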

16
Q

What is a dendrogram?

A

A dendrogram is a tree-like diagram that visualizes the merging process in hierarchical clustering.
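With SciPy, the merge history produced by `linkage` can be rendered as a dendrogram; with matplotlib installed, `dendrogram(Z)` draws the tree directly, while `no_plot=True` just returns the layout:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
Z = linkage(X, method="single")  # 3 merges for 4 points
d = dendrogram(Z, no_plot=True)  # layout only; omit no_plot to draw
print(d["ivl"])          # leaf labels in display order along the x-axis
print(len(d["dcoord"]))  # one height entry per merge -> 3
```

The height at which two branches join corresponds to the distance at which those clusters were merged, so cutting the tree at a chosen height yields a flat clustering.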

17
Q

What are the strengths of agglomerative clustering?

A

Produces deterministic results (no random initialization)

The dendrogram can be cut at any level, yielding multiple possible cluster configurations

No need to predefine k

Can handle arbitrarily shaped clusters (with single linkage)

18
Q

What are the weaknesses of agglomerative clustering?

A

Computationally expensive for large datasets

Requires defining a distance metric

19
Q

What is the difference between partition-based and hierarchical clustering?

A

Partition-based clustering (e.g., k-means) requires a predefined number of clusters and assigns data points to clusters iteratively.

Hierarchical clustering builds a tree-like structure of clusters and does not require a predefined number of clusters.

20
Q

What is the purpose of clustering in data analysis?

A

Clustering helps with data exploration, pattern discovery, data compression, anomaly detection, and feature engineering for supervised learning models.