Unsupervised Learning Flashcards
What is unsupervised learning?
Unsupervised learning is a type of machine learning where the training data does not contain any output information (i.e., unlabeled data). The goal is to find patterns and structures in the input data.
What is clustering in unsupervised learning?
Clustering is the process of grouping similar objects into clusters based on their characteristics. It is used to create a higher-level representation of the data and for tasks such as data reduction and outlier detection.
What are some common applications of unsupervised learning?
Social network analysis and marketing (e.g., customer segmentation)
Image segmentation
Data annotation (e.g., single-cell transcriptomics)
What is the goal of clustering algorithms?
Clustering algorithms aim to form groups such that members within a group are similar to each other but different from members of other groups.
What are similarity measures in clustering?
Similarity measures define how close two instances are to each other. Examples include Euclidean distance, Manhattan distance, and cosine similarity.
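The three measures named above can be written in a few lines of plain Python (a sketch; the function names are my own):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; 1 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))  # 7.0
```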
What is a cluster center?
A cluster center is a representative data point of a cluster. For numeric data, it is the “center of mass” (mean), while for nominal data, it is the mode.
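A small illustration of both cases using Python's standard library:

```python
from statistics import mean, mode

numeric_cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
# Numeric center: component-wise mean ("center of mass").
center = tuple(mean(dim) for dim in zip(*numeric_cluster))
print(center)  # (3.0, 4.0)

nominal_cluster = ["red", "blue", "red", "green"]
# Nominal center: the most frequent value (mode).
print(mode(nominal_cluster))  # red
```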
What are within-cluster and between-cluster variations?
Within-cluster variation (WC): Measures how compact the clusters are.
Between-cluster variation (BC): Measures the distances between different clusters.
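Both quantities can be computed directly; in this sketch WC is the summed squared distance of points to their own centroid, and BC the squared distance between the two centroids (one common choice among several):

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    # Component-wise mean of a cluster's members.
    return tuple(sum(dim) / len(points) for dim in zip(*points))

clusters = [[(1.0, 1.0), (2.0, 2.0)], [(8.0, 8.0), (9.0, 9.0)]]
centroids = [centroid(c) for c in clusters]

# Within-cluster variation: how far members sit from their own centroid.
wc = sum(sq_dist(p, centroids[i]) for i, c in enumerate(clusters) for p in c)

# Between-cluster variation: how far apart the cluster centroids are.
bc = sq_dist(centroids[0], centroids[1])

print(wc, bc)  # 2.0 98.0
```

Good clusterings have small WC (compact clusters) and large BC (well-separated clusters).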
What is the k-means algorithm?
K-means is a partition-based clustering algorithm that follows these steps:
Define the number of clusters (k).
Choose k initial centroids randomly.
Assign each data object to the nearest centroid.
Compute new centroids as the mean of cluster members.
Repeat steps 3 and 4 until cluster membership no longer changes.
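The steps above can be sketched in plain Python (a minimal illustration, not an optimized implementation):

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: choose k initial centroids at random from the data.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 5: stop once centroids (and hence memberships) stop changing.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
clusters, centroids = kmeans(data, k=2)
```

On this toy data the two obvious groups are recovered regardless of which points are drawn as initial centroids.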
What are variations of the k-means algorithm?
Selection of the initial k means
Different dissimilarity calculations
Various strategies for calculating cluster means
Use of different distance measures
What is the elbow method in k-means?
The elbow method helps determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS) against k. The ideal k lies at the ‘elbow’: the point where the curve bends and increasing k further yields only small reductions in WCSS.
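A sketch of the elbow method using scikit-learn (assuming it is installed; `inertia_` is scikit-learn's name for the WCSS of a fitted model):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs, so the elbow should appear at k = 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

wcss = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

# Plot wcss against range(1, 7) and pick the k where the curve bends:
# here the drop from k=1 to k=2 is large, and the curve flattens afterwards.
```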
What are the strengths of k-means clustering?
Simple and easy to implement
Computationally efficient
What are the weaknesses of k-means clustering?
Requires predefining k
Sensitive to initialization
Sensitive to noise and outliers
Struggles with non-globular cluster shapes
What is hierarchical clustering?
Hierarchical clustering builds a hierarchy of clusters by either merging (agglomerative) or splitting (divisive) data points based on similarity.
What is agglomerative clustering?
Agglomerative clustering starts with each data point as its own cluster and merges the closest clusters iteratively until only one cluster remains.
What are different distance metrics in agglomerative clustering?
Single linkage: Distance between the closest points of two clusters
Complete linkage: Distance between the farthest points of two clusters
Centroid distance: Distance between cluster centroids
Group average: Average of all pairwise distances between points in the two clusters
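With SciPy (assuming it is installed), the linkage criterion is selected by the `method` argument of `scipy.cluster.hierarchy.linkage`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 flat clusters
    print(method, labels)
```

On well-separated data like this, all four criteria recover the same two groups; they differ mainly on elongated or noisy clusters.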
What is a dendrogram?
A dendrogram is a tree-like diagram that visualizes the merging process in hierarchical clustering.
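With SciPy (assuming it is installed), `dendrogram` either draws the tree into a matplotlib figure or, with `no_plot=True`, returns its layout as a dictionary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
Z = linkage(X, method="single")

# no_plot=True returns the tree layout instead of drawing it; calling
# dendrogram(Z) inside a matplotlib figure draws the diagram itself.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right display order
```

The height at which two branches join is the linkage distance at which those clusters were merged, so cutting the tree at a chosen height yields a flat clustering.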
What are the strengths of agglomerative clustering?
Produces deterministic results
Multiple possible cluster configurations (cutting the dendrogram at different heights yields different numbers of clusters)
No need to predefine k
Can handle arbitrarily shaped clusters (single-linkage)
What are the weaknesses of agglomerative clustering?
Computationally expensive for large datasets
Requires defining a distance metric
What is the difference between partition-based and hierarchical clustering?
Partition-based clustering (e.g., k-means) requires a predefined number of clusters and assigns data points to clusters iteratively.
Hierarchical clustering builds a tree-like structure of clusters and does not require a predefined number of clusters.
What is the purpose of clustering in data analysis?
Clustering helps with data exploration, pattern discovery, data compression, anomaly detection, and feature engineering for supervised learning models.