Unsupervised Learning Flashcards
What are the key differences between supervised and unsupervised learning?
Supervised learning uses labeled data to predict outputs.
Unsupervised learning finds patterns or clusters in unlabeled data.
Explain how K-means clustering works, including both main steps of the algorithm.
First, choose K initial cluster centers (means). Then repeat these two steps until the assignments stop changing:
1. Assign Points: Assign data points to the nearest cluster center.
2. Update Centers: Recalculate cluster centers based on the mean of assigned points.
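The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the slides' implementation; the function name and toy data are made up for the example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means on 2-D points: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # choose K initial means from the data
    for _ in range(iters):
        # Step 1 (Assign): each point goes to the nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0])**2 + (p[1] - centers[c][1])**2)
            clusters[i].append(p)
        # Step 2 (Update): each center becomes the mean of its assigned points.
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments are stable
            break
        centers = new_centers
    return centers, clusters

# Two well-separated blobs; K-means should find one center per blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```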
What are the major advantages and disadvantages of K-means clustering mentioned in the slides?
Advantages: Easy to implement, fairly fast, and adaptable: a run with poor initial settings can simply be repeated with new ones.
Disadvantages: Heavily dependent on initial seed, doesn’t use meta-information, and struggles with non-spherical clusters.
Describe how DBSCAN differs from K-means clustering and why it might perform better on certain types of data.
DBSCAN identifies clusters based on density, handles noise, and works well with non-spherical clusters, unlike K-means, which assumes spherical clusters.
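The density-based idea can be sketched in plain Python. This is a bare-bones illustration of DBSCAN's core/noise logic, not the slides' code; `eps`, `min_pts`, and the toy data are illustrative.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters outward from dense 'core'
    points; points reachable from no core point are labeled noise (-1)."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0])**2 + (points[i][1] - q[1])**2 <= eps**2]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not dense enough to be a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster            # start a new cluster and expand it
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # border point: reachable but not core
                labels[j] = cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:   # j is also a core point: keep growing
                queue.extend(more)
        cluster += 1
    return labels

# Two dense blobs plus one isolated point: the outlier becomes noise,
# where K-means would have been forced to assign it to a cluster.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```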
Explain what a ‘latent representation’ is.
A latent representation is a compressed version of data in a lower-dimensional space, capturing its key features.
What types of data can be used with K-means clustering according to the slides?
K-means works with any type of data over which a numerical distance metric can be defined.
Why might choosing initial means that are far apart be beneficial in K-means clustering?
Far-apart initial means reduce the chance of poor convergence by starting with distinct clusters.
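One way to make this concrete is greedy farthest-point seeding: pick one point, then repeatedly pick the point farthest from all seeds chosen so far (the randomized version of this idea is what k-means++ uses). A small sketch, with illustrative names and data:

```python
def far_apart_seeds(points, k):
    """Greedy farthest-point seeding: each new seed is the point whose
    nearest existing seed is as far away as possible."""
    seeds = [points[0]]
    while len(seeds) < k:
        def dist_to_seeds(p):
            # Squared distance from p to its closest already-chosen seed.
            return min((p[0] - s[0])**2 + (p[1] - s[1])**2 for s in seeds)
        seeds.append(max(points, key=dist_to_seeds))
    return seeds

# The second seed lands in the far blob, so each cluster starts distinct.
pts = [(0, 0), (1, 0), (10, 10), (10, 9)]
seeds = far_apart_seeds(pts, k=2)
```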
How does the ‘bottleneck’ in an autoencoder contribute to dimensionality reduction?
The bottleneck forces data to compress into fewer dimensions, keeping only the most important features.
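For intuition, a *linear* autoencoder with a 2-unit bottleneck can at best learn the optimal 2-D projection of the data, which SVD gives directly (the PCA solution). The NumPy sketch below fakes the bottleneck this way; a real autoencoder is a trained neural network with nonlinearities, and all names here are illustrative.

```python
import numpy as np

# Illustrative data: 3-D points that actually lie on a 2-D plane (z = x + y),
# so a 2-dimensional bottleneck loses nothing.
rng = np.random.default_rng(0)
xy = rng.normal(size=(100, 2))
X = np.column_stack([xy, xy.sum(axis=1)])        # shape (100, 3), rank 2

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
encode = lambda A: (A - mean) @ Vt[:2].T         # 3 dims -> 2-dim latent code
decode = lambda Z: Z @ Vt[:2] + mean             # 2-dim code -> 3-dim output

Z = encode(X)                   # the latent representation, shape (100, 2)
X_hat = decode(Z)
err = np.abs(X_hat - X).max()   # ~0: the key features survived the bottleneck
```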
What are the different metrics that can be minimized in K-means clustering besides distance to center?
Maximum distance to a centroid
Sum, over all clusters, of the average distance to the centroid
Sum of variance over clusters
Total distance between all points and their centroids
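Each of these objectives is easy to compute for a given clustering. A short sketch over a hypothetical two-cluster assignment (data and names are made up for illustration):

```python
import math

def dist(p, c):
    return math.hypot(p[0] - c[0], p[1] - c[1])

# Hypothetical clustering: (points, centroid) pairs; every point is
# exactly distance 1 from its centroid to keep the arithmetic obvious.
clusters = [
    ([(0, 0), (0, 2)], (0, 1)),
    ([(10, 10), (10, 12)], (10, 11)),
]

# The four alternative objectives from the card, for this clustering:
max_dist   = max(dist(p, c) for pts, c in clusters for p in pts)
sum_avg    = sum(sum(dist(p, c) for p in pts) / len(pts) for pts, c in clusters)
sum_var    = sum(sum(dist(p, c)**2 for p in pts) / len(pts) for pts, c in clusters)
total_dist = sum(dist(p, c) for pts, c in clusters for p in pts)
```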
How does DBSCAN handle noise in the data compared to K-means clustering?
DBSCAN marks low-density points as noise, while K-means forces all points into clusters.
What are the main limitations of K-means clustering when dealing with non-spherical data?
K-means assumes clusters are spherical and struggles with irregular shapes or varying densities.
Explain why running multiple iterations with different starting means is recommended for K-means clustering.
Different starting means help avoid poor convergence and improve the chance of finding optimal clusters.
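A restart loop can be sketched by running K-means from several seeds and keeping the run with the lowest inertia (sum of squared distances to the nearest center). A tiny 1-D version, with illustrative names and data; scikit-learn's `KMeans` does the same thing via its `n_init` parameter.

```python
import random

def kmeans_1d(xs, k, seed, iters=50):
    """Tiny 1-D K-means; returns (inertia, centers) so runs can be compared."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: (x - centers[i])**2)].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    inertia = sum(min((x - c)**2 for c in centers) for x in xs)
    return inertia, sorted(centers)

xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
# Restart from many different seeds and keep the best (lowest-inertia) run.
best_inertia, best_centers = min(kmeans_1d(xs, k=3, seed=s) for s in range(20))
```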