07_unsupervised learning Flashcards
What is the goal for unsupervised problems?
find a transformation (T)
that builds a compact internal representation
of unlabeled data (x)
to unveil its internal structure
- in other words, we want a mathematical function (a model) that builds a representation, i.e. an efficient way to represent the data
What is the difference between supervised and unsupervised learning?
in unsupervised learning…
- no labels are required, whereas those are crucial in supervised learning
- no specific task is learned (no classification, regression etc)
- train/val/test splits are not as common, but can still be useful depending on your goals
(split data sets still allow you to evaluate how well your model generalizes to unseen data)
Why use unsupervised learning?
unsupervised learning unveils structure in the data, which we want to exploit to make better use of it
- clustering
- dimensionality reduction
What is the goal of clustering in unsupervised learning?
reveal agglomerates in the data set that might indicate populations of samples with distinct properties
What is the goal of dimensionality reduction in unsupervised learning?
reduce the number of dimensions in the data for better interpretability and/or better performance of supervised learning approaches
What is Clustering?
is the task of grouping a set of objects
in such a way that objects in the same group (called a cluster)
are more similar (in some sense)
to each other than to those in other groups (clusters)
- identify latent (not observable) classes inherent to the data
- identify taxonomies in the underlying data
- identify noise/outliers
What are different approaches to clustering? (4)
- centroid-based: clusters are represented as mean vectors
- connectivity-based: clusters are defined based on distance connectivity
- distribution-based: clusters are defined by statistical distributions
- density-based: clusters are high density regions
What is a centroid-based clustering method?
k-means clustering
clusters are represented by vectors that point to their centers (centroids)
What is k-means clustering?
similarly to nn-models and many other clustering methods, k-means clustering is based on a DISTANCE METRIC
e.g. the Euclidean distance metric
k-means does not see labels in a dataset
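For instance, the Euclidean distance between two points can be computed as follows (a minimal sketch with made-up values):

```python
import numpy as np

# Hypothetical 2-D points, used only for illustration
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean distance: square root of the sum of squared coordinate differences
dist = np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b)
```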
What are the steps to implement k-means algorithm?
0) initialization: pick k random data points as cluster center locations
1) cluster assignment: compute distances between cluster centers and data points; assign each data point to the closest cluster
2) recompute cluster centers: recompute cluster center locations from all cluster members
3) reiterate steps 1 and 2
4) termination: terminate the algorithm once cluster assignments no longer change –> the clustering self-organizes
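The steps above can be sketched in plain NumPy (function and variable names are illustrative, not from the course; the sketch ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Minimal k-means sketch following steps 0-4 above."""
    rng = np.random.default_rng(seed)
    # 0) initialization: pick k random data points as cluster center locations
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 1) cluster assignment: assign each point to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4) termination: stop once assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 2) recompute cluster centers from all current cluster members
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3) reiterate steps 1 and 2 (the loop)
    return labels, centers
```

On two well-separated toy blobs this converges within a few iterations.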
What happens if cluster colors are mismatched with k-means results?
some data points are assigned to the wrong cluster and are impossible to recover
–> the k-means clusters resemble the ground-truth labels very closely, but the cluster indices (colors) are arbitrary, so mismatches between clusters and labels can occur: k-means is unsupervised and never sees the labels
How do you choose the right value of k in k-means?
- qualitative approach: visual inspection and human intuition (works well on low-dimensional and well-behaved data)
- quantitative approach: BIC (see hierarchical clustering)
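As a quantitative sketch (assuming scikit-learn is available; the BIC criterion from the hierarchical-clustering section is not shown here, so the within-cluster sum of squares, i.e. the "elbow" method, is used as a stand-in proxy):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])

# Within-cluster sum of squares (inertia) for a range of k values;
# an "elbow" where the curve flattens suggests a reasonable k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 2, 3)]
```

Here the large drop from k=1 to k=2 (and the much smaller drop afterwards) points to k=2.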
What are limitations of k-means method?
- relying on a distance metric, k-means intrinsically expects radially symmetric ("Gaussian") clusters
- proper data scaling is crucial
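A minimal standardization sketch (feature values are made up) illustrating why scaling matters: without it, the feature with the larger numeric range dominates the Euclidean distances that k-means relies on:

```python
import numpy as np

# Hypothetical data: feature 0 spans ~1 unit, feature 1 spans ~hundreds,
# so raw Euclidean distances are dominated by feature 1
X = np.array([[1.0, 500.0],
              [2.0, 800.0],
              [1.5, 200.0]])

# Standardize each feature to zero mean and unit variance before clustering
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, both features contribute comparably to the distance metric.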
What is a distribution-based clustering method?
expectation maximization clustering
What is a big disadvantage of k-means/agglomerative clustering?
hard cluster assignment
–> each data point is assigned to a single cluster, no information on likelihood
how to compute likelihoods for each data point to belong to individual clusters?
–> EM (expectation maximization) clustering
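A soft-assignment sketch (assuming scikit-learn is available; data points are made up), where EM clustering via a Gaussian mixture model yields per-point cluster likelihoods instead of hard labels:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])

# EM clustering: fit a mixture of two Gaussians
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: probability of each point belonging to each cluster,
# rather than a single hard cluster label
probs = gm.predict_proba(X)  # shape (n_samples, 2), each row sums to 1
```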