07_unsupervised learning Flashcards

1
Q

What is the goal for unsupervised problems?

A

find a transformation (T)
that builds a compact internal representation
of unlabeled data (x)
to unveil its internal structure

  • in other words: we want a mathematical function, a model that builds a representation of the data in an efficient way
2
Q

What is the difference between supervised and unsupervised learning?

A

in unsupervised learning…

  • no labels are required, whereas labels are crucial in supervised learning
  • no specific task is learned (no classification, regression, etc.)
  • train/val/test splits are not as common, but can still be useful depending on your goals

(split data sets still allow you to evaluate how well your model generalizes to unseen data)

3
Q

Why use unsupervised learning?

A

unsupervised learning unveils structure in the data, which we want to exploit to make better use of it

  • clustering
  • dimensionality reduction
4
Q

What is the goal of clustering in unsupervised learning?

A

reveal agglomerates in the data set that might indicate populations of samples with distinct properties

5
Q

What is the goal of dimensionality reduction in unsupervised learning?

A

reduce the number of dimensions in the data for better interpretability and/or better performance of supervised learning approaches

6
Q

What is Clustering?

A

is the task of grouping a set of objects
in such a way that objects in the same group (called a cluster)
are more similar (in some sense)
to each other than to those in other groups (clusters)

  • identify latent (not observable) classes inherent to the data
  • identify taxonomies in the underlying data
  • identify noise/outliers
7
Q

What are different approaches to clustering? (4)

A
  • centroid-based: clusters are represented as mean vectors
  • connectivity-based: clusters are defined based on distance connectivity
  • distribution-based: clusters are defined by statistical distributions
  • density-based: clusters are high density regions
8
Q

What is a centroid-based clustering method?

A

k-means clustering

clusters are represented by vectors that point to their centers (centroids)

9
Q

What is k-means clustering?

A

similar to NN models and many other clustering methods, k-means clustering is based on a DISTANCE METRIC,
e.g. the Euclidean distance metric

k-means does not see labels in a dataset
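
As a minimal illustration (NumPy, with arbitrarily chosen vectors), the Euclidean distance between two feature vectors:

```python
import numpy as np

# Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
d = np.linalg.norm(x - y)  # sqrt(3^2 + 4^2) = 5.0
```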

10
Q

What are the steps to implement k-means algorithm?

A

0) initialization: pick k random data points as cluster center locations

1) cluster assignment: compute distances between cluster centers and data points; assign each data point to the closest cluster

2) recompute cluster centers: recompute the cluster center locations from all cluster members

3) reiterate steps 1 and 2

4) termination: terminate the algorithm once the cluster assignments no longer change –> the algorithm self-organizes
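
A minimal NumPy sketch of these steps, for illustration only (it ignores edge cases such as clusters that end up with no members):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # 0) initialization: pick k random data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    # 3) reiterate steps 1 and 2
    for _ in range(max_iter):
        # 1) cluster assignment: distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4) termination: stop once assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 2) recompute each cluster center from its current members
        centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return labels, centers
```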

11
Q

What happens if cluster colors are mismatched with k-means results?

A

some data points cannot be recovered correctly

–> the k-means clusters may resemble the ground-truth labels very closely, but the cluster indices k-means assigns are arbitrary, so mismatches between clusters and label colors go unnoticed: k-means is unsupervised

12
Q

What is the right value of k in k-means discussion?

A

  • qualitative approach: visual inspection and human intuition (works well on low-dimensional and well-behaved data)
  • quantitative approach: BIC (see hierarchical clustering)

13
Q

What are limitations of k-means method?

A
  • relying on a distance metric, k-means intrinsically expects radially symmetric (“Gaussian”) clusters
  • proper data scaling is crucial
14
Q

What is a distribution-based clustering method?

A

expectation maximization clustering

15
Q

What is a big disadvantage of k-means/agglomerative clustering?

A

hard cluster assignment

–> each data point is assigned to a single cluster, no information on likelihood

how to compute likelihoods for each data point to belong to individual clusters?
–> EM (expectation maximization) clustering

16
Q

How can we compute the likelihoods for each data point to belong to an individual cluster?

A

EM clustering
(expectation maximization clustering)

17
Q

What is expectation maximization clustering?

A

the class-affiliation probability for each of the k clusters can be approximated by a Gaussian N(x; µᵢ, Σᵢ);
EM clustering is therefore a parametric clustering method

what is the probability for data point x to belong to cluster i?
Intuition: if |x − µᵢ| is small, the probability that x belongs to cluster i is large;
if |x − µᵢ| is large, that probability is small

we treat the data distribution as a mixture of Gaussians (Gaussian mixture model)

18
Q

What are the steps to implement expectation maximization clustering?

A

0) initialize parameters: pick random data points for the centroids µᵢ, and adopt default values for the covariances Σᵢ

1) “expectation step”: for each data point j calculate probability to belong to cluster i

2) “maximization step”: recalculate model parameters to better fit the previously derived probability distributions (maximize probability)
–> changes the centroids and also the shape of the clusters

3) reiterate steps 1 and 2

4) termination: terminate algorithm if cluster assignments do not change anymore
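
A compact NumPy/SciPy sketch of these two steps for a Gaussian mixture; it runs a fixed number of iterations instead of checking convergence, and the small ridge on the covariances is only there for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, seed=0, n_iter=50):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 0) initialize: random data points as means, default (identity) covariances
    mu = X[rng.choice(n, size=k, replace=False)]
    cov = np.array([np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)  # mixture weights
    for _ in range(n_iter):
        # 1) expectation step: probability of each point j belonging to cluster i
        resp = np.column_stack([
            pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=cov[i])
            for i in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # 2) maximization step: refit means, covariances and weights
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            cov[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        pi = Nk / n
    return resp, mu, cov, pi  # resp holds the soft cluster assignments
```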

19
Q

What are the differences on the assumptions on the data from k-means clustering and expectation maximization clustering?

A

k-means:
distributions of radially symmetric (Gaussian) clusters

EM:
distributions of elongated (multivariate normal) clusters

20
Q

How do k-means and Expectations maximization clustering relate?

A

similar algorithm, iterative two-step process for fitting of model
(k-means is actually a special case of EM)

with k-means, the covariances are fixed and only the centroids change

21
Q

What is the difference on cluster assignments between k-means and expectation maximization clustering?

A

k-means:
Hard assignment (each data point is assigned to exactly one cluster)

EM:
Soft assignment (each data point has a probability associated with each cluster)
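
To make the contrast concrete, a short scikit-learn sketch (synthetic data; the parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))

# k-means: hard assignment, exactly one cluster label per data point
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# EM (Gaussian mixture): soft assignment, one probability per cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft = gmm.predict_proba(X)  # shape (200, 3), each row sums to 1
```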

22
Q

What is a connectivity-based clustering method?

A

hierarchical clustering

23
Q

What is hierarchical clustering?

A

builds a hierarchy of clusters based on the provided data set. the number of clusters found, k, varies with each iteration and can be considered its hyperparameter.

hierarchical clustering is non-parametric

assumes that data points that are close to each other are more likely to be part of the same cluster

24
Q

What are two different hierarchical-clustering approaches that work very similarly?

A
  • agglomerative clustering (bottom-up)
    –> clusters are merged based on a distance function
  • divisive clustering (top-down)
    –> clusters are split based on a distance function
25
Q

What are steps to agglomerative clustering?

A

0) initialization: each data point forms its own cluster

1) merge closest neighbors: find the closest neighbors and merge them (Euclidean distance metric with the “single-linkage” criterion)

2) merge closest neighbors: find the closest neighbors and merge them (if a neighbor has already been merged in the same step, simply skip it)

3) merge closest neighbors: find the closest neighbors and merge them (same as step 2)

4) termination: all data points form a single cluster

–> this forms a dendrogram, a “family tree” of the data

at different iterations we find different numbers of clusters
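
A brief SciPy sketch of agglomerative clustering with single linkage (synthetic data; the cut at k = 3 clusters is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

# bottom-up merging with Euclidean distances and the single-linkage criterion
Z = linkage(X, method="single", metric="euclidean")

# cut the dendrogram ("family tree") so that k = 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
```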

26
Q

What is the “best-fit” number of clusters k for agglomerative clustering?

A
  • qualitative approach: visual inspection and human intuition (works well for low-dimensional and well-behaved data)
  • quantitative approach: Bayes Information Criterion (BIC)
27
Q

What is BIC?

A

Bayes Information Criterion

BIC = k · ln(n) − 2 · ln(L̂)

where k = number of model parameters, n = number of data points, and L̂ = maximum likelihood of the model describing the data

Idea: minimize BIC to find the least complex model that still fits the data well
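
One possible way to use this in practice, sketched with scikit-learn's GaussianMixture (which exposes a bic() method; the data and the range of k are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))

# fit models of increasing complexity and keep the k with the lowest BIC
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)]
best_k = int(np.argmin(bics)) + 1  # +1 because k starts at 1
```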

28
Q

What is an advantage of agglomerative clustering?

A

much better able to deal with non-Gaussian clusters

29
Q

What is a disadvantage of agglomerative clustering?

A

data points that actually belong to the same (highly elongated) cluster might be assigned to different clusters

30
Q

What is a density-based clustering method?

A

DBSCAN

Density-based spatial clustering of applications with noise

31
Q

What does DBSCAN stand for?

A

Density-based spatial clustering of applications with noise

32
Q

What is DBSCAN?

A

defines clusters based on local over-densities in the data

supports the notion of noise: data points that are not part of any cluster are considered noise

uses two hyperparameters:
  • epsilon: the radius within which to define a neighborhood
  • N: the minimum number of data points per neighborhood to form a cluster

33
Q

What are steps to the DBSCAN algorithm?

A

0) initialization: pick a random data point and place an epsilon-neighborhood around it

1) find the number of neighbors: check how many neighbors the data point has within its epsilon-neighborhood (if there are not enough neighbors, the data point is considered noise)

2) create a new cluster: assign it to the current data point, then repeat this step with the same cluster label for all data points in the neighborhood

3) reiterate steps 1 and 2: each remaining data point is either noise or seeds a new neighborhood

4) termination: the algorithm terminates once every data point has been assigned to a cluster or labeled as noise
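
In practice one would typically not implement this by hand; a minimal scikit-learn sketch (the eps and min_samples values are arbitrary and need tuning, see the next card):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))

# eps = neighborhood radius (epsilon), min_samples = minimum points N per neighborhood
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # cluster index per data point; -1 marks noise
```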

34
Q

How sensitive are the results of DBSCAN to the set of hyperparameters?

A

high sensitivity to epsilon:

small ε: clusters are too small
large ε: clusters are too big

–> needs a considerable amount of fine-tuning

35
Q

What are advantages of DBSCAN? (2)

A
  • often provides clearly delineated clusters of different shapes, if the hyperparameters were chosen properly
  • ability to identify noise
36
Q

What is a dimensionality reduction method?

A

principal component analysis (PCA)

37
Q

What is dimensionality reduction?

A

helps to reduce the number of dimensions (features) in the data set

closely related to feature selection: provides a means to identify the important features in the data set

38
Q

What is principal component analysis (PCA)?

A

fits a multivariate normal ellipsoid to the data set. each axis of the ellipsoid is called a PRINCIPAL COMPONENT of the data set, and its length relates to the variance along that axis (the more important a principal component, the longer it is)
–> important = descriptive for the data set

each principal component is a linear combination of the feature vectors –> together they form an orthonormal basis for the data set

to achieve dimensionality reduction, only a subset of all principal components is utilized to represent the data

this allows you to transform (rotate) a data set into a more meaningful representation

39
Q

How is PCA done? (principal component analysis)

A

mathematically, the principal components are the EIGENVECTORS of the data set’s covariance matrix

–> obtained via eigendecomposition or singular value decomposition (SVD)

requires the underlying data to be scaled properly (zero mean, unit variance) –> otherwise the mean offset strongly affects the principal components: the first component will most likely point to the mean of the data set
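
A minimal NumPy sketch of PCA via singular value decomposition, including the required scaling (synthetic data; keeping 2 components is an arbitrary choice):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))

# scale properly: zero mean and unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# the principal components are the eigenvectors of the covariance matrix,
# obtained here via singular value decomposition of the scaled data
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
components = Vt                        # rows are principal components
explained_var = S**2 / (len(Xs) - 1)   # variance along each component

# dimensionality reduction: project onto the first 2 components
X_reduced = Xs @ components[:2].T
```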