07_unsupervised learning Flashcards

1
Q

What is the goal for unsupervised problems?

A

find a transformation (T)
that builds a compact internal representation
of unlabeled data (x)
to unveil its internal structure

  • in other words: we want a mathematical function, a model that builds a representation of the data in an efficient way
2
Q

What is the difference between supervised and unsupervised learning?

A

in unsupervised learning…

  • no labels are required, whereas labels are crucial in supervised learning
  • no specific task is learned (no classification, regression, etc.)
  • train/val/test splits are not as common, but can still be useful depending on your goals

(split data sets still allow you to evaluate how well your model generalizes to unseen data)

3
Q

Why use unsupervised learning?

A

unsupervised learning unveils structure in the data, which we want to exploit to make better use of it

  • clustering
  • dimensionality reduction
4
Q

What is the goal of clustering in unsupervised learning?

A

reveal agglomerates in the data set that might indicate populations of samples with distinct properties

5
Q

What is the goal of dimensionality reduction in unsupervised learning?

A

reduce the number of dimensions in the data for better interpretability and/or better performance of supervised learning approaches

6
Q

What is Clustering?

A

is the task of grouping a set of objects
in such a way that objects in the same group (called a cluster)
are more similar (in some sense)
to each other than to those in other groups (clusters)

  • identify latent (not observable) classes inherent to the data
  • identify taxonomies in the underlying data
  • identify noise/outliers
7
Q

What are different approaches to clustering? (4)

A
  • centroid-based: clusters are represented as mean vectors
  • connectivity-based: clusters are defined based on distance connectivity
  • distribution-based: clusters are defined by statistical distributions
  • density-based: clusters are high density regions
8
Q

What is a centroid-based clustering method?

A

k-means clustering

clusters are represented by vectors that point to their centers (centroids)

9
Q

What is k-means clustering?

A

similar to NN models and many other clustering methods, k-means clustering is based on a DISTANCE METRIC,
e.g. the Euclidean distance metric

k-means does not see labels in a dataset
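
As a minimal illustration (NumPy, with arbitrarily chosen vectors), the Euclidean distance between two feature vectors:

```python
import numpy as np

# Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2)
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
d = np.linalg.norm(x - y)  # sqrt(3^2 + 4^2) = 5.0
```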

10
Q

What are the steps to implement k-means algorithm?

A

0) initialization: pick k random data points as cluster center locations

1) cluster assignment: compute distances between cluster centers and data points; assign each data point to the closest cluster

2) recompute cluster centers: recompute the cluster center locations from all cluster members

3) reiterate steps 1 and 2

4) termination: terminate the algorithm once the cluster assignments no longer change –> the algorithm self-organizes
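
A minimal NumPy sketch of these steps, for illustration only (it ignores edge cases such as clusters that end up with no members):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # 0) initialization: pick k random data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    # 3) reiterate steps 1 and 2
    for _ in range(max_iter):
        # 1) cluster assignment: distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4) termination: stop once assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 2) recompute each cluster center from its current members
        centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return labels, centers
```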

11
Q

What happens if cluster colors are mismatched with k-means results?

A

some data points cannot be recovered correctly

–> the k-means clusters may resemble the ground-truth labels very closely, but the cluster indices k-means assigns are arbitrary, so mismatches between clusters and label colors go unnoticed: k-means is unsupervised

12
Q

What is the right value of k in k-means discussion?

A

  • qualitative approach: visual inspection and human intuition (works well on low-dimensional and well-behaved data)
  • quantitative approach: BIC (see hierarchical clustering)

13
Q

What are limitations of k-means method?

A
  • relying on a distance metric, k-means intrinsically expects radially symmetric (“Gaussian”) clusters
  • proper data scaling is crucial
14
Q

What is a distribution-based clustering method?

A

expectation maximization clustering

15
Q

What is a big disadvantage of k-means/agglomerative clustering?

A

hard cluster assignment

–> each data point is assigned to a single cluster, no information on likelihood

how to compute likelihoods for each data point to belong to individual clusters?
–> EM (expectation maximization) clustering

16
Q

How can we compute the likelihoods for each data point to belong to an individual cluster?

A

EM clustering
(expectation maximization clustering)

17
Q

What is expectation maximization clustering?

A

the class-affiliation probability for each of the k clusters can be approximated by a Gaussian N(x; µᵢ, Σᵢ);
EM clustering is therefore a parametric clustering method

what is the probability for data point x to belong to cluster i?
Intuition: if |x − µᵢ| is small, the probability that x belongs to cluster i is large;
if |x − µᵢ| is large, that probability is small

we treat the data distribution as a mixture of Gaussians (Gaussian mixture model)

18
Q

What are the steps to implement expectation maximization clustering?

A

0) initialize parameters: pick random data points for the centroids µᵢ, and adopt default values for the covariances Σᵢ

1) “expectation step”: for each data point j calculate probability to belong to cluster i

2) “maximization step”: recalculate model parameters to better fit the previously derived probability distributions (maximize probability)
–> changes the centroids and also the shape of the clusters

3) reiterate steps 1 and 2

4) termination: terminate algorithm if cluster assignments do not change anymore
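
A compact NumPy/SciPy sketch of these two steps for a Gaussian mixture; it runs a fixed number of iterations instead of checking convergence, and the small ridge on the covariances is only there for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, seed=0, n_iter=50):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 0) initialize: random data points as means, default (identity) covariances
    mu = X[rng.choice(n, size=k, replace=False)]
    cov = np.array([np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)  # mixture weights
    for _ in range(n_iter):
        # 1) expectation step: probability of each point j belonging to cluster i
        resp = np.column_stack([
            pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=cov[i])
            for i in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # 2) maximization step: refit means, covariances and weights
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            cov[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        pi = Nk / n
    return resp, mu, cov, pi  # resp holds the soft cluster assignments
```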

19
Q

What are the differences on the assumptions on the data from k-means clustering and expectation maximization clustering?

A

k-means:
distributions of radially symmetric (Gaussian) clusters

EM:
distributions of elongated (multivariate normal) clusters

20
Q

How do k-means and Expectations maximization clustering relate?

A

similar algorithm, iterative two-step process for fitting of model
(k-means is actually a special case of EM)

with k-means, the covariances are fixed and only the centroids change

21
Q

What is the difference on cluster assignments between k-means and expectation maximization clustering?

A

k-means:
Hard assignment (each data point is assigned to exactly one cluster)

EM:
Soft assignment (each data point has a probability associated with each cluster)
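
To make the contrast concrete, a short scikit-learn sketch (synthetic data; the parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))

# k-means: hard assignment, exactly one cluster label per data point
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# EM (Gaussian mixture): soft assignment, one probability per cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft = gmm.predict_proba(X)  # shape (200, 3), each row sums to 1
```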

22
Q

What is a connectivity-based clustering method?

A

hierarchical clustering

23
Q

What is hierarchical clustering?

A

builds a hierarchy of clusters based on the provided data set. the number of clusters found, k, varies with each iteration and can be considered its hyperparameter.

hierarchical clustering is non-parametric

assumes that data points that are close to each other are more likely to be part of the same cluster

24
Q

What are two different hierarchical-clustering approaches that work very similarly?

A
  • agglomerative clustering (bottom-up)
    –> clusters are merged based on a distance function
  • divisive clustering (top-down)
    –> clusters are split based on a distance function
25
Q

What are steps to agglomerative clustering?

A

0) initialization: each data point forms its own cluster

1) merge closest neighbors: find the closest neighbors and merge them (Euclidean distance metric with the “single-linkage” criterion)

2) merge closest neighbors: find the closest neighbors and merge them (if a neighbor has already been merged in the same step, simply skip it)

3) merge closest neighbors: find the closest neighbors and merge them (same as step 2)

4) termination: all data points form a single cluster

–> this forms a dendrogram, a “family tree” of the data

at different iterations we find different numbers of clusters
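
A brief SciPy sketch of agglomerative clustering with single linkage (synthetic data; the cut at k = 3 clusters is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

# bottom-up merging with Euclidean distances and the single-linkage criterion
Z = linkage(X, method="single", metric="euclidean")

# cut the dendrogram ("family tree") so that k = 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
```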

26
Q

What is the “best-fit” number of clusters k for agglomerative clustering?

A
  • qualitative approach: visual inspection and human intuition (works well for low-dimensional and well-behaved data)
  • quantitative approach: Bayes Information Criterion (BIC)
27
Q

What is BIC?

A

Bayes Information Criterion

BIC = k · ln(n) − 2 · ln(L̂)

where k = number of model parameters, n = number of data points, and L̂ = maximum likelihood of the model describing the data

Idea: minimize BIC to find the least complex model that still fits the data well
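
One possible way to use this in practice, sketched with scikit-learn's GaussianMixture (which exposes a bic() method; the data and the range of k are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))

# fit models of increasing complexity and keep the k with the lowest BIC
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)]
best_k = int(np.argmin(bics)) + 1  # +1 because k starts at 1
```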

28
Q

What is an advantage of agglomerative clustering?

A

much better able to deal with non-Gaussian clusters

29
Q

What is a disadvantage of agglomerative clustering?

A

data points that actually belong to the same (highly elongated) cluster might be assigned to different clusters

30
Q

What is a density-based clustering method?

A

DBSCAN

Density-based spatial clustering of applications with noise

31
Q

What does DBSCAN stand for?

A

Density-based spatial clustering of applications with noise

32
Q

What is DBSCAN?

A

defines clusters based on local over-densities in the data

supports the notion of noise: data points that are not part of any cluster are considered noise

uses two hyperparameters:
  • epsilon: the radius within which to define a neighborhood
  • N: the minimum number of data points per neighborhood to form a cluster

33
Q

What are steps to the DBSCAN algorithm?

A

0) initialization: pick a random data point and place an epsilon-neighborhood around it

1) find the number of neighbors: check how many neighbors the data point has within its epsilon-neighborhood (if there are not enough neighbors, the data point is considered noise)

2) create a new cluster: assign it to the current data point, then repeat this step with the same cluster label for all data points in the neighborhood

3) reiterate steps 1 and 2: each remaining data point is either noise or seeds a new neighborhood

4) termination: the algorithm terminates once every data point has been assigned to a cluster or labeled as noise
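
In practice one would typically not implement this by hand; a minimal scikit-learn sketch (the eps and min_samples values are arbitrary and need tuning, see the next card):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))

# eps = neighborhood radius (epsilon), min_samples = minimum points N per neighborhood
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # cluster index per data point; -1 marks noise
```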

34
Q

How sensitive are the results of DBSCAN to the set of hyperparameters?

A

high sensitivity to epsilon:

small ε: clusters are too small
large ε: clusters are too big

–> needs a considerable amount of fine-tuning

35
Q

What are advantages of DBSCAN? (2)

A
  • often provides clearly delineated clusters of different shapes, if the hyperparameters were chosen properly
  • ability to identify noise
36
Q

What is a dimensionality reduction method?

A

principal component analysis (PCA)

37
Q

What is dimensionality reduction?

A

helps to reduce the number of dimensions (features) in the data set

closely related to feature selection: provides a means to identify the important features in the data set

38
Q

What is principal component analysis (PCA)?

A

fits a multivariate normal ellipsoid to the data set. each axis of the ellipsoid is called a PRINCIPAL COMPONENT of the data set, and its length relates to the variance along that axis (the more important a principal component, the longer it is)
–> important = descriptive for the data set

each principal component is a linear combination of the feature vectors –> together they form an orthonormal basis for the data set

to achieve dimensionality reduction, only a subset of all principal components is utilized to represent the data

this allows you to transform (rotate) a data set into a more meaningful representation

39
Q

How is PCA done? (principal component analysis)

A

mathematically, the principal components are the EIGENVECTORS of the data set’s covariance matrix

–> obtained via eigendecomposition or singular value decomposition (SVD)

requires the underlying data to be scaled properly (zero mean, unit variance) –> otherwise the mean offset strongly affects the principal components: the first component will most likely point to the mean of the data set
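
A minimal NumPy sketch of PCA via singular value decomposition, including the required scaling (synthetic data; keeping 2 components is an arbitrary choice):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))

# scale properly: zero mean and unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# the principal components are the eigenvectors of the covariance matrix,
# obtained here via singular value decomposition of the scaled data
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
components = Vt                        # rows are principal components
explained_var = S**2 / (len(Xs) - 1)   # variance along each component

# dimensionality reduction: project onto the first 2 components
X_reduced = Xs @ components[:2].T
```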