Unsupervised Learning Flashcards

1
Q

Too much data?
* Reduce it

A
  • by number of examples – clustering
  • by number of dimensions – dimensionality reduction
2
Q

Clustering:

A

Find meaningful groupings in data

3
Q

Topic modeling:

A

Discover topics/groups with which we can tag data points

4
Q

Dimensionality reduction/compression:

A

Compress data set by removing redundancy and retaining only useful information

5
Q

K-means

A

Unsupervised method for clustering
* Partition of data into k clusters based on features (k must be given)
* A cluster is represented by its centroid
* Each point is assigned to its nearest centroid/cluster

6
Q

K-means Goal

A

To minimize the intra-cluster distance – a measure of the distance between data points and their corresponding cluster centroid
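In symbols (a sketch assuming Euclidean distance, with µ_j the centroid of cluster C_j), this is the within-cluster sum of squares:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```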

7
Q

K-means: Iterative

A

Pick k random points as centroids
Repeat until no further reassignment:

  • Assign each point to its nearest centroid
  • Update each cluster centroid to the average of the points assigned to it
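A minimal NumPy sketch of this loop (assumptions: Euclidean distance, rows of X are examples; empty clusters are not handled, so this is illustrative rather than production code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: X is an (m, n) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random points
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no further reassignment
            break
        centroids = new_centroids
    return centroids, labels
```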
8
Q

A good clustering is one that achieves

A
  • High within-cluster similarity
  • Low inter-cluster similarity
9
Q

K-means clustering is fast and works well if clusters:

A
  • are spherically shaped
  • are well separated
  • have similar volumes
  • have a similar number of points (similar densities)
10
Q

K-means Issues

A

  • clusters of very different sizes
  • clusters with different densities
  • non-spherical clusters

11
Q

Expectation Maximization (EM)
Algorithm

A
  • Generate k random probability distributions
    Iterate until convergence:
  • Find which points are most likely to have been generated by each probability distribution (E-step)
  • Re-calculate the distributions' parameters (M-step)
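As a concrete instance, scikit-learn's GaussianMixture runs this EM loop for a mixture of Gaussians; a minimal sketch, assuming a toy NumPy data matrix:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)                           # alternates E- and M-steps until convergence
resp = gmm.predict_proba(X)          # E-step output: per-point responsibilities
labels = gmm.predict(X)              # hard assignment to the most probable component
print(gmm.means_, gmm.covariances_)  # M-step output: fitted distribution parameters
```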
12
Q

EM Algorithm

A

Learning mixtures of Gaussians

13
Q

Hierarchical Unsupervised
Learning

A

Find successive clusters using previously established ones (organized as a hierarchical tree)

  • Agglomerative (bottom-up)
  • Divisive (top-down)
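A short SciPy sketch of the agglomerative (bottom-up) variant (assuming a toy NumPy data matrix); the linkage matrix encodes the hierarchical tree, which can then be cut at a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))   # toy data

Z = linkage(X, method='ward')                    # build the tree bottom-up
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```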
14
Q

Non Parametric Clustering –
DBSCAN

A
  1. Find the points in the ε-distance neighbourhood of every point, and identify the core points with more than the minimum number of points to form a dense region
  2. Find the connected components of the core points on the neighbours graph, ignoring all non-core points
  3. Assign each non-core point to a nearby cluster if the cluster is an ε-distance neighbour, otherwise assign it to noise
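scikit-learn's DBSCAN follows these steps; a minimal sketch with illustrative `eps` (the ε-distance) and `min_samples` values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data

db = DBSCAN(eps=0.3, min_samples=5)   # ε-neighbourhood radius, density threshold
labels = db.fit_predict(X)            # cluster index per point; -1 marks noise
```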
15
Q

SOM (Self-organizing Maps)

A

Similar to K-means, but…
* Representatives are organized (have neighbours) in a 1D or 2D mesh
* When a representative is selected to be adapted, its neighbours are also updated (at a lower rate)
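A minimal NumPy sketch of that update for a 1-D mesh (the Gaussian neighbourhood and exponential decay are illustrative assumptions, not a specific SOM library):

```python
import numpy as np

def train_som_1d(X, n_units=10, n_iters=1000, lr=0.5, radius=2.0, seed=0):
    """1-D SOM: representatives sit on a line; the winner and its mesh
    neighbours are pulled toward each sample, neighbours at a lower rate."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, X.shape[1]))   # representatives
    grid = np.arange(n_units)                    # positions on the 1-D mesh
    for t in range(n_iters):
        x = X[rng.integers(len(X))]              # pick a random sample
        decay = np.exp(-t / n_iters)             # shrink rate and radius over time
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        h = np.exp(-(grid - winner) ** 2 / (2 * (radius * decay) ** 2))
        W += lr * decay * h[:, None] * (x - W)   # winner fully, neighbours less
    return W
```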

16
Q

Dimensionality Reduction

A
  • Representing data with fewer dimensions makes it easier to learn (fewer parameters)
  • Discover the intrinsic dimensionality of the data
  • High-dimensional data may really be lower-dimensional
  • Noise reduction
17
Q

Dimensionality Reduction: Feature selection

A

Choose a (the best) subset of all the features

18
Q

Dimensionality Reduction: Feature extraction

A

Create new features by combining existing ones

19
Q

Feature reduction: lower-dimensional projections

A

Instead of picking a subset of the features, consider ‘new’ ones by combining some of the existing features

20
Q

Lower Dimensional Projections

A

Project n-dimensional data into a k-dimensional space
* preserving information
* maximizing variance
* with minimum reconstruction error

21
Q

Principal Component Analysis (PCA)

A

Projection that best represents the data in the least-squares sense

22
Q

Smaller projection errors;

A

smaller overall sum of squares

23
Q

Principal Component Analysis: Algorithm

A

Considering an m×n data matrix X:

  • Re-center: subtract the mean from X, X_C ← X − µ
  • Compute the covariance matrix: C ← (1/m) · X_Cᵀ · X_C
  • Compute the k largest eigenvectors of C: the Principal Components, collected as columns of E
  • y = Eᵀ · X_C is the closest PCA approximation to X
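A minimal NumPy sketch of these steps (assuming rows of X are examples, and using `eigh` since the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, k):
    """PCA on an (m, n) data matrix X with rows as examples."""
    Xc = X - X.mean(axis=0)                  # re-center: subtract the mean
    C = (Xc.T @ Xc) / X.shape[0]             # covariance matrix, shape (n, n)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    E = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k largest eigenvectors
    return Xc @ E                            # projection onto the principal components
```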
24
Q

Difficulties Principal Component Analysis

A
  • Covariance matrix can (still) be huge!
  • Finding eigenvectors can be a slow process
25
Q

Topic Models

A
  • Used in Natural Language Processing
  • Use different mathematical frameworks:
    Linear algebra vs. probabilistic modeling
26
Q

Topic Models share 3 fundamental assumptions

A
  • Documents have a latent semantic structure (“topics”)
  • Can infer topics from word-document co-occurrences
  • Words are related to topics, topics to documents
27
Q

Latent Semantic Analysis (linear algebra-based topic model)

A

Extension of the Vector Space Model (words × documents)

  • Reduces the high-dimensional vector space into a smaller space by collapsing the original representation into a smaller set of latent dimensions (“topics”) that aggregates words and documents with similar context vectors
  • Based on the Singular Value Decomposition (SVD)
  • Entropy-based term weighting
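A short scikit-learn sketch of this pipeline, using tf-idf weighting as a stand-in for the entropy-based weighting mentioned above (the `docs` list is a toy assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)        # documents × words matrix
svd = TruncatedSVD(n_components=2)   # collapse into 2 latent "topics" via SVD
doc_topics = svd.fit_transform(X)    # document coordinates in topic space
term_topics = svd.components_        # topic loadings for each word
```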
28
Q

Latent Dirichlet Allocation: probabilistic topic model

A

Documents are represented as random mixtures over latent topics

Topics are characterized by a distribution over words

Words can belong to multiple topics
* z is the topic of a word in the document
* β is the word probability distribution of each topic

A document can have multiple topics:
* θ is the topic distribution of a document
* α is the distribution over topic distributions (the Dirichlet prior on θ)
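A short scikit-learn sketch of LDA in use (illustrative priors; `doc_topic_prior` corresponds to α and `n_components` to the number of latent topics, with the toy `docs` list assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]

counts = CountVectorizer().fit_transform(docs)         # word counts per document
lda = LatentDirichletAllocation(n_components=2,        # number of latent topics
                                doc_topic_prior=0.1,   # α: prior on θ (doc-topic mix)
                                topic_word_prior=0.01, # prior on each topic's word dist.
                                random_state=0)
theta = lda.fit_transform(counts)   # θ: topic distribution of each document
beta = lda.components_              # (unnormalized) word distribution of each topic
```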