Unsupervised Learning Flashcards

1
Q

Too much data?
* Reduce it

A
  • by number of examples – clustering
  • by number of dimensions – dimensionality reduction
2
Q

Clustering:

A

Find meaningful groupings in data

3
Q

Topic modeling:

A

Discover topics/groups with which we can tag data points

4
Q

Dimensionality reduction/compression:

A

Compress data set by removing redundancy and retaining only useful information

5
Q

K-means

A

Unsupervised method for clustering
* Partition of data into k clusters based on features (k must be given)
* A cluster is represented by its centroid
* Each point is assigned to its nearest centroid/cluster

6
Q

K-means Goal

A

To minimize the intra-cluster distance – a measure of the distance between data points and their corresponding cluster centroid
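In symbols (a sketch assuming Euclidean distance, with µ_j the centroid of cluster C_j), this is the within-cluster sum of squares:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```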

7
Q

K-means: Iterative

A

Pick k random points as centroids
Repeat until no further reassignment:

  • Assign each point to its nearest centroid
  • Update each cluster centroid to the average of the points assigned to it
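A minimal NumPy sketch of this loop (assumptions: Euclidean distance, rows of X are examples; empty clusters are not handled, so this is illustrative rather than production code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: X is an (m, n) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random points
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no further reassignment
            break
        centroids = new_centroids
    return centroids, labels
```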
8
Q

A good clustering is one that achieves

A
  • High within-cluster similarity
  • Low inter-cluster similarity
9
Q

K-means clustering is fast and works well if clusters:

A
  • are spherically shaped
  • are well separated
  • have similar volumes
  • have a similar number of points (similar densities)
10
Q

K-means Issues

A

  • clusters of very different sizes
  • clusters with different densities
  • non-spherical clusters

11
Q

Expectation Maximization (EM)
Algorithm

A
  • Generate k random probability distributions
    Iterate until convergence:
  • Find which points are most likely to have been generated by each probability distribution (E-step)
  • Re-calculate the distributions' parameters (M-step)
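As a concrete instance, scikit-learn's GaussianMixture runs this EM loop for a mixture of Gaussians; a minimal sketch, assuming a toy NumPy data matrix:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)                           # alternates E- and M-steps until convergence
resp = gmm.predict_proba(X)          # E-step output: per-point responsibilities
labels = gmm.predict(X)              # hard assignment to the most probable component
print(gmm.means_, gmm.covariances_)  # M-step output: fitted distribution parameters
```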
12
Q

EM Algorithm

A

Learning mixtures of Gaussians

13
Q

Hierarchical Unsupervised
Learning

A

Find successive clusters using previously established ones (organized as a hierarchical tree)

  • Agglomerative (bottom-up)
  • Divisive (top-down)
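A short SciPy sketch of the agglomerative (bottom-up) variant (assuming a toy NumPy data matrix); the linkage matrix encodes the hierarchical tree, which can then be cut at a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))   # toy data

Z = linkage(X, method='ward')                    # build the tree bottom-up
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
```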
14
Q

Non Parametric Clustering –
DBSCAN

A
  1. Find the points in the ε-distance neighbourhood of every point, and identify the core points with more than the minimum number of points to form a dense region
  2. Find the connected components of the core points on the neighbours graph, ignoring all non-core points
  3. Assign each non-core point to a nearby cluster if the cluster is an ε-distance neighbour, otherwise assign it to noise
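scikit-learn's DBSCAN follows these steps; a minimal sketch with illustrative `eps` (the ε-distance) and `min_samples` values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data

db = DBSCAN(eps=0.3, min_samples=5)   # ε-neighbourhood radius, density threshold
labels = db.fit_predict(X)            # cluster index per point; -1 marks noise
```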
15
Q

SOM (Self-organizing Maps)

A

Similar to K-means, but…
* Representatives are organized (have neighbours) in a 1D or 2D mesh
* When a representative is selected to be adapted, its neighbours are also updated (at a lower rate)
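A minimal NumPy sketch of that update for a 1-D mesh (the Gaussian neighbourhood and exponential decay are illustrative assumptions, not a specific SOM library):

```python
import numpy as np

def train_som_1d(X, n_units=10, n_iters=1000, lr=0.5, radius=2.0, seed=0):
    """1-D SOM: representatives sit on a line; the winner and its mesh
    neighbours are pulled toward each sample, neighbours at a lower rate."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, X.shape[1]))   # representatives
    grid = np.arange(n_units)                    # positions on the 1-D mesh
    for t in range(n_iters):
        x = X[rng.integers(len(X))]              # pick a random sample
        decay = np.exp(-t / n_iters)             # shrink rate and radius over time
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        h = np.exp(-(grid - winner) ** 2 / (2 * (radius * decay) ** 2))
        W += lr * decay * h[:, None] * (x - W)   # winner fully, neighbours less
    return W
```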

16
Q

Dimensionality Reduction

A
  • Representing data with fewer dimensions makes it easier to learn (fewer parameters)
  • Discover the intrinsic dimensionality of the data
  • High-dimensional data may really be lower-dimensional
  • Noise reduction
17
Q

Dimensionality Reduction: Feature selection

A

Choose a (the best) subset of all the features

18
Q

Dimensionality Reduction: Feature extraction

A

Create new features by combining existing ones

19
Q

Feature reduction: lower-dimensional projections

A

Instead of picking a subset of the features, consider ‘new’ ones by combining some of the existing features

20
Q

Lower Dimensional Projections

A

Project n-dimensional data into a k-dimensional space
* preserving information
* maximizing variance
* with minimum reconstruction error

21
Q

Principal Component Analysis (PCA)

A

Projection that best represents the data in the least-squares sense

22
Q

Smaller projection errors;

A

smaller overall sum of squares

23
Q

Principal Component Analysis: Algorithm

A

Considering an m×n data matrix X:

  • Re-center: subtract the mean from X, X_C ← X − µ
  • Compute the covariance matrix: C ← (1/m) · X_Cᵀ · X_C
  • Compute the k largest eigenvectors of C: the Principal Components, collected as columns of E
  • y = Eᵀ · X_C is the closest PCA approximation to X
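A minimal NumPy sketch of these steps (assuming rows of X are examples, and using `eigh` since the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, k):
    """PCA on an (m, n) data matrix X with rows as examples."""
    Xc = X - X.mean(axis=0)                  # re-center: subtract the mean
    C = (Xc.T @ Xc) / X.shape[0]             # covariance matrix, shape (n, n)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    E = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k largest eigenvectors
    return Xc @ E                            # projection onto the principal components
```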
24
Q

Difficulties Principal Component Analysis

A
  • Covariance matrix can (still) be huge!
  • Finding eigenvectors can be a slow process
25
Q

Topic Models

A
  • Used in Natural Language Processing
  • Use different mathematical frameworks:
    Linear algebra vs. probabilistic modeling
26
Q

Topic Models share 3 fundamental assumptions

A
  • Documents have a latent semantic structure (“topics”)
  • Can infer topics from word-document co-occurrences
  • Words are related to topics, topics to documents
27
Q

Latent Semantic Analysis (linear algebra-based topic model)

A

Extension of the Vector Space Model (words × documents)

  • Reduces the high-dimensional vector space into a smaller space by collapsing the original representation into a smaller set of latent dimensions (“topics”) that aggregates words and documents with similar context vectors
  • Based on the Singular Value Decomposition (SVD)
  • Entropy-based term weighting
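A short scikit-learn sketch of this pipeline, using tf-idf weighting as a stand-in for the entropy-based weighting mentioned above (the `docs` list is a toy assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)        # documents × words matrix
svd = TruncatedSVD(n_components=2)   # collapse into 2 latent "topics" via SVD
doc_topics = svd.fit_transform(X)    # document coordinates in topic space
term_topics = svd.components_        # topic loadings for each word
```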
28
Q

Latent Dirichlet Allocation: probabilistic topic model

A

Documents are represented as random mixtures over latent topics

Topics are characterized by a distribution over words

Words can belong to multiple topics
* z is the topic of a word in the document
* β is the word probability distribution of each topic

A document can have multiple topics:
* θ is the topic distribution of a document
* α is the distribution over topic distributions (the Dirichlet prior on θ)
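A short scikit-learn sketch of LDA in use (illustrative priors; `doc_topic_prior` corresponds to α and `n_components` to the number of latent topics, with the toy `docs` list assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]

counts = CountVectorizer().fit_transform(docs)         # word counts per document
lda = LatentDirichletAllocation(n_components=2,        # number of latent topics
                                doc_topic_prior=0.1,   # α: prior on θ (doc-topic mix)
                                topic_word_prior=0.01, # prior on each topic's word dist.
                                random_state=0)
theta = lda.fit_transform(counts)   # θ: topic distribution of each document
beta = lda.components_              # (unnormalized) word distribution of each topic
```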