Unsupervised Learning Flashcards
Too much data?
* Reduce it
- by number of examples – clustering
- by number of dimensions – dimensionality reduction
Clustering:
Find meaningful groupings in data
Topic modeling:
Discover topics/groups with which we can tag data points
Dimensionality reduction/compression:
Compress data set by removing redundancy and retaining only useful information
K-means
Unsupervised method for clustering
* Partition of data into k clusters based on features (k must be given)
* A cluster is represented by its centroid
* Each point is assigned to its nearest centroid/cluster
K-means Goal
To minimize the intra-cluster distance – the distance between each data point and its corresponding cluster centroid
K-means: Iterative
Pick k random points as centroids
Repeat until no further reassignment:
- Assign each point to its nearest centroid
- Move each cluster centroid to the average of the points assigned to it
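A minimal NumPy sketch of this loop, assuming X is an (n_samples, n_features) array and k is given (the function name, seed, and stopping test are illustrative choices, not a fixed API):

```python
# K-means: pick random centroids, assign points, recompute centroids, repeat.
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    # pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = np.full(len(X), -1)
    while True:
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            return centroids, assignment        # no further reassignment: done
        assignment = new_assignment
        # move each centroid to the average of the points assigned to it
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
```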
A good clustering is one that achieves
- High within-cluster similarity
- Low inter-cluster similarity
K-means clustering is fast and works well if clusters:
- are spherically shaped
- are well separated
- have similar volumes
- have a similar number of points (densities)
K-means Issues
- clusters of very different sizes
- clusters with different densities
- non-spherical clusters
Expectation Maximization (EM): Algorithm
- Generate k random probability distributions
Iterate until convergence:
- Find which points are more probable to be generated by each probability distribution (E-step)
- Re-calculate each distribution's parameters (M-step)
EM Algorithm
Learning mixtures of Gaussians
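A short sketch of learning a mixture of Gaussians with EM using scikit-learn (assuming X is an (n_samples, n_features) NumPy array; 3 components is an illustrative choice):

```python
# EM for a mixture of k Gaussians via scikit-learn.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                       # alternates E-step and M-step until convergence
resp = gmm.predict_proba(X)      # E-step output: responsibility of each component per point
labels = gmm.predict(X)          # most probable component per point
print(gmm.means_, gmm.weights_)  # M-step output: fitted mixture parameters
```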
Hierarchical Unsupervised Learning
Find successive clusters using previously established ones (organized as a hierarchical tree)
- Agglomerative (bottom-up)
- Divisive (top-down)
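A brief sketch of the agglomerative (bottom-up) case with SciPy, where the linkage matrix encodes the hierarchical tree (the linkage method and the number of clusters used for the cut are illustrative assumptions):

```python
# Agglomerative clustering: successively merge the closest clusters.
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X, method="ward")                     # hierarchical tree as a linkage matrix
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters
```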
Non-Parametric Clustering – DBSCAN
- Find the points in the ε-distance neighbourhood of every point, and identify the core points with more than the minimum number of points to form a dense region
- Find the connected components of the core points on the neighbours graph, ignoring all non-core points
- Assign each non-core point to a nearby cluster if the cluster is an ε-distance neighbour, otherwise assign it to noise
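A sketch with scikit-learn's DBSCAN (eps is the ε radius and min_samples the core-point threshold; the values shown are illustrative):

```python
# DBSCAN: density-based clustering; no number of clusters is given in advance.
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                  # cluster index per point; -1 marks noise
core_idx = db.core_sample_indices_   # indices of the core points
```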
SOM (Self-organizing Maps)
Similar to K-means, but…
* Representatives are organized (have neighbours) in a 1D or 2D mesh
* When a representative is selected to be adapted, neighbours are also updated (with a lower rate)
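A minimal NumPy sketch of one SOM update step on a 1D mesh of representatives (the Gaussian neighbourhood function, learning rate, and names are illustrative assumptions):

```python
# One SOM training step: the best-matching representative and its mesh
# neighbours are pulled toward the input, neighbours at a lower rate.
import numpy as np

def som_step(weights, x, lr=0.1, sigma=1.0):
    # weights: (n_nodes, d) representatives arranged on a 1D mesh; x: (d,) input
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching unit
    mesh_dist = np.abs(np.arange(len(weights)) - bmu)      # distance along the mesh
    h = np.exp(-(mesh_dist ** 2) / (2 * sigma ** 2))       # neighbourhood weighting
    return weights + lr * h[:, None] * (x - weights)       # update BMU and neighbours
```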
Dimensionality Reduction
- Representing data with fewer dimensions makes learning easier (fewer parameters)
- Discover intrinsic dimensionality of data
- High dimensional data that is really lower dimensional
- Noise reduction
Dimensionality Reduction: Feature selection
Choose a (the best) subset of all the features
Dimensionality Reduction: Feature extraction
Create new features by combining existing ones
Feature reduction: lower-dimensional projections
Instead of picking a subset of the features, consider ‘new’ ones by combining some of the existing features
Lower Dimensional Projections
Project n-dimensional data into a k-dimensional space:
* preserving information
* maximizing variance
* with minimum reconstruction error
Principal Component Analysis (PCA)
Projection that best represents the data in
the least-squares sense
Smaller projection errors;
smaller overall sum of squares
Principal Component Analysis: Algorithm
Considering an m×n data matrix X:
- Re-center: subtract the mean from X: X_C ← X − µ
- Compute the covariance matrix: C ← (1/m) · X_C^T · X_C
- Compute the k largest eigenvectors of C: the Principal Components E
- y = E^T · X_C is the closest PCA approximation to X
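The same steps written out with NumPy, using the rows-as-samples convention (so the projection is X_C · E rather than E^T · X_C); the function name and return values are illustrative:

```python
# PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                         # re-center the data
    C = (Xc.T @ Xc) / len(Xc)                       # covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh: C is symmetric
    E = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k largest eigenvectors
    return Xc @ E, E                                # projected data (m x k) and components
```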
Principal Component Analysis: Difficulties
- Covariance matrix can (still) be huge!
- Finding eigenvectors can be a slow process
Topic Models
- Used in Natural Language Processing
- Use different mathematical frameworks:
Linear algebra vs. probabilistic modeling
Topic Models share 3 fundamental assumptions
- Documents have a latent semantic structure (“topics”)
- Can infer topics from word-document co-occurrences
- Words are related to topics, topics to documents
Latent Semantic Analysis (linear algebra-based topic model)
Extension of the Vector Space Model (words × documents)
- Reduces the high-dimensional vector space into a smaller space by collapsing the original representation into a smaller set of latent dimensions (“topics”) that aggregates words and documents with similar context vectors
- Based on the Singular Value Decomposition (SVD)
- Entropy-based term weighting
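A rough LSA-style sketch with scikit-learn, using TF-IDF weighting as a stand-in for the entropy-based weighting mentioned above (`docs` is an assumed list of document strings; 50 topics is illustrative):

```python
# LSA: weight the word-document matrix, then reduce it with a truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

X = TfidfVectorizer().fit_transform(docs)       # documents x words, sparse
svd = TruncatedSVD(n_components=50, random_state=0)
doc_topics = svd.fit_transform(X)               # documents mapped to 50 latent "topics"
word_topics = svd.components_                   # topics expressed over the vocabulary
```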
Latent Dirichlet Allocation: probabilistic topic model
Documents are represented as random mixtures over latent topics
Topics are characterized by a distribution over words
Words can belong to multiple topics
* z is the topic of a word in the document
* β is the word probability distribution of each topic
A document can have multiple topics:
* θ is the topic distribution of a document
* α is the Dirichlet prior over the per-document topic distributions (a distribution over distributions)
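A sketch of fitting LDA with scikit-learn on raw term counts (`docs` is an assumed list of document strings; 10 topics is an illustrative choice):

```python
# LDA: documents as mixtures over latent topics, topics as distributions over words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

counts = CountVectorizer().fit_transform(docs)      # word counts, not TF-IDF
lda = LatentDirichletAllocation(n_components=10, random_state=0)
theta = lda.fit_transform(counts)                   # per-document topic distributions (θ)
beta = lda.components_                              # per-topic word weights (unnormalized β)
```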