Unsupervised Learning Flashcards
reason for clustering
categorize data without labels
cluster searching
- search for closest cluster
- search within the cluster
** Note: much faster than searching every document directly (see the sketch below)
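A minimal sketch of this two-stage search in numpy, assuming documents are already embedded as vectors and assigned to precomputed k-means clusters (names like `doc_vecs` and `assignments` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 16))       # 1000 "documents" as vectors
centers = rng.normal(size=(10, 16))          # 10 precomputed cluster centers
# which cluster each document belongs to (nearest center)
assignments = np.argmin(((doc_vecs[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)

def search(query):
    # 1) find the closest cluster center
    c = np.argmin(((centers - query) ** 2).sum(-1))
    # 2) search only the documents inside that cluster
    idx = np.where(assignments == c)[0]
    return idx[np.argmin(((doc_vecs[idx] - query) ** 2).sum(-1))]

print(search(rng.normal(size=16)))           # index of the closest document found
```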
k-means training
- initialize k random centers
- assign each point to its nearest center
- recalculate each center as the mean of its assigned points
- repeat until the assignments stop changing (sketch below)
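A minimal k-means sketch in numpy (illustrative, not an optimized or library implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # recalculate each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # assignments have stabilized
            break
        centers = new_centers
    return centers, labels

X = np.random.default_rng(1).normal(size=(300, 2))
centers, labels = kmeans(X, k=3)
```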
responsibility
- reflects confidence that a point belongs to a given cluster
fuzzy k-means (soft k-means) algorithm
- initialize k random centers
- calculate responsibilities (softmax over negative distances)
- recalculate centers as responsibility-weighted means (hard k-means is the special case where each r is 0 or 1); sketch below
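A soft k-means sketch in numpy; the stiffness `beta` is an assumed hyperparameter controlling how "hard" the responsibilities are:

```python
import numpy as np

def soft_kmeans(X, k, beta=3.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)    # squared distances, shape (N, k)
        logits = -beta * d2
        r = np.exp(logits - logits.max(axis=1, keepdims=True)) # softmax responsibilities
        r /= r.sum(axis=1, keepdims=True)
        centers = (r.T @ X) / r.sum(axis=0)[:, None]           # responsibility-weighted means
    return centers, r

X = np.random.default_rng(1).normal(size=(200, 2))
centers, R = soft_kmeans(X, k=3)   # each row of R sums to 1
```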
k-means cost function
- Euclidean - sum of squared distances to the centers; sensitive to scale and to K (K=N -> 0 cost)
- Davies-Bouldin index - rewards low intra-cluster scatter and high inter-cluster separation; lower is better (see the sketch below)
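Both costs can be computed with scikit-learn (assuming it is available):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.default_rng(0).normal(size=(300, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Euclidean cost (inertia):", km.inertia_)                       # sum of squared distances
print("Davies-Bouldin index:", davies_bouldin_score(X, km.labels_))   # lower is better
```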
where k-means fails
- donut
- asymmetric (high-variance) gaussians
- a smaller cluster inside a bigger one
k-means problems
- choice of k
- can converge to a poor local minimum
- sensitive to the initial centers
- can only find spherical clusters
- does not take density into account
ideal k value
- where the steep part of the cost curve meets the flat part (the "elbow"); see the sketch below
- not at the minimum (the cost keeps dropping all the way to k = N)
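An elbow-method sketch: fit k-means for several k, inspect the cost, and pick the elbow rather than the minimum:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
for k in range(1, 9):
    cost = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(cost, 1))    # look for where the steep drop flattens out
```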
hierarchical (agglomerative) clustering
- greedy algorithm
- repeatedly merge the two closest clusters until one group contains all of the points
- cut the tree by thresholding the intra-cluster (merge) distance to get the final clusters
joining cluster
- single linkage: min distance between any two points
- complete linkage: max distance b/w any two points
- mean distance (UPGMA): mean distance between all pts
- Ward's: minimize the increase in within-cluster variance (see the sketch below)
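A SciPy sketch of agglomerative clustering with the linkage options above ("average" corresponds to UPGMA) and a distance threshold to cut the tree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(50, 2))
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                        # greedy merging, closest clusters first
    labels = fcluster(Z, t=3.0, criterion="distance")    # threshold the merge distance
    print(method, "->", len(np.unique(labels)), "clusters")
# dendrogram(Z) draws the merge tree (needs matplotlib)
```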
dendrogram
- tree diagram of the merges; the height of each join is the distance at which the two clusters were merged
distance equations
valid distance metrics
- non-negative: d(x, y) ≥ 0
- identity: d(x, y) = 0 <=> x = y (definite)
- symmetry: d(x, y) = d(y, x)
- subadditive (triangle inequality): d(x, z) ≤ d(x, y) + d(y, z); see the check below
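A quick numerical spot-check of the four axioms for Euclidean distance (a check on random samples, not a proof):

```python
import numpy as np

def d(x, y):
    return np.linalg.norm(x - y)

x, y, z = np.random.default_rng(0).normal(size=(3, 5))
assert d(x, y) >= 0                       # non-negative
assert d(x, x) == 0 and d(x, y) > 0       # identity (x != y here)
assert np.isclose(d(x, y), d(y, x))       # symmetry
assert d(x, z) <= d(x, y) + d(y, z)       # subadditive / triangle inequality
```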
self-organized maps
- apply competitive learning as opposed to error-correction learning
- reduces the dimension to a discretized (grid) representation of the input space
- this representation is called a map (sketch below)
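A minimal SOM sketch in numpy: each input pulls its best-matching unit and that unit's grid neighbors toward it (competitive learning). Grid size, learning rate and neighborhood width are assumed hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                                     # input data
grid = np.array([(i, j) for i in range(10) for j in range(10)])   # 10x10 map coordinates
W = rng.normal(size=(100, 3))                                     # one weight vector per map unit

lr, sigma = 0.5, 2.0
for epoch in range(20):
    for x in X:
        bmu = np.argmin(((W - x) ** 2).sum(-1))            # best-matching unit (competition)
        grid_d2 = ((grid - grid[bmu]) ** 2).sum(-1)        # distance on the 2-D map grid
        h = np.exp(-grid_d2 / (2 * sigma ** 2))            # neighborhood function
        W += lr * h[:, None] * (x - W)                     # pull the BMU and its neighbors toward x
    lr *= 0.9                                              # decay learning rate and neighborhood
    sigma *= 0.9
```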
gaussian mixture model (GMM)
- form of density estimation
- gives approximation of the probability distribution
- use when the data look multimodal
- p(x) = π_1 N(x | µ_1, Σ_1) + π_2 N(x | µ_2, Σ_2)
- π_k = probability that x belongs to Gaussian k
- the π_k sum to 1
- introduce a latent variable Z for which π_k = p(Z = k); see the sketch below
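A GMM density-estimation sketch with scikit-learn on assumed bimodal data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(200, 2)),   # two modes -> multimodal data
               rng.normal(+3, 1, size=(200, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("mixing weights π:", gmm.weights_)           # sum to 1
print("means:", gmm.means_)
print("log p(x) at the origin:", gmm.score_samples(np.array([[0.0, 0.0]])))
```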
GMM algorithm
- initialize π, µ, Σ for each component
- E-step: compute responsibilities from the current parameters
- M-step: re-estimate π, µ, Σ from the responsibilities
- repeat until the likelihood converges
responsibility k-means v. GMM
- soft k-means computes responsibilities with a softmax over (negative) distances
- GMM computes responsibilities as π_k times the Gaussian density, normalized over the components
independent component analysis (ICA)
- special type of blind source separation
- cocktail party problem of listening in on one person’s speech in a noisy room
- data: x=(x1,…,xm)T
- hidden components: s=(s1,…,sn)T
- mixing matrix: A, n by n
- unmixing matrix recovers the sources: W = A⁻¹, s = Wx
- generative model (noiseless): x=As
- noisy: x = As + n (see the sketch below)
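A blind-source-separation sketch with scikit-learn's FastICA, mixing two assumed toy sources and recovering them (up to scale and order):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # hidden sources s
A = rng.normal(size=(2, 2))                        # mixing matrix A
X = S @ A.T                                        # observed mixtures x = As
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                       # recovered sources
W = ica.components_                                # unmixing matrix (≈ A⁻¹ up to scale/permutation)
```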
benefits of GMM over fuzzy k-means
- GMM learns a separate mixing weight π_k per cluster, as opposed to the single stiffness parameter β in fuzzy k-means
- GMM can model elliptical clusters, whereas fuzzy k-means assumes spherical ones
GMM v. Fuzzy K-means
singular covariance problem
- when a component collapses onto a single point (or a few very close points), its covariance approaches 0
- inverting that covariance means dividing by ~0, i.e. a singularity
- this is a degenerate local optimum of the likelihood
ways of dealing with singular covariance problem
- diagonal covariance - a solution to the singular covariance problem; also speeds up computation (the inverse is easy to take)
- spherical gaussian - a single variance per component (covariance = σ²I)
- shared covariance - each gaussian has the same covariance (see the sketch below)
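These constraints map onto scikit-learn's `covariance_type` options for GaussianMixture (sketch on assumed random data); `reg_covar` is a further guard that adds a small value to the diagonal:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))
for cov in ["full", "diag", "spherical", "tied"]:        # tied = shared covariance
    gmm = GaussianMixture(n_components=3, covariance_type=cov,
                          reg_covar=1e-6, random_state=0).fit(X)
    print(cov, "avg log-likelihood:", round(gmm.score(X), 3))
```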
diagonal covariance
- covariance matrix with only the per-dimension variances on the diagonal; all off-diagonal entries are 0
kernel density estimation
fitting PDFs by placing a kernel at every data point; like a GMM with one fixed-width component per point (sketch below)
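A KDE sketch with scikit-learn; the bandwidth is the assumed smoothing parameter:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.default_rng(0).normal(size=(200, 1))
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)   # one kernel per data point
grid = np.linspace(-4, 4, 5)[:, None]
print(np.exp(kde.score_samples(grid)))                         # estimated p(x) at a few points
```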
expectation-maximization (EM)
- a generalization of the fitting procedure used for GMMs
- works with latent (hidden) variables
- alternates two update steps, like coordinate ascent on a lower bound of the likelihood
- each iteration is guaranteed not to decrease the likelihood (converges to a local maximum); sketch below
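A hand-rolled EM sketch for a 1-D, two-component GMM, to show the alternating E-step / M-step structure (initial parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 300)])

pi = np.array([0.5, 0.5])      # mixing weights
mu = np.array([-1.0, 1.0])     # means
var = np.array([1.0, 1.0])     # variances

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities under the current parameters (the latent Z)
    dens = pi * normal_pdf(x[:, None], mu, var)     # shape (N, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)   # the log-likelihood never decreases across iterations
```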
dimensionality reduction
1) cleaner signal; 2) decreased computational load -> reduced time; 3) allows visualization of the data
singular value decomposition (SVD)
- factor X = U S Vᵀ; one way to compute PCA (the right singular vectors are the principal directions); sketch below
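A numpy SVD sketch, connecting it to PCA on a centered data matrix:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
X = X - X.mean(axis=0)                        # center the data
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, s.shape, Vt.shape)             # (100, 5) (5,) (5, 5)
Z = X @ Vt.T                                  # same rotation PCA uses (Q = V)
```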
PCA
- Z=XQ
- rotates the data into the eigenvector basis of the covariance; an orthogonal transform, so lengths are preserved
- high variance = signal
- low variance = noise
- keep the left-most columns of Z (enough to explain ~95% of the variance); see the sketch below
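A PCA sketch with scikit-learn, keeping enough components to explain ~95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10))
pca = PCA(n_components=0.95)                  # a float keeps components covering that variance fraction
Z = pca.fit_transform(X)                      # Z = XQ in the rotated (eigenvector) basis
print(Z.shape, pca.explained_variance_ratio_.sum())
```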