Unsupervised Learning Flashcards
reason for clustering
categorize data without labels
cluster searching
- search for closest cluster
- search within the cluster
** Note: much faster than searching each document
k-means training
- initialize k random centers
- assign each point to its nearest center
- recalculate the cluster centers as the mean of their assigned points (numpy sketch below)
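A minimal numpy sketch of the loop above (random init, assign, re-average); `k`, the iteration cap, and the convergence check are illustrative choices, not from the card.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Hard k-means: random centers, assign points, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init: k random points
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return centers, labels
```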
responsibility
- r(n, k): how much cluster k is "responsible" for point n; a soft assignment weight in [0, 1] that sums to 1 over the clusters
fuzzy k-means (soft k-means) algorithm
- reflects confidence in cluster
- initialize k random centers
- calculate responsibilities: a softmax over (negative, scaled) squared distances
- recalculate centers as responsibility-weighted means (hard k-means is the special case where r is 0 or 1; sketch below)
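A sketch of the soft-assignment step, assuming a stiffness parameter `beta` (my notation, not on the card): responsibilities are a softmax over negative squared distances, and centers become responsibility-weighted means.

```python
import numpy as np

def responsibilities(X, centers, beta=1.0):
    """r[n, k] is proportional to exp(-beta * ||x_n - c_k||^2), rows sum to 1."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def update_centers(X, r):
    """Each center is the responsibility-weighted mean of all points."""
    return (r.T @ X) / r.sum(axis=0)[:, None]
```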
k-means cost function
- Euclidean (sum of squared distances to assigned centers): sensitive to feature scale and to the choice of K; K = N drives the cost to 0
- Davies-Bouldin index: rewards low within-cluster (intra) spread and high between-cluster (inter) separation; lower is better
where k-means fails
- donut
- asymmetric (high-variance, elongated) gaussian
- smaller cluster in bigger one
k-means problems
- choice of k
- cost is non-convex, so it can converge to a bad local minimum
- sensitive to initial clusters
- can only find spherical clusters
- does not take density into account
ideal k value
- the "elbow": where the steep drop in cost meets the flat part of the cost-vs-K curve
- not at the minimum of the cost (that would be K = N)
hierarchical (agglomerative) clustering
- greedy algorithm
- start with every point as its own cluster and repeatedly merge the two closest clusters until one group contains all points
- cut the tree at an intra-cluster distance threshold to get the final clusters
joining clusters (linkage)
- single linkage: min distance between any two points
- complete linkage: max distance between any two points
- mean distance (UPGMA): mean distance between all pairs of points
- Ward's: merge the pair that minimizes the increase in within-cluster variance
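A sketch of agglomerative clustering with SciPy covering the linkage choices above; the toy data and the distance threshold are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(50, 2)                          # toy data
Z = linkage(X, method='ward')                       # or 'single', 'complete', 'average' (UPGMA)
labels = fcluster(Z, t=5.0, criterion='distance')   # cut the tree at a distance threshold
# dendrogram(Z)                                     # plots the merge tree (needs matplotlib)
```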
dendrogram
- tree diagram of the merges; the height of each join is the distance at which the two clusters were merged
- cutting it at a height/threshold gives a flat clustering
distances equations
- Euclidean: d(x, y) = sqrt(∑i (xi − yi)²)
- squared Euclidean: ∑i (xi − yi)²
- Manhattan: ∑i |xi − yi|
- cosine distance: 1 − (x·y)/(‖x‖‖y‖)
valid distance metrics
- non-negative: d(x, y) ≥ 0, positive
- identity: d(x, y) = 0 <=> x = y, definite
- symmetry: d(x, y) = d(y, x)
- subadditive: d(x, z) ≤ d(x, y) + d(y, z)
self-organizing maps (SOM)
- apply competitive learning as opposed to error-correction learning
- reduce the input to a discretized, low-dimensional (grid) representation
- that grid is called a map
gaussian mixture model (GMM)
- form of density estimation
- gives an approximation of the probability distribution
- use when you notice multimodal data
- p(x) = π1 N(x | µ1, Σ1) + π2 N(x | µ2, Σ2); in general p(x) = ∑k πk N(x | µk, Σk)
- πk = probability that x belongs to Gaussian k
- the πk sum to 1
- introduce a latent Z for which πk = p(Z = k)
GMM algorithm
- initialize πk, µk, Σk (e.g., random means, uniform π)
- E-step: responsibilities r(n, k) = πk N(xn | µk, Σk) / ∑j πj N(xn | µj, Σj)
- M-step: Nk = ∑n r(n, k); πk = Nk/N; µk = ∑n r(n, k) xn / Nk; Σk = ∑n r(n, k)(xn − µk)(xn − µk)ᵀ / Nk
- repeat until the log-likelihood stops improving (sketch below)
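A minimal EM sketch for a GMM with numpy and SciPy's multivariate normal; the small `reg` term added to each covariance is my addition to guard against the singular-covariance problem described below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=100, reg=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                        # mixing weights
    mu = X[rng.choice(N, size=k, replace=False)]    # init means at random data points
    sigma = np.array([np.cov(X.T) + reg * np.eye(D) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: r[n, j] = pi_j N(x_n | mu_j, sigma_j), normalized over j
        r = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, sigma from the responsibilities
        nk = r.sum(axis=0)
        pi = nk / N
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(D)
    return pi, mu, sigma, r
```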
responsibility k-means v. GMM
- fuzzy k-means: softmax over (negative, scaled) squared distances
- GMM: responsibility = πk N(x | µk, Σk), normalized across the k clusters
independent component analysis (ICA)
- special type of blind source separation
- cocktail party problem of listening in on one person’s speech in a noisy room
- data: x=(x1,…,xm)T
- hidden components: s=(s1,…,sn)T
- mixing matrix: A, n by n
- unmixing matrix recovers the sources: W = A⁻¹, so s = Wx
- generative model (noiseless): x=As
- noisy: x=As+n
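A cocktail-party-style sketch assuming scikit-learn's FastICA is available; the two source signals and the mixing matrix are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # hidden sources s (n = 2)
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                         # mixing matrix A
x = s @ A.T                                        # observations x = As

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)    # recovered sources (up to permutation/scale)
W = ica.components_             # estimated unmixing matrix, roughly A^{-1}
```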
benefits of GMM over fuzzy k-means
- GMM learns a separate πk per cluster, whereas fuzzy k-means has a single shared stiffness parameter β
- GMM can handle elliptical clusters (full covariance), whereas fuzzy k-means clusters are effectively spherical
GMM v. Fuzzy K-means
- GMM: per-cluster πk and covariance Σk (elliptical clusters); responsibilities from πk N(x | µk, Σk)
- fuzzy k-means: one β, Euclidean distance only (spherical clusters); responsibilities from a softmax over distances
singular covariance problem
- when the points assigned to a Gaussian are (nearly) identical or too few, its covariance collapses toward 0
- inverting that covariance is a division by ~0: a singularity
- a degenerate solution the optimization can fall into (like a bad local optimum)
ways of dealing with singular covariance problem
- diagonal covariance
  - a solution to singular covariance
  - speeds up computation (easy to take the inverse)
- spherical gaussian
- shared covariance
  - each gaussian has the same covariance
diagonal covariance
- covariance matrix with zeros off the diagonal: only per-dimension variances
- assumes the features are uncorrelated within a cluster; D parameters instead of D(D+1)/2, and the inverse is trivial
kernel density estimation
fit a PDF by placing a kernel (e.g., a Gaussian) on every data point and summing; a GMM can also be used for density estimation
expectation-maximization (EM)
- generalization of the GMM fitting procedure to any model with latent variables
- alternates E and M steps, like coordinate ascent on a lower bound of the likelihood
- guarantees the likelihood never decreases each iteration (converges to a local maximum-likelihood solution)
dimensionality reduction
1) cleaner signal; 2) fewer dimensions -> less computation and faster training; 3) allows visualization of the data
singular value decomposition (SVD)
- X = U S Vᵀ; U and V have orthonormal columns, S is diagonal
- the singular values (diagonal of S) measure how much variance each component carries
PCA
- Z = XQ
- Q holds the eigenvectors of the covariance of X, so the transform is a rotation into the principal axes (lengths are preserved)
- high variance = signal
- low variance = noise
- keep the left-most columns of Z (enough to retain ~95% of the variance)
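A numpy sketch in the card's notation (Z = XQ with Q the covariance eigenvectors); the 95% variance threshold follows the card.

```python
import numpy as np

def pca(X, var_to_keep=0.95):
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)              # D x D covariance
    vals, Q = np.linalg.eigh(cov)               # eigen-decomposition (ascending)
    order = np.argsort(vals)[::-1]              # sort by decreasing variance
    vals, Q = vals[order], Q[:, order]
    Z = Xc @ Q                                  # rotate into the principal axes
    frac = np.cumsum(vals) / vals.sum()
    d = np.searchsorted(frac, var_to_keep) + 1  # smallest d reaching ~95% variance
    return Z[:, :d], Q[:, :d], vals
```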
how does PCA help?
- cleaner signal
- decreased computational load -> reduced time
- allow visualization of data
naive bayes connection to PCA
PCA decorrelates the features, which makes naive Bayes' independence assumption (approximately) hold
greedy layer-wise pretraining
- helps a later supervised learning stage
- each layer is trained on its own (unsupervised) objective
- after pretraining, only a few epochs are needed to fine-tune
autoencoder
- J = ‖X − s(s(XW)Wᵀ)‖², where s is the activation (e.g., sigmoid)
- non-linear PCA
- a linear activation gives PCA
- predicts itself (auto = self)
- squared error when x can take any real value
- cross-entropy works for 0 ≤ x ≤ 1
- different biases for the hidden and output layers
- shared weights W and Wᵀ (a type of regularization)
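A sketch of a shared-weight autoencoder assuming PyTorch; sigmoid output plus squared error mirrors the card's J = ‖X − s(s(XW)Wᵀ)‖² (PyTorch's `linear` applies Wᵀ, so the algebra is transposed but equivalent). Sizes, data, and optimizer settings are placeholders.

```python
import torch
import torch.nn.functional as F

class TiedAutoencoder(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)  # shared weights
        self.b_h = torch.nn.Parameter(torch.zeros(d_hidden))   # hidden bias
        self.b_o = torch.nn.Parameter(torch.zeros(d_in))       # output bias (separate)

    def forward(self, x):
        z = torch.sigmoid(F.linear(x, self.W, self.b_h))          # encode
        return torch.sigmoid(F.linear(z, self.W.t(), self.b_o))   # decode with tied W

model = TiedAutoencoder(d_in=784, d_hidden=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.rand(256, 784)                 # placeholder data in [0, 1]
for _ in range(10):
    x_hat = model(X)                     # the target is the input itself
    loss = ((X - x_hat) ** 2).mean()     # squared error; cross-entropy also works for x in [0, 1]
    opt.zero_grad()
    loss.backward()
    opt.step()
```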
types of autoencoders
- Denoising autoencoder
- Sparse autoencoder
- Variational autoencoder (VAE)
- Contractive autoencoder (CAE)
bottleneck network
- find useful and efficient representation
- compress the V-dimensional input down to a smaller D-dimensional code in the encoding phase
stacking
- we only care about the "inner" (hidden) representation
- discard the decoder (second) half each time
- each layer smaller in size
- train an autoencoder on X with HL output Z;
- train a 2nd autoencoder on Z;
- repeat
denoising autoencoder
- an autoencoder trained on input with information deliberately removed
- addresses the identity-function risk by randomly corrupting the input
- it's like training with dropout on the inputs
- missing data is not always well represented by 0
- mask the cost in backpropagation so only observed entries contribute

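A small numpy sketch of the corruption and masked-cost ideas from the card above; `p_drop` is an arbitrary corruption rate.

```python
import numpy as np

def corrupt(X, p_drop=0.3, seed=0):
    """Randomly zero out a fraction of the inputs (the corruption/noise)."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(X.shape) > p_drop).astype(float)   # 1 = kept, 0 = corrupted
    return X * mask, mask

def masked_squared_error(X, X_hat, mask):
    """Only penalize reconstruction where the entry was actually observed."""
    return np.sum(mask * (X - X_hat) ** 2) / mask.sum()
```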
sparse autoencoder
- without a sparsity constraint, the hidden units in the middle layer fire (activate) for most training samples
- sparsity constraint: lower the activation rate so that they only activate for a small fraction of the training examples
- sparse because each unit only activates to a certain type of inputs
how to reproduce missing data
- use an autoencoder or a similar model (e.g., an RBM)

recommender systems
- google, youtube, netflix
- use RBMs and autoencoders
naming convention
- Recommender
  - N is instances
  - M is movies
  - K is latent variables
- Typical Deep Learning
  - N is instances (same)
  - M is hidden units
  - K is number of classes
  - D is input dimensions
boltzmann machine
- everything is connected to everything
- only works on trivial problems, unlike the RBM
- training is intractable
restricted boltzmann machine (RBM)
- bipartite graph
- produces visible vector from hidden and vice versa
- hidden units are conditionally independent given the visible units (and vice versa)
- use sigmoid (to keep output 0-1)
- act just like autoencoders
- find latent variables
- greedy pretraining
- training is intractable, therefore it needs to be approximated
- hidden units are only connected to visible units (no connections within a layer)
- square error for all X
- cross-entropy works for 0 ≤ x ≤ 1
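The card only says training is intractable and must be approximated; contrastive divergence (CD-1) is the usual approximation, sketched here for a Bernoulli RBM in numpy. Shapes and learning rate are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, b_v, b_h, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM.
    v0: (batch, D) visible data; W: (D, M); b_v: (D,); b_h: (M,)."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + b_h)                       # p(h=1 | v), "up" pass
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample the hidden units
    pv1 = sigmoid(h0 @ W.T + b_v)                     # reconstruct the visibles, "down" pass
    ph1 = sigmoid(pv1 @ W + b_h)                      # "up" again from the reconstruction
    # approximate gradient: positive phase minus negative phase
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```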
categorical RBM
- RBM whose visible units are K-way categorical (one-hot) instead of binary, e.g., a 5-star rating per movie
- uses a softmax over the K categories for each visible variable; hidden units stay Bernoulli
categorical RBM shapes
- h = M
- V = D x K
  - D is number of movies
  - K is the number of classes
- W = D x K x M
- b = D x K
- x = M
deep belief network
deep network of stacked RBMs
latent variable
variable that is hidden
markov random field
- graph of states
- each node is a state
- each node's state depends only on its adjacent (neighboring) nodes
relaxed Bernoulli constraint
- in an RBM, instead of thresholding unit values to 0 or 1
- let the values take on whatever (real) values the activations give
hidden markov model
- unsupervised learning method
- models sequences & classifies them
- combined with Bayes' rule to create a separate model for each class
- choose the class with the highest likelihood
HMM applications
- recognizing male vs. female voices
- NLP next-word prediction, e.g. "Macbook was created by ___": p(x | x₋₁, x₋₂)
- stock prediction
- SEO and bounce-rate optimization
- speech to text (hidden words/phonemes generate the observed sounds)
markov property
- p(xt | all previous x) = p(xt | xt−1)
- strong assumption
joint probability
- chain rule of probability
- p(s3, s2, s1) = p(s3 | s2, s1) p(s2 | s1) p(s1); with the Markov property this becomes p(s3 | s2) p(s2 | s1) p(s1)
markov order
- 1st order - p(y2 | y1)
- 2nd order - p(y3 | y2, y1)
- 3rd order - p(y4 | y3, y2, y1)
markov model
- M states
- π = 1 x M row vector of initial state probabilities
- A = M x M transition matrix
- each row sums to 1: A(i, :).sum() = 1
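A sketch sampling a sequence from (π, A) with numpy; the 3-state chain below is made up.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])            # initial state distribution (sums to 1)
A = np.array([[0.9, 0.05, 0.05],          # A[i, j] = p(next = j | current = i)
              [0.1, 0.8,  0.1],
              [0.2, 0.2,  0.6]])          # each row sums to 1

rng = np.random.default_rng(0)
state = rng.choice(3, p=pi)               # draw the first state from pi
seq = [state]
for _ in range(9):
    state = rng.choice(3, p=A[state])     # transition using row `state` of A
    seq.append(state)
print(seq)
```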
smoothing
- add smoothing to account for unobserved
- p = (count(x)+λ)/(N+λV), where V is the number of possible values
- can smooth average ratings too
- r = (∑xi+λµ0)/(N+λ), shrinking toward a prior/global mean µ0
HMM model
- πi = probability of starting in state i
- A(i, j)
  - state transition matrix
  - probability of going to state j from state i
  - e.g., p(tag1 | tag0)
  - if each row sums to 1, A is a Markov (stochastic) matrix
- B(j, k)
  - observation probability matrix
  - probability of observing k given state j
  - e.g., p(word | tag)
doubly embedded (stochastic process)
- layer 1: a Markov chain over the hidden states
- layer 2: each hidden state chooses/emits an observation
forward, forward-backward algorithm
- α(t, i) = p(x1..xt, zt = i)
- init: α(1, i) = πi B(i, x1)
- recursion: α(t, j) = [∑i α(t−1, i) A(i, j)] B(j, xt)
- termination: p(x) = ∑i α(T, i)
backward, forward-backward algorithm
- β(t, i) = p(xt+1..xT | zt = i)
- init: β(T, i) = 1
- recursion: β(t, i) = ∑j A(i, j) B(j, xt+1) β(t+1, j)
- used together with α for smoothing and for the Baum-Welch (EM) updates
forward-backward code (numpy sketch below)
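A numpy sketch of the forward and backward passes in the α/β notation above; scaling is omitted for brevity, so it can underflow on long sequences.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, i] = p(x_1..x_t, z_t = i)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha                              # p(x) = alpha[-1].sum()

def backward(x, A, B):
    """beta[t, i] = p(x_{t+1}..x_T | z_t = i)."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```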
viterbi algorithm
- find the most probable hidden state sequence given the observed sequence
- the same recursion as the forward algorithm, except taking the max instead of the sum
- delta variable is like alpha
- psi is similar to delta but argmax instead and no B term
- have to backtrack the states
viterbi initialization
- δ(1, i) = πi B(i, x1); ψ(1, i) = 0
viterbi recursion
- δ(t, j) = maxi [δ(t−1, i) A(i, j)] · B(j, xt)
- ψ(t, j) = argmaxi [δ(t−1, i) A(i, j)] (no B term)
viterbi termination
- p* = maxi δ(T, i); last state = argmaxi δ(T, i)
- backtrack: state(t) = ψ(t+1, state(t+1)) (sketch below)
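The initialization/recursion/termination above as a numpy sketch (no log-space scaling, for brevity).

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable hidden state sequence for observations x."""
    T, M = len(x), len(pi)
    delta = np.zeros((T, M))
    psi = np.zeros((T, M), dtype=int)
    delta[0] = pi * B[:, x[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) A(i, j)
        psi[t] = scores.argmax(axis=0)              # best previous state for each j (no B term)
        delta[t] = scores.max(axis=0) * B[:, x[t]]  # recursion
    # termination + backtracking
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states, delta[-1].max()
```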
Baum-Welch algorithm
- similar in spirit to fitting a Gaussian mixture model
- uses expectation-maximization (an iterative algorithm) instead of a closed-form maximum-likelihood solution
- review at some other date, it is a very expensive algorithm and should be avoided if possible
variational autoencoder
- Bayesian ML + Deep Learning
generative adversarial network (GAN)
- 2 networks competing against each other: a generator produces fake samples and a discriminator tries to tell real from fake
generative versus discriminative
- generative: models P(X|Y) (and the joint distribution)
- discriminative: models P(Y|X) directly
- hard to tell what is happening with the weights
- not satisfying to a statistician
- research shows that discriminative models usually work better
types of recommender systems
- autorec: uses an autoencoder to predict (recommend) the missing ratings
- non-personalized:
  - shows popular items
  - confidence
  - takes time into account
- product associations
  - recommends items that are frequently bought/viewed together
- collaborative filtering: what YOU like and what OTHERS similar to you like
matrix factorization
- SVD is an example of matrix factorization: X = W S Uᵀ
- S is diagonal matrix of the variances
- more variance = more important
- we only want square error where rating is known
- break up X into components for less parameters stored
- use K size ~50-100
- can also factor log-transformed counts: Aij = log(Xij) ≈ wiᵀuj
account for age in recommender system?
- hacker news: penalty · (ups − downs − 1)^0.8 / (age + 2)^gravity
  - gravity = 1.8
  - age in hours
  - business rules (the penalty): penalize self posts, controversial posts, and posts from overly popular websites
- reddit: sign(ups − downs) · log(max(1, |ups − downs|)) + age/45000
  - age in seconds
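The two formulas above as plain Python; `penalty` stands in for whatever business-rule multiplier is applied, and the base-10 log for reddit is my assumption (the card just says "log").

```python
import math

def hacker_news_score(ups, downs, age_hours, penalty=1.0, gravity=1.8):
    # max(..., 0) is a guard I added for posts with more downs than ups
    return penalty * max(ups - downs - 1, 0) ** 0.8 / (age_hours + 2) ** gravity

def reddit_score(ups, downs, age_seconds):
    diff = ups - downs
    sign = (diff > 0) - (diff < 0)                  # +1, 0, or -1
    return sign * math.log10(max(1, abs(diff))) + age_seconds / 45000
```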
ranking with big systems
- discard users w/ few movies in common
- sum over nearest neighbors
- take the ones with the highest weights
- say k = 25 to 50
- closest absolute correlation
- closest raw correlation
- or use everybody
training matrix factorization
- J = ∑(rij − wiᵀuj)², summed over the known ratings only
- set dJ/dwi = 0: ∑_{j∈Ψi}(rij − wiᵀuj)uj = 0 ⇒ (∑_{j∈Ψi} uj ujᵀ) wi = ∑_{j∈Ψi} rij uj
- wi = (∑_{j∈Ψi} uj ujᵀ)⁻¹ ∑_{j∈Ψi} rij uj, where Ψi = items rated by user i
- symmetrically: uj = (∑_{i∈Ωj} wi wiᵀ)⁻¹ ∑_{i∈Ωj} rij wi, where Ωj = users who rated item j
- but really, rij ≈ wiᵀuj + bi + cj + µ
- bi = average over j∈Ψi of (rij − wiᵀuj − cj − µ)
- cj = average over i∈Ωj of (rij − wiᵀuj − bi − µ)
- µ = global average rating (alternating these updates is the ALS sketch below)
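An alternating-least-squares sketch matching the update equations above; it assumes a dense ratings matrix `R` with a 0/1 `mask` of known entries (the Ψ/Ω sets), skips the bias terms, and adds a small ridge term `reg` (my addition) so the solves stay well-conditioned.

```python
import numpy as np

def als(R, mask, K=50, n_iter=20, reg=1e-3, seed=0):
    """Alternating least squares for R ≈ W Uᵀ on the known entries only."""
    N, M = R.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, K)) * 0.1     # user factors w_i
    U = rng.standard_normal((M, K)) * 0.1     # item factors u_j
    I = reg * np.eye(K)
    for _ in range(n_iter):
        for i in range(N):                    # w_i = (sum_j u_j u_jᵀ)⁻¹ sum_j r_ij u_j
            known = mask[i] > 0               # Psi_i: items user i rated
            if known.any():
                Uj = U[known]
                W[i] = np.linalg.solve(Uj.T @ Uj + I, Uj.T @ R[i, known])
        for j in range(M):                    # u_j = (sum_i w_i w_iᵀ)⁻¹ sum_i r_ij w_i
            known = mask[:, j] > 0            # Omega_j: users who rated item j
            if known.any():
                Wi = W[known]
                U[j] = np.linalg.solve(Wi.T @ Wi + I, Wi.T @ R[known, j])
    return W, U
```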