Unsupervised Learning Flashcards

1
Q

reason for clustering

A

categorize data without labels

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

cluster searching

A
  1. search for closest cluster
  2. search within the cluster

** Note: much faster than searching each document

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

k-means training

A
  1. initialize k random centers
  2. calculate points in each cluster
  3. recalculate the cluster centers using the mean
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

responsibility

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

fuzzy k-means (soft k-means) algorithm

A
  • reflects confidence in cluster
  1. initialize k random centers
  2. calculate responsibilities (soft max)
  3. calculate mean responsibilities (hard k-means is where r is 0 or 1)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

k-means cost function

A
  • Euclidean - sensitive to scale and K, K=N -> 0 cost
  • Davies Bouldin index - low ‘inter’ std and high ‘intra’ std; lower better
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

where k-means fails

A
  1. donut
  2. assymetric (high var) gaussian
  3. smaller cluster in bigger one
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

k-means problems

A
  1. choice of k
  2. local minimum is bad in this case
  3. sensitive to initial clusters
  4. can only find spherical clusters
  5. does take density into account
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

ideal k value

A
  • where the steep meets the flat
  • not at the minimum
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

hierarchical (agglomerative) clustering

A
  • greedy algorithm
  • merge two closest points until you have 1 group with all of the points
  • threshold the cluster intra distance
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

joining cluster

A
  1. single linkage: min distance between any two points
  2. complete linkage: max distance b/w any two points
  3. mean distance (UPGMA): mean distance between all pts
  4. Wards: minimize increase in variance
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

dendrogram

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

distances equations

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

valid distance metrics

A
  1. non-negative: d(x, y) ≥ 0, positive
  2. identity: (x, y) = 0 <=> x=y, definite
  3. symmetry: d(x, y) = d(y, x)
  4. subadditive: d(x, z) ≤ d(x, y) + d(y, z)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

self-organized maps

A
  • apply competitive learning as opposed to error-correction learning
  • dimension reduces to discretized representation of the input
  • called a map
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

gaussian mixure model (GMM)

A
  • form of density estimation
  • gives approximation of the probability distribution
  • use when notice multimodal data
  • p(x)=π1N(µ1,∑ 1)1+π2N(µ2,∑ 2)2
  • π=probility x belongs to to Gaussian k
  • π sums to 1
  • introduce latent Z for which π=p(Z=k)
    *
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

GMM algorithm

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

responsibility k-means v. GMM

A
  • k-mean cluster use of softmax
  • GMM uses normal times probabilities
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

independent component analysis (ICA)

A
  • special type of blind source separation
  • cocktail party problem of listening in on one person’s speech in a noisy room
  • data: x=(x1,…,xm)T
  • hidden components: s=(s1,…,sn)T
  • mixing matrix: W n by n
  • unmixing matrix recovers source: W=A-1.
  • generative model (noiseless): x=As
  • noisy: x=As+n
    *
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

benefits of GMM over fuzzy k-means

A
  • GMM has multiple π as opposed to one ß
  • it can handle elliptical whereas fuzzy k-means needs to be spherical
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

GMM v. Fuzzy K-means

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

singular covariance problem

A
  • close points approach 0
    • division by 0 is singularity
  • local minimum
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

ways of dealing with singular covariance problem

A
  1. diagonal covariance -
    1. solution to singular covariance
    2. speed up computation (easy to take inverse)
  2. spherical gaussian
  3. shared covariance
  • each gaussian has the same covariance
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

diagonal covariance

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

kernel density estimation

A

fitting PDFs with kernels; can use GMM

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

expectation-maximization (EM)

A
  • general version of GMM
  • latent variables
  • coordinated descent
  • increasing likelihood guarantee (max likelihood)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

dimensionality reduction

A

1) cleaner signal; 2) decreased transformation load -> reduced time; 3) allow visualization of data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

single value decomposition (SVD)

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

PCA

A
  • Z=XQ
  • rotates the data (changes eigenvector and values), doesn’t change the length
  • high variance = signal
  • low variance = noise
  • keep left most columns of Z (~95 %)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

how does PCA help?

A
  1. cleaner signal
  2. decreased transformation load -> reduced time
  3. allow visualization of data
31
Q

naive bayes connection to PCA

A

PCA makes IID assumption true

32
Q

greedy layer-wise pretraining

A
  • helps with supervised learning
  • only objective of layer is to train itself
  • only requires a few epochs of training to fine-tune
33
Q

autoencoder

A
  • J=|X-s(S(XW)W.T)|**2
  • non-linear PCA
  • linear activation gives PCA
  • predicts itself (auto=self)
  • square error for x < 0 and x > 0
  • cross-entropy works for 0 ≤ x ≤ 1
  • different biases for hidden and output layers
  • shared weights (type of regularization
34
Q

types of autoencoders

A
  1. Denoising autoencoder
  2. Sparse autoencoder
  3. Variational autoencoder (VAE)
  4. Contractive autoencoder (CAE)
35
Q

bottleneck network

A
  • find useful and efficient representation
  • compress V dimension into D in the encoding phase
36
Q

stacking

A
  • we only care about the “inner” representation
  • discard the 2nd half each time
  • each layer smaller in size
  1. train an autoencoder on X with HL output Z;
  2. train a 2nd autoencoder on Z;
  3. repeat
37
Q

denoising autencoder

A
  • an autoencoder that is missing information
  • attempt to address identity-function risk by randomly corrupting input
  • its like training with dropout
  • missing data not always represented by 0
    • mask the cost in back propagation
38
Q

sparse autoencoder

A
  • when you train an autoencoder, the hidden units in the middle layer would fire (activate) too frequently, for most training samples
  • sparsity constraint: lower the activation rate so that they only activate for a small fraction of the training examples
  • sparse because each unit only activates to a certain type of inputs
39
Q

how to reproduce missing data

A
  • use autoencode or the likes
  • *
40
Q

recommender systems

A
  • google, youtube, netflix
  • use RBMs and autoencoders
    *
41
Q

naming convention

A
  • Recommender
    • N is instances
    • M is movies
    • K is latent variables
  • Typical Deep Learning
    • N is instances (same)
    • M is hidden units
    • K is number of classes
    • D is input dimensions
42
Q

boltzmann machine

A
  • everything is connected to everything
  • only works on trivial problems, unlike RBM
  • non-tractable
43
Q

restricted boltzmann machine (RBM)

A
  • bipartite graph
  • produces visible vector from hidden and vice versa
  • assume each hidden and output is IID
  • use sigmoid (to keep output 0-1)
  • act just like autoencoders
    • find latent variables
    • greedy pretraining
  • non-tractable, therefore, needs to be approximated hidden units only connected to the visible unit
  • square error for all X
  • cross-entropy works for 0 ≤ x ≤ 1
44
Q

categorical RBM

A
45
Q

categorical RBM shapes

A
  • h = M
  • V = D x K
    • D is number of movies
    • K is the number of classes
  • W = D x K x M
  • b = D x K
  • x = M
46
Q

deep belief network

A

deep network of stacked RBMs

47
Q

latent variable

A

variable that is hidden

48
Q

markov random field

A
  • graph of states
  • each node is a state
  • state of each node only relies on adjacent node
49
Q

relaxed bernouli constraint

A

in RBM,

  1. threshold
  2. let the values go to whatever
50
Q

hidden markov model

A

* unsupervised learning method

  • models sequences & classifies
  • modified using bayes rules to create separate model for each class
  • choose class with the highest likelihood
51
Q

HMM applications

A
  1. recognizing male or female voice
  2. NLP: macbook was created by “p(x|x-, x–)”
  3. stock predicition
  4. SEO and bounce rate optimization
  5. speech to text: words to sound
52
Q

markov property

A
  • p(x | all time) = p(x | x-)
  • strong assumption
53
Q

joint probability

A
  • chain rule of probability
  • p(s3, s2, s1) = p(s3|s2)p(s2|s1)p(s1)
54
Q

markov order

A
  • 1st order - p(y2 | y1)
  • 2nd order - p(y3 | y2, y1)
  • 3rd order - p(y4 | y3, y2, y1)
55
Q

markov model

A
  • M states
  • π = 1 x M, column vector
  • A = M x M weights
  • A(i, :).sum() = 1
56
Q

smoothing

A
  • add smoothing to account for unobserved
    • p = (count(x)+λ)/(N+λµ0)
  • can smooth average ratings too
    • r = (∑xi+λµ0)/(N+λ)
57
Q

HMM model

A
  1. πi = probability of starting at state i
  2. A(i, j)
    1. state transition matrix
    2. probability of going to state j from i
    3. P(tag1|tag0)
    4. if row sum to 1, A is markov matrix
  3. B(j, k)
    1. observation probability matrix
    2. probability of observing k given j
    3. p(word|tag)
58
Q

double embedded

A
  • layer 1: Markov model
  • layer 2: choose observation
59
Q

forward, forward-backward algorithm

A
60
Q

backward, forward-backwards algorithm

A
61
Q

forward backwards code

A
62
Q

viterbi algorithm

A
  • find the most probable hidden state sequence given the observed sequence
  • the same as forward backward algorithm except taking the max instead of sum
    • delta variable is like alpha
    • psi is similar to delta but argmax instead and no B term
  • have to backtrack the states
63
Q

vertirbi initialization

A
64
Q

vertirbi recursion

A
65
Q

vertirbi termination

A
66
Q

Baum-Welch algorithm

A
  • Is similar to Gaussian mixture models
  • Use expectation maximization (iterative algorithm) instead of maximum likelihood
  • review at some other date, it is a very expensive algorithm and should be avoided if possible
67
Q

variational autoencoder

A
  • Bayersian ML + Deep Learning
68
Q

generative adversarial network (GAN)

A
  • 2 networks competing against eachother
    *
69
Q

generative versus discriminitive

A
  • generative: P(X|Y)
  • discriminitive: P(Y|X)
    • hard to tell what is happening with weights
    • not satisfactorial to statistician
  • research shows that discriminitive works better
70
Q

types of recommender systems

A
  • autorec: uses autoencoder to recommend missing data
  • non-personalized:
    • shows popular items
      • confidence
    • takes time into account
    • product associations
  • collaborative filter: what YOU and what OTHERS similar to you like
71
Q

matrix factorization

A
  • SVD example is an: X=WSUT
    • S is diagonal matrix of the variances
    • more variance = more important
  • we only want square error where rating is known
  • break up X into components for less parameters stored
  • use K size ~50-100
  • Aij=log(Xij)~=wiTu
72
Q

account for age in recommender system?

A
  • hacker news: penalty*(ups-downs-1)0.8/(age+2)gravity
    • gravity = 1.8
    • age in hours
    • business rules:
      • penalize self, controversial and popular websites posts
  • reddit: sign(ups-downs)log{max(1, |ups-downs|)}+age/45000
    • age in seconds
73
Q

ranking with big systems

A
  • discard users w/ few movies in common
  • sum over nearest neighbors
    • take ones with higheset weights
    • say k = 25 to 50
    • closest absolute correlation
    • closest raw correlation
    • or use everybody
74
Q

training matrix factorization

A
  • J = ∑(tij-wiTuj)
  • dJ/dwi = ∑(tij-wiTuj)uj=∑ujujTwi
    • wi = (∑ujujT)-1∑rijujT
    • j is of Ψ
  • dJ/duj = ∑(tij-wiTuj)uj=∑ujujTwi
    • uj = (∑wjwjT)-1∑rijwi
    • j is of Ω
  • but really, rij=wiTuj+bi+cj
    • cj = rij-wiTuj+bi
    • bi = rij-wiTuj+cj
    • µ = global average