Unsupervised Learning Flashcards
reason for clustering
categorize data without labels
cluster searching
- search for closest cluster
- search within the cluster
** Note: much faster than searching each document
k-means training
- initialize k random centers
- assign each point to its nearest center
- recalculate the cluster centers as the mean of their assigned points (numpy sketch below)
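A minimal numpy sketch of the loop above (random init, assign, re-average); `k`, the iteration cap, and the convergence check are illustrative choices, not from the card.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Hard k-means: random centers, assign points, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init: k random points
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return centers, labels
```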
responsibility
- r(n, k): how much cluster k is "responsible" for point n; a soft assignment weight in [0, 1] that sums to 1 over the clusters
fuzzy k-means (soft k-means) algorithm
- reflects confidence in cluster
- initialize k random centers
- calculate responsibilities: a softmax over (negative, scaled) squared distances
- recalculate centers as responsibility-weighted means (hard k-means is the special case where r is 0 or 1; sketch below)
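A sketch of the soft-assignment step, assuming a stiffness parameter `beta` (my notation, not on the card): responsibilities are a softmax over negative squared distances, and centers become responsibility-weighted means.

```python
import numpy as np

def responsibilities(X, centers, beta=1.0):
    """r[n, k] is proportional to exp(-beta * ||x_n - c_k||^2), rows sum to 1."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def update_centers(X, r):
    """Each center is the responsibility-weighted mean of all points."""
    return (r.T @ X) / r.sum(axis=0)[:, None]
```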
k-means cost function
- Euclidean (sum of squared distances to assigned centers): sensitive to feature scale and to the choice of K; K = N drives the cost to 0
- Davies-Bouldin index: rewards low within-cluster (intra) spread and high between-cluster (inter) separation; lower is better
where k-means fails
- donut
- asymmetric (high-variance, elongated) gaussian
- smaller cluster in bigger one
k-means problems
- choice of k
- cost is non-convex, so it can converge to a bad local minimum
- sensitive to initial clusters
- can only find spherical clusters
- does not take density into account
ideal k value
- the "elbow": where the steep drop in cost meets the flat part of the cost-vs-K curve
- not at the minimum of the cost (that would be K = N)
hierarchical (agglomerative) clustering
- greedy algorithm
- start with every point as its own cluster and repeatedly merge the two closest clusters until one group contains all points
- cut the tree at an intra-cluster distance threshold to get the final clusters
joining clusters (linkage)
- single linkage: min distance between any two points
- complete linkage: max distance between any two points
- mean distance (UPGMA): mean distance between all pairs of points
- Ward's: merge the pair that minimizes the increase in within-cluster variance
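A sketch of agglomerative clustering with SciPy covering the linkage choices above; the toy data and the distance threshold are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(50, 2)                          # toy data
Z = linkage(X, method='ward')                       # or 'single', 'complete', 'average' (UPGMA)
labels = fcluster(Z, t=5.0, criterion='distance')   # cut the tree at a distance threshold
# dendrogram(Z)                                     # plots the merge tree (needs matplotlib)
```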
dendrogram
- tree diagram of the merges; the height of each join is the distance at which the two clusters were merged
- cutting it at a height/threshold gives a flat clustering
distances equations
- Euclidean: d(x, y) = sqrt(∑i (xi − yi)²)
- squared Euclidean: ∑i (xi − yi)²
- Manhattan: ∑i |xi − yi|
- cosine distance: 1 − (x·y)/(‖x‖‖y‖)
valid distance metrics
- non-negative: d(x, y) ≥ 0, positive
- identity: d(x, y) = 0 <=> x = y, definite
- symmetry: d(x, y) = d(y, x)
- subadditive: d(x, z) ≤ d(x, y) + d(y, z)
self-organizing maps (SOM)
- apply competitive learning as opposed to error-correction learning
- reduce the input to a discretized, low-dimensional (grid) representation
- that grid is called a map
gaussian mixture model (GMM)
- form of density estimation
- gives an approximation of the probability distribution
- use when you notice multimodal data
- p(x) = π1 N(x | µ1, Σ1) + π2 N(x | µ2, Σ2); in general p(x) = ∑k πk N(x | µk, Σk)
- πk = probability that x belongs to Gaussian k
- the πk sum to 1
- introduce a latent Z for which πk = p(Z = k)
GMM algorithm
- initialize πk, µk, Σk (e.g., random means, uniform π)
- E-step: responsibilities r(n, k) = πk N(xn | µk, Σk) / ∑j πj N(xn | µj, Σj)
- M-step: Nk = ∑n r(n, k); πk = Nk/N; µk = ∑n r(n, k) xn / Nk; Σk = ∑n r(n, k)(xn − µk)(xn − µk)ᵀ / Nk
- repeat until the log-likelihood stops improving (sketch below)
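A minimal EM sketch for a GMM with numpy and SciPy's multivariate normal; the small `reg` term added to each covariance is my addition to guard against the singular-covariance problem described below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=100, reg=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                        # mixing weights
    mu = X[rng.choice(N, size=k, replace=False)]    # init means at random data points
    sigma = np.array([np.cov(X.T) + reg * np.eye(D) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: r[n, j] = pi_j N(x_n | mu_j, sigma_j), normalized over j
        r = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, sigma from the responsibilities
        nk = r.sum(axis=0)
        pi = nk / N
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(D)
    return pi, mu, sigma, r
```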
responsibility k-means v. GMM
- fuzzy k-means: softmax over (negative, scaled) squared distances
- GMM: responsibility = πk N(x | µk, Σk), normalized across the k clusters
independent component analysis (ICA)
- special type of blind source separation
- cocktail party problem of listening in on one person’s speech in a noisy room
- data: x=(x1,…,xm)T
- hidden components: s=(s1,…,sn)T
- mixing matrix: A, n by n
- unmixing matrix recovers the sources: W = A⁻¹, so s = Wx
- generative model (noiseless): x=As
- noisy: x=As+n
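A cocktail-party-style sketch assuming scikit-learn's FastICA is available; the two source signals and the mixing matrix are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # hidden sources s (n = 2)
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                         # mixing matrix A
x = s @ A.T                                        # observations x = As

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)    # recovered sources (up to permutation/scale)
W = ica.components_             # estimated unmixing matrix, roughly A^{-1}
```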
benefits of GMM over fuzzy k-means
- GMM learns a separate πk per cluster, whereas fuzzy k-means has a single shared stiffness parameter β
- GMM can handle elliptical clusters (full covariance), whereas fuzzy k-means clusters are effectively spherical
GMM v. Fuzzy K-means
- GMM: per-cluster πk and covariance Σk (elliptical clusters); responsibilities from πk N(x | µk, Σk)
- fuzzy k-means: one β, Euclidean distance only (spherical clusters); responsibilities from a softmax over distances
singular covariance problem
- when the points assigned to a Gaussian are (nearly) identical or too few, its covariance collapses toward 0
- inverting that covariance is a division by ~0: a singularity
- a degenerate solution the optimization can fall into (like a bad local optimum)
ways of dealing with singular covariance problem
- diagonal covariance
  - a solution to singular covariance
  - speeds up computation (easy to take the inverse)
- spherical gaussian
- shared covariance
  - each gaussian has the same covariance
diagonal covariance
- covariance matrix with zeros off the diagonal: only per-dimension variances
- assumes the features are uncorrelated within a cluster; D parameters instead of D(D+1)/2, and the inverse is trivial
kernel density estimation
fit a PDF by placing a kernel (e.g., a Gaussian) on every data point and summing; a GMM can also be used for density estimation
expectation-maximization (EM)
- generalization of the GMM fitting procedure to any model with latent variables
- alternates E and M steps, like coordinate ascent on a lower bound of the likelihood
- guarantees the likelihood never decreases each iteration (converges to a local maximum-likelihood solution)
dimensionality reduction
1) cleaner signal; 2) fewer dimensions -> less computation and faster training; 3) allows visualization of the data
singular value decomposition (SVD)
- X = U S Vᵀ; U and V have orthonormal columns, S is diagonal
- the singular values (diagonal of S) measure how much variance each component carries
PCA
- Z = XQ
- Q holds the eigenvectors of the covariance of X, so the transform is a rotation into the principal axes (lengths are preserved)
- high variance = signal
- low variance = noise
- keep the left-most columns of Z (enough to retain ~95% of the variance)
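A numpy sketch in the card's notation (Z = XQ with Q the covariance eigenvectors); the 95% variance threshold follows the card.

```python
import numpy as np

def pca(X, var_to_keep=0.95):
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)              # D x D covariance
    vals, Q = np.linalg.eigh(cov)               # eigen-decomposition (ascending)
    order = np.argsort(vals)[::-1]              # sort by decreasing variance
    vals, Q = vals[order], Q[:, order]
    Z = Xc @ Q                                  # rotate into the principal axes
    frac = np.cumsum(vals) / vals.sum()
    d = np.searchsorted(frac, var_to_keep) + 1  # smallest d reaching ~95% variance
    return Z[:, :d], Q[:, :d], vals
```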
how does PCA help?
- cleaner signal
- decreased computational load -> reduced time
- allow visualization of data
naive bayes connection to PCA
PCA decorrelates the features, which makes naive Bayes' independence assumption (approximately) hold
greedy layer-wise pretraining
- helps a later supervised learning stage
- each layer is trained on its own (unsupervised) objective
- after pretraining, only a few epochs are needed to fine-tune
autoencoder
- J = ‖X − s(s(XW)Wᵀ)‖², where s is the activation (e.g., sigmoid)
- non-linear PCA
- a linear activation gives PCA
- predicts itself (auto = self)
- squared error when x can take any real value
- cross-entropy works for 0 ≤ x ≤ 1
- different biases for the hidden and output layers
- shared weights W and Wᵀ (a type of regularization)
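A sketch of a shared-weight autoencoder assuming PyTorch; sigmoid output plus squared error mirrors the card's J = ‖X − s(s(XW)Wᵀ)‖² (PyTorch's `linear` applies Wᵀ, so the algebra is transposed but equivalent). Sizes, data, and optimizer settings are placeholders.

```python
import torch
import torch.nn.functional as F

class TiedAutoencoder(torch.nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)  # shared weights
        self.b_h = torch.nn.Parameter(torch.zeros(d_hidden))   # hidden bias
        self.b_o = torch.nn.Parameter(torch.zeros(d_in))       # output bias (separate)

    def forward(self, x):
        z = torch.sigmoid(F.linear(x, self.W, self.b_h))          # encode
        return torch.sigmoid(F.linear(z, self.W.t(), self.b_o))   # decode with tied W

model = TiedAutoencoder(d_in=784, d_hidden=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.rand(256, 784)                 # placeholder data in [0, 1]
for _ in range(10):
    x_hat = model(X)                     # the target is the input itself
    loss = ((X - x_hat) ** 2).mean()     # squared error; cross-entropy also works for x in [0, 1]
    opt.zero_grad()
    loss.backward()
    opt.step()
```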
types of autoencoders
- Denoising autoencoder
- Sparse autoencoder
- Variational autoencoder (VAE)
- Contractive autoencoder (CAE)
bottleneck network
- find useful and efficient representation
- compress the V-dimensional input down to a smaller D-dimensional code in the encoding phase
stacking
- we only care about the "inner" (hidden) representation
- discard the decoder (second) half each time
- each layer smaller in size
- train an autoencoder on X with HL output Z;
- train a 2nd autoencoder on Z;
- repeat
denoising autoencoder
- an autoencoder trained on input with information deliberately removed
- addresses the identity-function risk by randomly corrupting the input
- it's like training with dropout on the inputs
- missing data is not always well represented by 0
- mask the cost in backpropagation so only observed entries contribute

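A small numpy sketch of the corruption and masked-cost ideas from the card above; `p_drop` is an arbitrary corruption rate.

```python
import numpy as np

def corrupt(X, p_drop=0.3, seed=0):
    """Randomly zero out a fraction of the inputs (the corruption/noise)."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(X.shape) > p_drop).astype(float)   # 1 = kept, 0 = corrupted
    return X * mask, mask

def masked_squared_error(X, X_hat, mask):
    """Only penalize reconstruction where the entry was actually observed."""
    return np.sum(mask * (X - X_hat) ** 2) / mask.sum()
```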
sparse autoencoder
- without a sparsity constraint, the hidden units in the middle layer fire (activate) for most training samples
- sparsity constraint: lower the activation rate so that they only activate for a small fraction of the training examples
- sparse because each unit only activates to a certain type of inputs
how to reproduce missing data
- use an autoencoder or a similar model (e.g., an RBM)

recommender systems
- google, youtube, netflix
- use RBMs and autoencoders
naming convention
- Recommender
  - N is instances
  - M is movies
  - K is latent variables
- Typical Deep Learning
  - N is instances (same)
  - M is hidden units
  - K is number of classes
  - D is input dimensions
boltzmann machine
- everything is connected to everything
- only works on trivial problems, unlike the RBM
- training is intractable
restricted boltzmann machine (RBM)
- bipartite graph
- produces visible vector from hidden and vice versa
- hidden units are conditionally independent given the visible units (and vice versa)
- use sigmoid (to keep output 0-1)
- act just like autoencoders
- find latent variables
- greedy pretraining
- training is intractable, therefore it needs to be approximated
- hidden units are only connected to visible units (no connections within a layer)
- square error for all X
- cross-entropy works for 0 ≤ x ≤ 1
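The card only says training is intractable and must be approximated; contrastive divergence (CD-1) is the usual approximation, sketched here for a Bernoulli RBM in numpy. Shapes and learning rate are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, b_v, b_h, lr=0.01, rng=None):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM.
    v0: (batch, D) visible data; W: (D, M); b_v: (D,); b_h: (M,)."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + b_h)                       # p(h=1 | v), "up" pass
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample the hidden units
    pv1 = sigmoid(h0 @ W.T + b_v)                     # reconstruct the visibles, "down" pass
    ph1 = sigmoid(pv1 @ W + b_h)                      # "up" again from the reconstruction
    # approximate gradient: positive phase minus negative phase
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```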
categorical RBM
- RBM whose visible units are K-way categorical (one-hot) instead of binary, e.g., a 5-star rating per movie
- uses a softmax over the K categories for each visible variable; hidden units stay Bernoulli
categorical RBM shapes
- h = M
- V = D x K
  - D is number of movies
  - K is the number of classes
- W = D x K x M
- b = D x K
- x = M
deep belief network
deep network of stacked RBMs
latent variable
variable that is hidden
markov random field
- graph of states
- each node is a state
- each node's state depends only on its adjacent (neighboring) nodes
relaxed Bernoulli constraint
- in an RBM, instead of thresholding unit values to 0 or 1
- let the values take on whatever (real) values the activations give
hidden markov model
- unsupervised learning method
- models sequences & classifies them
- combined with Bayes' rule to create a separate model for each class
- choose the class with the highest likelihood
HMM applications
- recognizing male vs. female voices
- NLP next-word prediction, e.g. "Macbook was created by ___": p(x | x₋₁, x₋₂)
- stock prediction
- SEO and bounce-rate optimization
- speech to text (hidden words/phonemes generate the observed sounds)
markov property
- p(xt | all previous x) = p(xt | xt−1)
- strong assumption
joint probability
- chain rule of probability
- p(s3, s2, s1) = p(s3 | s2, s1) p(s2 | s1) p(s1); with the Markov property this becomes p(s3 | s2) p(s2 | s1) p(s1)
markov order
- 1st order - p(y2 | y1)
- 2nd order - p(y3 | y2, y1)
- 3rd order - p(y4 | y3, y2, y1)
markov model
- M states
- π = 1 x M row vector of initial state probabilities
- A = M x M transition matrix
- each row sums to 1: A(i, :).sum() = 1
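A sketch sampling a sequence from (π, A) with numpy; the 3-state chain below is made up.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])            # initial state distribution (sums to 1)
A = np.array([[0.9, 0.05, 0.05],          # A[i, j] = p(next = j | current = i)
              [0.1, 0.8,  0.1],
              [0.2, 0.2,  0.6]])          # each row sums to 1

rng = np.random.default_rng(0)
state = rng.choice(3, p=pi)               # draw the first state from pi
seq = [state]
for _ in range(9):
    state = rng.choice(3, p=A[state])     # transition using row `state` of A
    seq.append(state)
print(seq)
```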
smoothing
- add smoothing to account for unobserved
- p = (count(x)+λ)/(N+λV), where V is the number of possible values
- can smooth average ratings too
- r = (∑xi+λµ0)/(N+λ), shrinking toward a prior/global mean µ0
HMM model
- πi = probability of starting in state i
- A(i, j)
  - state transition matrix
  - probability of going to state j from state i
  - e.g., p(tag1 | tag0)
  - if each row sums to 1, A is a Markov (stochastic) matrix
- B(j, k)
  - observation probability matrix
  - probability of observing k given state j
  - e.g., p(word | tag)
doubly embedded (stochastic process)
- layer 1: a Markov chain over the hidden states
- layer 2: each hidden state chooses/emits an observation
forward, forward-backward algorithm
- α(t, i) = p(x1..xt, zt = i)
- init: α(1, i) = πi B(i, x1)
- recursion: α(t, j) = [∑i α(t−1, i) A(i, j)] B(j, xt)
- termination: p(x) = ∑i α(T, i)
backward, forward-backward algorithm
- β(t, i) = p(xt+1..xT | zt = i)
- init: β(T, i) = 1
- recursion: β(t, i) = ∑j A(i, j) B(j, xt+1) β(t+1, j)
- used together with α for smoothing and for the Baum-Welch (EM) updates
forward-backward code (numpy sketch below)
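A numpy sketch of the forward and backward passes in the α/β notation above; scaling is omitted for brevity, so it can underflow on long sequences.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, i] = p(x_1..x_t, z_t = i)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha                              # p(x) = alpha[-1].sum()

def backward(x, A, B):
    """beta[t, i] = p(x_{t+1}..x_T | z_t = i)."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```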
viterbi algorithm
- find the most probable hidden state sequence given the observed sequence
- the same recursion as the forward algorithm, except taking the max instead of the sum
- delta variable is like alpha
- psi is similar to delta but argmax instead and no B term
- have to backtrack the states
viterbi initialization
- δ(1, i) = πi B(i, x1); ψ(1, i) = 0
viterbi recursion
- δ(t, j) = maxi [δ(t−1, i) A(i, j)] · B(j, xt)
- ψ(t, j) = argmaxi [δ(t−1, i) A(i, j)] (no B term)
viterbi termination
- p* = maxi δ(T, i); last state = argmaxi δ(T, i)
- backtrack: state(t) = ψ(t+1, state(t+1)) (sketch below)
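The initialization/recursion/termination above as a numpy sketch (no log-space scaling, for brevity).

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable hidden state sequence for observations x."""
    T, M = len(x), len(pi)
    delta = np.zeros((T, M))
    psi = np.zeros((T, M), dtype=int)
    delta[0] = pi * B[:, x[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) A(i, j)
        psi[t] = scores.argmax(axis=0)              # best previous state for each j (no B term)
        delta[t] = scores.max(axis=0) * B[:, x[t]]  # recursion
    # termination + backtracking
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states, delta[-1].max()
```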
Baum-Welch algorithm
- similar in spirit to fitting a Gaussian mixture model
- uses expectation-maximization (an iterative algorithm) instead of a closed-form maximum-likelihood solution
- review at some other date, it is a very expensive algorithm and should be avoided if possible
variational autoencoder
- Bayesian ML + Deep Learning
generative adversarial network (GAN)
- 2 networks competing against each other: a generator produces fake samples and a discriminator tries to tell real from fake
generative versus discriminative
- generative: models P(X|Y) (and the joint distribution)
- discriminative: models P(Y|X) directly
- hard to tell what is happening with the weights
- not satisfying to a statistician
- research shows that discriminative models usually work better
types of recommender systems
- autorec: uses an autoencoder to predict (recommend) the missing ratings
- non-personalized:
  - shows popular items
  - confidence
  - takes time into account
- product associations
  - recommends items that are frequently bought/viewed together
- collaborative filtering: what YOU like and what OTHERS similar to you like
matrix factorization
- SVD is an example of matrix factorization: X = W S Uᵀ
- S is diagonal matrix of the variances
- more variance = more important
- we only want square error where rating is known
- break up X into components for less parameters stored
- use K size ~50-100
- can also factor log-transformed counts: Aij = log(Xij) ≈ wiᵀuj
account for age in recommender system?
- hacker news: penalty · (ups − downs − 1)^0.8 / (age + 2)^gravity
  - gravity = 1.8
  - age in hours
  - business rules (the penalty): penalize self posts, controversial posts, and posts from overly popular websites
- reddit: sign(ups − downs) · log(max(1, |ups − downs|)) + age/45000
  - age in seconds
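The two formulas above as plain Python; `penalty` stands in for whatever business-rule multiplier is applied, and the base-10 log for reddit is my assumption (the card just says "log").

```python
import math

def hacker_news_score(ups, downs, age_hours, penalty=1.0, gravity=1.8):
    # max(..., 0) is a guard I added for posts with more downs than ups
    return penalty * max(ups - downs - 1, 0) ** 0.8 / (age_hours + 2) ** gravity

def reddit_score(ups, downs, age_seconds):
    diff = ups - downs
    sign = (diff > 0) - (diff < 0)                  # +1, 0, or -1
    return sign * math.log10(max(1, abs(diff))) + age_seconds / 45000
```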
ranking with big systems
- discard users w/ few movies in common
- sum over nearest neighbors
- take the ones with the highest weights
- say k = 25 to 50
- closest absolute correlation
- closest raw correlation
- or use everybody
training matrix factorization
- J = ∑(rij − wiᵀuj)², summed over the known ratings only
- set dJ/dwi = 0: ∑_{j∈Ψi}(rij − wiᵀuj)uj = 0 ⇒ (∑_{j∈Ψi} uj ujᵀ) wi = ∑_{j∈Ψi} rij uj
- wi = (∑_{j∈Ψi} uj ujᵀ)⁻¹ ∑_{j∈Ψi} rij uj, where Ψi = items rated by user i
- symmetrically: uj = (∑_{i∈Ωj} wi wiᵀ)⁻¹ ∑_{i∈Ωj} rij wi, where Ωj = users who rated item j
- but really, rij ≈ wiᵀuj + bi + cj + µ
- bi = average over j∈Ψi of (rij − wiᵀuj − cj − µ)
- cj = average over i∈Ωj of (rij − wiᵀuj − bi − µ)
- µ = global average rating (alternating these updates is the ALS sketch below)
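An alternating-least-squares sketch matching the update equations above; it assumes a dense ratings matrix `R` with a 0/1 `mask` of known entries (the Ψ/Ω sets), skips the bias terms, and adds a small ridge term `reg` (my addition) so the solves stay well-conditioned.

```python
import numpy as np

def als(R, mask, K=50, n_iter=20, reg=1e-3, seed=0):
    """Alternating least squares for R ≈ W Uᵀ on the known entries only."""
    N, M = R.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, K)) * 0.1     # user factors w_i
    U = rng.standard_normal((M, K)) * 0.1     # item factors u_j
    I = reg * np.eye(K)
    for _ in range(n_iter):
        for i in range(N):                    # w_i = (sum_j u_j u_jᵀ)⁻¹ sum_j r_ij u_j
            known = mask[i] > 0               # Psi_i: items user i rated
            if known.any():
                Uj = U[known]
                W[i] = np.linalg.solve(Uj.T @ Uj + I, Uj.T @ R[i, known])
        for j in range(M):                    # u_j = (sum_i w_i w_iᵀ)⁻¹ sum_i r_ij w_i
            known = mask[:, j] > 0            # Omega_j: users who rated item j
            if known.any():
                Wi = W[known]
                U[j] = np.linalg.solve(Wi.T @ Wi + I, Wi.T @ R[known, j])
    return W, U
```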