W8-12 Flashcards

Question 1

Q

Def: First-order Markov models vs higher-order (i.e. second-order) Markov models

Answer

A

A first-order Markov chain of observations {x(subscript t)} in which the distribution of particular observation {x(subscript t)} s conditioned on the value of the previous observation {x(subscript t - 1)}

A second-order Markov chain, in which the conditional distribution of a particular observation {x(subscript t)} depends on the values of the two previous observations {x(subscript t - 1)} and {x(subscript t - 2)}

Question 2

Q

What is the process of learning Markov models?

Answer

A

You just need to count and calculate the transition probabilities based on training data you have

Question 3

Q

Markov Models: Consider if you apply 2nd order Markov models to predict the next English word, there are many three-word sequences you may not have observed training data but only in the test phase
During training you may need to save some probabilities for unseen cases, which is called ______.

Answer

A

smoothing

Question 4

Q

What are the following Markov model parameters?
Start/ Initial Probability
State transition probability
Emission probability

Answer

A

Start/ Initial (self explanitory)
State transition - the probability of switching from one hidden state to another
Emission - The probability of observing x of a certain state (i.e. for dice room probability of rolling a 1 fair vs loaded)

Question 5

Q

Why is the key assumption of Hidden Markov Models?

Answer

A

The key independence property is that the hidden variables {z(subscript t - 1)} and {z(subscript t + 1)} are conditional independent given {z(subscript t)}

Question 6

Q

In hidden Markov Models, what happens if hidden states are not given?

Answer

A

Unlike in (non-hidden_ Markov models, observation x(subscript t) depends on all previous x(subscript i) (i

Question 7

Q

What are the three main problems of HMMs (unsupervised)?

Answer

A

Decoding: GIVEN an HMM specified by θ, and a sequence x, FIND the sequence z that maximizes p(z|x,θ),
i. e., z* = argmax(sub z) p(z|x,θ)
Evaluation : GIVEN an HMM specified by θ, and a sequence x, FIND p(x|θ) = Sum(sub z) p(x, z|θ)
Learning: GIVEN a set of observed sequences X, FIND parameters θ that maximize p(X|θ),
i. e. θ* = argmax(subθ) p(X|θ)

Question 8

Q

HMM: What is the naive method of finding z* and what is the problem with this method?

Answer

A

The naïve method of finding z* is to compute all p(x,z|θ) for all values that z can takes
Problem: Exponential numbers of z

Question 9

Q

What is the main idea of Decoding called? and what kind of programming is this?

Answer

A

“Bugs on a Grid”

Dynamic programming

Question 10

Q

What are the “Bugs on a Grid” steps for decoding?

Answer

A

put bugs in each state at t=1, holding values of initial probabilities.
move each bug forward to time t+1 by making copies of each of them to all K node at t+1 and incrementing the value of each copy by the probability of the transition and output emission.
at each node of time t+1, only keep one copy of K incoming bugs which carries the largest probability.
go to 2 until bugs have reached time n
find the bug carrying the largest probability at time n, get the path that winning bug has passed

Question 11

Q

What is the time complexity and space (big O) of the Viterbi Algorithm?

Answer

A

Time complexity: O((K^2)n)
Space O(Kn)

Question 12

Q

Underflows are a significant problem of the Viterbi algorithm. What is the solution?

Answer

A

p(x1,…,xn,z1,…,zn) - these numbers become extremely small - underflow
Solution: Take the logs of all values
Vk’(t+1) = log (ek’, xt+1) + maxk(Vk(t) + log ak,k’)

Question 13

Q

What are the steps that are different when doing the “Bugs on a Grid” algorithm for Evaluation? What is another name for this algorithm?

Answer

A

Steps 3 and 5 change:

put bugs in each state at t=1, holding values of initial probabilities.
move each bug forward to time t+1 by making copies of each of them to all K node at t+1 and incrementing the value of each copy by the probability of the transition and output emission.
at each node of time t+1, replace all the bugs with a single bug carrying the sum of their values
go to 2 until all bugs have reached time n
sum up values on all bugs.

AKA “forward algorithm”

Question 14

Q

What is the running time, and space required, for Forward and Backward algorithms?

Answer

A

Time complexity: O((K^2)n)
Space O(Kn)

Question 15

Q

What is the posterior decoding calculation calculate?

Answer

A

What is the mst likely state at tiem t given sequence x?
argmaxkP(zt= k | x)

(forward algorithm and backwards algorithm allow us to calculate)

p(zt= k|x) = fk(t) bk(t)/ p(x)

Question 16

Q

Problem 3 of HMM Learning requires that parameters θ that maximize p(X|θ) are found. How can we compute the distribution of hidden states over paths if an HMM’s parameters are known?

Answer

A

Use the EM algorithm

initialize HMM parameters
E STEP: Compute useful distributions about hidden states, which will be used to estimate model parameters in the M step
M step: Compute max likelihood estimates for model parameters, as hidden states distributions are known
repeat E step and M step until convergence criterion is satisfied (e.g., when likelihood p(X|θ) does not increase too much

Question 17

Q

How do we do the EM algorithm efficiently?

Answer

A

Use the “Bugs of a grid” dynamic programming trick
- Initialize HMM parameters θ =(pi,A,E)
- given a training sequence x = {x1, x2, …, xn}
E STEP: Compute useful distribution about hidden states that will be used to estimate model parameters later in the M step
1. compute p(zt= k|x) for all k = 1,…,K (calculate using forward-backward algorithm)
2. Compute p(zt=k,zt+1=k’|x) = fk(t) ak,k’ek’,xt+1bk(t) / p(x)
M STEP: Compute max likelihood estimates for model parameters, given hidden state distribution is known
1. Estimate initial probability
2. Estimate Transition probability
3. Estimate Emission Probability

Question 18

Q

Define: Deep Learning

Answer

A

A set of machine learning algorithms that model high-level abstractions in data using artificial neural networks

Question 19

Q

Why deep learning now?

Answer

A

Computers finally have computation power

Question 20

Q

Neuro net algorithms are inspired by ___

Answer

A

the brain

Question 21

Q

What does a Linear Neuron calculate?

Answer

A

A linear neuron just computes a weighted sum of input features

Question 22

Q

What is a multi layer of linear neurons?

Answer

A

Still a linear model

Question 23

Q

What does a Binary Threshold Neron do?

Answer

A

First compute weighted sum of the inputs

- then send out a fixed size spike of activity if the weighted sum exceeds a threshold

Question 24

Q

What does a Rectified Linear Neuron calculate?

Answer

A

They compute a weighted sum of their inputs. Then it is passed to a rectified linear unit. (sometimes called linear threshold neurons)

Question 25

Q

We have shown that the ____________ is relevant to “intellegence” of systems or living beings.

What is the other factor that is important?

Answer

A

Number of neurons

structure (even if computer has same number as a human, lacks the common sense, can’t reason)

Question 26

Q

What is a Feed Forward Neural Network?

Answer

A

Most common type of neural network for practical applications
- first layer input
- last layer output
Compute a series of transformations

Question 27

Q

If there is more than one layer of a Feed Fforward neural Network, we call them ____

Answer

A

“deep” neural networks

Question 28

Q

What are the activities of the neurons in each layer of a feed forward neural network?

Answer

A

a non-linear function of the activities in the layer below

Question 29

Q

Feed Forward networks are fully connected

True or False?

Question 30

Q

It has been proven that networks with one hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy
True or False?

Question 31

Q

What are Large, Shallow models more prone to?

Answer

A

Overfit

convolutional networks (sparsely connected feed-forward) are shallow models - overfit around 20 million parameters while deep ones can benefit from having over 60 million and for the same number of parameters, deep models achieve better performance than shallow ones

Question 32

Q

Why do we want a deep network rather than that with only one layer?

Answer

A

Shallow networks may overfit more

Deep net often achieve better performance with same number of parameters

Question 33

Q

How deep do we want our neural nets?

Answer

A

task dependent (no universal answer)

For some problems, the depth of models are naturally given by the problems (i.e., in recurrent neural nets a network ofen has a depth “by time” (number of steps))

Question 34

Q

What does the Universal Aproximator Theorem state?

Answer

A

1 hidden layer is enough to represnt an approximation of any function to any arbitrary degree of accuracy

Question 35

Q

How many layers classifies a shallow model?

Answer

A

3 layers or less

Question 36

Q

What is the problem with this method of training Feed Forward Nets? :
Randomly perturb one weight and see if it improves performance. If so, save the change.

Answer

A

Very inefficient. Need to do multiple forward passes on a representative set of training data just to change one weight.

Question 37

Q

What is the problem with this method of training Feed Forward Nets? :
Randomly perturb all the weights in parallel and correlate the performance gain with the weight changes

Answer

A

It is still hard to figure out on which weight(s) the change improves performance

Question 38

Q

When training feed forward neral nets we don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity. What is the key idea to doing this?

Answer

A

Minimize an error function by computing the derivative w.r.t. model parameters and set that to be zero ‘
(FF neral nets, Derivatives are computed with an algorithm called: BACKWARD PROPOGATION AKA BACKPROP)
Once you get derivatives, there are many optimization algorithms you can chose to use to find good parameters

Question 39

Q

How do you train Feed-Norward networks using forward and backward propogation? (4 steps)

Answer

A

Forward propogation. Compute hidden units and output
Compute loss/errors
Back-propogation. Compute derivatives
Use some existing optimization algorithm to find good parameters (those minimizing errors)

Question 40

Q

In nerual networks optimization algorithms what is not guaranteed?

Answer

A

Do not guarantee find globally optimal solution (often, the error function is not convex and may have many local optimums)

Question 41

Q

How do you check the correctness of your backprop implementation?

Answer

A

Suppose we have implemted backprop to compute dE/dθ
If the backprop is implemented correctly we should have
dE/dθ ~= (E(θ+e)-E(θ-e))/2e

In neural network, θ is a vector that includes the weights w of all layers (including biases), The outline of the checking algorithm is:

we randomly pick a θ, and then compute E(θ+e) and E(θ-e) with forward propagation
we also compute E(θ) with forward propogation and based on that compute dE/dθ with backprop
check if the equation above is satisfied or not

Question 42

Q

What are the two types of memoryless models for sequences?

Answer

A

Autoregressive models: Predict the next term in a sequence from a fixed number of previous terms (i.e. Markov Models)

Feed-Forward neural nets: generalize autoregressive models by using one or more layers of non-linear hidden units

(both cases: no memory about longer history)

Question 43

Q

When hidden Markov model generates data. Ad each time step it must select one of its hidden states. So with k hidden states it can only rememer ——– bits about what it generated so far

Question 44

Q

What does the information that the first half of a spoken sentence of an HMM contains about the second half?

Answer

A

The syntax needs to fit (e.g. number and tense agreement)
The semantics needs to fit. The intonation needs to fit.
The accent, rate, volume, and vocal tract characteristics must all fit

All these aspects combined could be 100 bits of information that the first half of an utterance needs to convey to the second half. 2^100 is big!!!

Question 45

Q

What two properties makes Recurrent Neural networks more powerful?

Answer

A

Distributed hidden state that allows them to store a lot of information about the past efficiently
non-linear dynamics that allows them to update their hidden state in complicated ways

Question 46

Q

RNNs: all time steps share the same — and — functions

Answer

A

transition and emission functions

Question 47

Q

How can we think of the recurrent net?

Answer

A

as a layered, feed-forward net with shared weights and then train the feed-forward net with weight constraints

Question 48

Q

How can we think of the backprop of a RNN training algorithm in the time domain?

Answer

A

the forward pass builds up a stack of the activities of all the units at each time step
the backward pass peels activities off the stack to compute the error derivatives at each time step
after the backward pass we add together the derivatives at all the different times for each weight

Question 49

Q

What are the RNN learning restrictions?

Answer

A

The automaton is restricted to be exactly one state at each time. The hidden units are restricted to have exactly one vector of activity at each time.

Question 50

Q

If RNN is more powerful than HMM why has HMM been used for many years?

Answer

A

It is difficult to train RNN: time consuming bout also because of a phenomenon called vanishing/exploding gradient

Question 51

Q

RNNS. What happens to the magnitude of the gradients as we backpropogate through many layers?

Answer

A

If the weights are small, gradients shrink exponentially
If the weights are big the gradients grow exponentially
in addition, activation function can also cause vanishing gradients

(if the network only has a few hidden layers, this won’t make too much trouble)

Question 52

Q

RNN is often trained on long sequences and the gradients can easily explode or vanish, and RNN —- ——, which can make the problem be serious.

What do RNNs have difficulty daling with as a result?

Answer

A

share parameters

RNNs have difficulty dealing with long-range dependencies

Question 53

Q

What is (the most famous) solution to vanishing/exploding gradients?

Answer

A

The design of specific types of RNN architecture -> Long Short-Term Memory (LSTM) - uses gates to control information flow in RNN

(Another less widely used is gated recurrent unit)

Question 54

Q

What is the purpose of the input gate and the forget gate in the Long Short-Term Memory model?

Answer

A

The input gate decides how much information is allowed to get in memory cell from the input
The Forget gate (should actually be called the memory gate) decides how much information is allowed from the previous memory

Question 55

Q

What is the purpose of the sigmoid function in the Long Short-Term Memory model?

Answer

A

Ensures the gate values are between 0 and 1

Question 56

Q

Convolutional Nerual Networks (LeNet)…
Yann LeCun and others developed a really good recognizer for handwritten digits by using backpropogation in a feedforward net with:???

Answer

A

many hidden layers
use filters of replicated units to share parameters
use subsampling operation

Question 57

Q

What is the convolution state of CNN?

Answer

A

The first key component of CNN.
A filter shifts on input to detect local features. At the convolutional layer of a CNN, we often have multiple numbers of such filters (each has its own parameters)/

Question 58

Q

What is the difference between convolution and fully connected?

Answer

A

Convolution has fewer parameters! al
(i.e. in the example from the notes/ question the first output is only connected to 9 inputs not fully connected to all 36)
also has shared weights

Question 59

Q

Does convolution share the same parameters across al spatial locations? What does traditional matrix multiplication do?

Answer

A

Yes, traditional mm does not share any parameters

Question 60

Q

What is subsampling used for? (what are the 3 reasons it is used)

Answer

A

Can subsample pixels to make image smaller, so (1) fewer parameters are used to characterize the image, (2) computation is less expensive, and (3) the view of each filter at an upper layer can see a wide range.

Question 61

Q

Max pooling introduces some — for image translation.

Answer

A

invariance

Question 62

Q

What is another widely-used pooling, instead of taking max that is used?

Answer

A

Average pooling - to take average of an area

Question 63

Q

What is used in CNN after convolutions and subsampling?

Answer

A

fully connected layers

Question 64

Q

How does CNN control the number of model parameters?

Answer

A

reducing number of connections
shared weights on the edges
pooling further reduces the complexity

Answer 62

A

~14 million labeled images, 20k classes

used for CNN learning

Answer 63

A

for a pooling area with N-by_N input units, you just need to backprop gradient to the unit that has the max value (the winning unit)

why? if you make a small change in other non-max units, it will not change the pooling results, indicating gradient is zero w.r.t. that non-winning unit

Answer 64

A

take derivative w.r.t. the corresponding units (i.e. the error is multiplied by 1/N^2 and assigned to the whole pooling area (all units get this same gradient value))

Answer 65

A

connectivity
weight constraints
neuron activation functions

Less intrusive then hand-designing the features \

Alternatively can use prior knowledge to create a whole lot more training data

Answer 66

A

dropout

works very well for different neural network architectures
Can also be beneficial for input layer but with a higher probability of keeping input unit

Answer 67

A

better performance

Answer 68

A

(dropout as a form of model averaging) Dropout can be seen as sampling from different “thinned” architectures (each hidden layer now use different nomber of hidden units) —> at test time could sample many different architectures and take the average of their output distributions

best to use all hidden units, but to halve their outgoing weights
we sample from 2^H models
the sharing of the weights means that every model is very strongly regularized
much better than Lasso or squared penalities that pull the weights towards zero

Answer 69

A

usually reduce the number of errors by a lot

Answer 70

A

Falso - can do better

at the cost of taking quite a lot longer to train

Answer 71

A

a neural network trained to copy its input to its output
Traditionally used for dimentionality reduction as well as feature learning

hoping that training it will result in taking on useful properties

Answer 72

A

design a feed forward net trying to make the output be the same as the input with a central code layer
hidden and output layers linear, will learn hidden units that are a linear function of the data and minimize the squared reconstruction error
M hidden units will span the space as the first M components found by PCA, although they are not exactly the same vectors

Answer 73

A

encoder

decoder

Answer 74

A

Undercomplete: the code layer h has smaller dimension than input vector x (more widely used)
overcomplete: the code layer h has larger dimension than input vector x

Answer 75

A

the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data
- useful autoencoders should not copy perfectly but ristricted by design to copy only approximately (in doing so learns useful properties of the data to reconstruct the data)

Answer 76

A

sparsity of representation
smallness of the derivative of the representation
robustness of noise of missing inputs

Answer 77

A

one that recieves a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output

Answer 78

A

choose a training sample from the training data
obtain corrupted version from curruption process
use training sample pair to estimate reconstruction

W8-12 Flashcards

New Material Included on the Exam