W8-12 Flashcards
New Material Included on the Exam
Def: First-order Markov models vs higher-order (i.e. second-order) Markov models
A first-order Markov chain of observations {x(subscript t)} in which the distribution of particular observation {x(subscript t)} s conditioned on the value of the previous observation {x(subscript t - 1)}
A second-order Markov chain, in which the conditional distribution of a particular observation {x(subscript t)} depends on the values of the two previous observations {x(subscript t - 1)} and {x(subscript t - 2)}
What is the process of learning Markov models?
You just need to count and calculate the transition probabilities based on training data you have
Markov Models: Consider if you apply 2nd order Markov models to predict the next English word, there are many three-word sequences you may not have observed training data but only in the test phase
During training you may need to save some probabilities for unseen cases, which is called ______.
smoothing
What are the following Markov model parameters?
Start/ Initial Probability
State transition probability
Emission probability
Start/ Initial (self explanitory)
State transition - the probability of switching from one hidden state to another
Emission - The probability of observing x of a certain state (i.e. for dice room probability of rolling a 1 fair vs loaded)
Why is the key assumption of Hidden Markov Models?
The key independence property is that the hidden variables {z(subscript t - 1)} and {z(subscript t + 1)} are conditional independent given {z(subscript t)}
In hidden Markov Models, what happens if hidden states are not given?
Unlike in (non-hidden_ Markov models, observation x(subscript t) depends on all previous x(subscript i) (i
What are the three main problems of HMMs (unsupervised)?
- Decoding: GIVEN an HMM specified by θ, and a sequence x, FIND the sequence z that maximizes p(z|x,θ),
i. e., z* = argmax(sub z) p(z|x,θ) - Evaluation : GIVEN an HMM specified by θ, and a sequence x, FIND p(x|θ) = Sum(sub z) p(x, z|θ)
- Learning: GIVEN a set of observed sequences X, FIND parameters θ that maximize p(X|θ),
i. e. θ* = argmax(subθ) p(X|θ)
HMM: What is the naive method of finding z* and what is the problem with this method?
The naïve method of finding z* is to compute all p(x,z|θ) for all values that z can takes
Problem: Exponential numbers of z
What is the main idea of Decoding called? and what kind of programming is this?
“Bugs on a Grid”
Dynamic programming
What are the “Bugs on a Grid” steps for decoding?
- put bugs in each state at t=1, holding values of initial probabilities.
- move each bug forward to time t+1 by making copies of each of them to all K node at t+1 and incrementing the value of each copy by the probability of the transition and output emission.
- at each node of time t+1, only keep one copy of K incoming bugs which carries the largest probability.
- go to 2 until bugs have reached time n
- find the bug carrying the largest probability at time n, get the path that winning bug has passed
What is the time complexity and space (big O) of the Viterbi Algorithm?
Time complexity: O((K^2)n) Space O(Kn)
Underflows are a significant problem of the Viterbi algorithm. What is the solution?
p(x1,…,xn,z1,…,zn) - these numbers become extremely small - underflow
Solution: Take the logs of all values
Vk’(t+1) = log (ek’, xt+1) + maxk(Vk(t) + log ak,k’)
What are the steps that are different when doing the “Bugs on a Grid” algorithm for Evaluation? What is another name for this algorithm?
Steps 3 and 5 change:
- put bugs in each state at t=1, holding values of initial probabilities.
- move each bug forward to time t+1 by making copies of each of them to all K node at t+1 and incrementing the value of each copy by the probability of the transition and output emission.
- at each node of time t+1, replace all the bugs with a single bug carrying the sum of their values
- go to 2 until all bugs have reached time n
- sum up values on all bugs.
AKA “forward algorithm”
What is the running time, and space required, for Forward and Backward algorithms?
Time complexity: O((K^2)n) Space O(Kn)
What is the posterior decoding calculation calculate?
What is the mst likely state at tiem t given sequence x?
argmaxkP(zt= k | x)
(forward algorithm and backwards algorithm allow us to calculate)
p(zt= k|x) = fk(t) bk(t)/ p(x)
Problem 3 of HMM Learning requires that parameters θ that maximize p(X|θ) are found. How can we compute the distribution of hidden states over paths if an HMM’s parameters are known?
Use the EM algorithm
- initialize HMM parameters
- E STEP: Compute useful distributions about hidden states, which will be used to estimate model parameters in the M step
- M step: Compute max likelihood estimates for model parameters, as hidden states distributions are known
- repeat E step and M step until convergence criterion is satisfied (e.g., when likelihood p(X|θ) does not increase too much
How do we do the EM algorithm efficiently?
Use the “Bugs of a grid” dynamic programming trick
- Initialize HMM parameters θ =(pi,A,E)
- given a training sequence x = {x1, x2, …, xn}
E STEP: Compute useful distribution about hidden states that will be used to estimate model parameters later in the M step
1. compute p(zt= k|x) for all k = 1,…,K (calculate using forward-backward algorithm)
2. Compute p(zt=k,zt+1=k’|x) = fk(t) ak,k’ek’,xt+1bk(t) / p(x)
M STEP: Compute max likelihood estimates for model parameters, given hidden state distribution is known
1. Estimate initial probability
2. Estimate Transition probability
3. Estimate Emission Probability
Define: Deep Learning
A set of machine learning algorithms that model high-level abstractions in data using artificial neural networks
Why deep learning now?
Computers finally have computation power
Neuro net algorithms are inspired by ___
the brain
What does a Linear Neuron calculate?
A linear neuron just computes a weighted sum of input features
What is a multi layer of linear neurons?
Still a linear model
What does a Binary Threshold Neron do?
- First compute weighted sum of the inputs
- then send out a fixed size spike of activity if the weighted sum exceeds a threshold
What does a Rectified Linear Neuron calculate?
They compute a weighted sum of their inputs. Then it is passed to a rectified linear unit. (sometimes called linear threshold neurons)
We have shown that the ____________ is relevant to “intellegence” of systems or living beings.
What is the other factor that is important?
Number of neurons
structure (even if computer has same number as a human, lacks the common sense, can’t reason)
What is a Feed Forward Neural Network?
Most common type of neural network for practical applications
- first layer input
- last layer output
Compute a series of transformations
If there is more than one layer of a Feed Fforward neural Network, we call them ____
“deep” neural networks
What are the activities of the neurons in each layer of a feed forward neural network?
a non-linear function of the activities in the layer below
Feed Forward networks are fully connected
True or False?
True
It has been proven that networks with one hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy
True or False?
True
What are Large, Shallow models more prone to?
Overfit
convolutional networks (sparsely connected feed-forward) are shallow models - overfit around 20 million parameters while deep ones can benefit from having over 60 million and for the same number of parameters, deep models achieve better performance than shallow ones
Why do we want a deep network rather than that with only one layer?
Shallow networks may overfit more
Deep net often achieve better performance with same number of parameters
How deep do we want our neural nets?
task dependent (no universal answer)
For some problems, the depth of models are naturally given by the problems (i.e., in recurrent neural nets a network ofen has a depth “by time” (number of steps))
What does the Universal Aproximator Theorem state?
1 hidden layer is enough to represnt an approximation of any function to any arbitrary degree of accuracy
How many layers classifies a shallow model?
3 layers or less
What is the problem with this method of training Feed Forward Nets? :
Randomly perturb one weight and see if it improves performance. If so, save the change.
Very inefficient. Need to do multiple forward passes on a representative set of training data just to change one weight.
What is the problem with this method of training Feed Forward Nets? :
Randomly perturb all the weights in parallel and correlate the performance gain with the weight changes
It is still hard to figure out on which weight(s) the change improves performance
When training feed forward neral nets we don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity. What is the key idea to doing this?
Minimize an error function by computing the derivative w.r.t. model parameters and set that to be zero ‘
(FF neral nets, Derivatives are computed with an algorithm called: BACKWARD PROPOGATION AKA BACKPROP)
Once you get derivatives, there are many optimization algorithms you can chose to use to find good parameters
How do you train Feed-Norward networks using forward and backward propogation? (4 steps)
- Forward propogation. Compute hidden units and output
- Compute loss/errors
- Back-propogation. Compute derivatives
- Use some existing optimization algorithm to find good parameters (those minimizing errors)
In nerual networks optimization algorithms what is not guaranteed?
Do not guarantee find globally optimal solution (often, the error function is not convex and may have many local optimums)
How do you check the correctness of your backprop implementation?
- Suppose we have implemted backprop to compute dE/dθ
- If the backprop is implemented correctly we should have
dE/dθ ~= (E(θ+e)-E(θ-e))/2e
In neural network, θ is a vector that includes the weights w of all layers (including biases), The outline of the checking algorithm is:
- we randomly pick a θ, and then compute E(θ+e) and E(θ-e) with forward propagation
- we also compute E(θ) with forward propogation and based on that compute dE/dθ with backprop
- check if the equation above is satisfied or not
What are the two types of memoryless models for sequences?
- Autoregressive models: Predict the next term in a sequence from a fixed number of previous terms (i.e. Markov Models)
Feed-Forward neural nets: generalize autoregressive models by using one or more layers of non-linear hidden units
(both cases: no memory about longer history)
When hidden Markov model generates data. Ad each time step it must select one of its hidden states. So with k hidden states it can only rememer ——– bits about what it generated so far
log(k)
What does the information that the first half of a spoken sentence of an HMM contains about the second half?
- The syntax needs to fit (e.g. number and tense agreement)
- The semantics needs to fit. The intonation needs to fit.
- The accent, rate, volume, and vocal tract characteristics must all fit
All these aspects combined could be 100 bits of information that the first half of an utterance needs to convey to the second half. 2^100 is big!!!
What two properties makes Recurrent Neural networks more powerful?
- Distributed hidden state that allows them to store a lot of information about the past efficiently
- non-linear dynamics that allows them to update their hidden state in complicated ways
RNNs: all time steps share the same — and — functions
transition and emission functions
How can we think of the recurrent net?
as a layered, feed-forward net with shared weights and then train the feed-forward net with weight constraints
How can we think of the backprop of a RNN training algorithm in the time domain?
- the forward pass builds up a stack of the activities of all the units at each time step
- the backward pass peels activities off the stack to compute the error derivatives at each time step
- after the backward pass we add together the derivatives at all the different times for each weight
What are the RNN learning restrictions?
The automaton is restricted to be exactly one state at each time. The hidden units are restricted to have exactly one vector of activity at each time.
If RNN is more powerful than HMM why has HMM been used for many years?
It is difficult to train RNN: time consuming bout also because of a phenomenon called vanishing/exploding gradient
RNNS. What happens to the magnitude of the gradients as we backpropogate through many layers?
- If the weights are small, gradients shrink exponentially
- If the weights are big the gradients grow exponentially
- in addition, activation function can also cause vanishing gradients
(if the network only has a few hidden layers, this won’t make too much trouble)
RNN is often trained on long sequences and the gradients can easily explode or vanish, and RNN —- ——, which can make the problem be serious.
What do RNNs have difficulty daling with as a result?
share parameters
RNNs have difficulty dealing with long-range dependencies
What is (the most famous) solution to vanishing/exploding gradients?
The design of specific types of RNN architecture -> Long Short-Term Memory (LSTM) - uses gates to control information flow in RNN
(Another less widely used is gated recurrent unit)
What is the purpose of the input gate and the forget gate in the Long Short-Term Memory model?
The input gate decides how much information is allowed to get in memory cell from the input
The Forget gate (should actually be called the memory gate) decides how much information is allowed from the previous memory
What is the purpose of the sigmoid function in the Long Short-Term Memory model?
Ensures the gate values are between 0 and 1
Convolutional Nerual Networks (LeNet)…
Yann LeCun and others developed a really good recognizer for handwritten digits by using backpropogation in a feedforward net with:???
- many hidden layers
- use filters of replicated units to share parameters
- use subsampling operation
What is the convolution state of CNN?
The first key component of CNN.
A filter shifts on input to detect local features. At the convolutional layer of a CNN, we often have multiple numbers of such filters (each has its own parameters)/
What is the difference between convolution and fully connected?
Convolution has fewer parameters! al
(i.e. in the example from the notes/ question the first output is only connected to 9 inputs not fully connected to all 36)
also has shared weights
Does convolution share the same parameters across al spatial locations? What does traditional matrix multiplication do?
Yes, traditional mm does not share any parameters
What is subsampling used for? (what are the 3 reasons it is used)
Can subsample pixels to make image smaller, so (1) fewer parameters are used to characterize the image, (2) computation is less expensive, and (3) the view of each filter at an upper layer can see a wide range.
Max pooling introduces some — for image translation.
invariance
What is another widely-used pooling, instead of taking max that is used?
Average pooling - to take average of an area
What is used in CNN after convolutions and subsampling?
fully connected layers
How does CNN control the number of model parameters?
- reducing number of connections
- shared weights on the edges
- pooling further reduces the complexity
What is ImageNet?
~14 million labeled images, 20k classes
used for CNN learning
Training CNN - how is training done for Max pooling subsampling
why?
for a pooling area with N-by_N input units, you just need to backprop gradient to the unit that has the max value (the winning unit)
why? if you make a small change in other non-max units, it will not change the pooling results, indicating gradient is zero w.r.t. that non-winning unit
Training CNN - how is training done for Average pooling subsampling
take derivative w.r.t. the corresponding units (i.e. the error is multiplied by 1/N^2 and assigned to the whole pooling area (all units get this same gradient value))
We can improve the performance of a model by putting prior knowledge about the task into the network by designing appropriate:
What is another way we can use prior knowledge?
- connectivity
- weight constraints
- neuron activation functions
Less intrusive then hand-designing the features \
Alternatively can use prior knowledge to create a whole lot more training data
Consider a neural net with one hidden layer. Each time we present a training example, we randomly omit each hidden unit with probability 0.5. What is this called?
dropout
works very well for different neural network architectures
Can also be beneficial for input layer but with a higher probability of keeping input unit
Training different good models and AVERAGING/assembling the models can often lead to —– ———
better performance
Why dropout?
(dropout as a form of model averaging) Dropout can be seen as sampling from different “thinned” architectures (each hidden layer now use different nomber of hidden units) —> at test time could sample many different architectures and take the average of their output distributions
- best to use all hidden units, but to halve their outgoing weights
- we sample from 2^H models
- the sharing of the weights means that every model is very strongly regularized
- much better than Lasso or squared penalities that pull the weights towards zero
If you deep neural net is significantly overfitting, dropout will ……????
usually reduce the number of errors by a lot
True or False
Any net that uses “early stopping” does worse by using dropout
Falso - can do better
at the cost of taking quite a lot longer to train
True or False
If your deep neural net is not overfitting you should be using a bigger one
True
What is an autoencoder?
a neural network trained to copy its input to its output
Traditionally used for dimentionality reduction as well as feature learning
hoping that training it will result in taking on useful properties
How can we use backprop to implement PCA?
- design a feed forward net trying to make the output be the same as the input with a central code layer
- hidden and output layers linear, will learn hidden units that are a linear function of the data and minimize the squared reconstruction error
- M hidden units will span the space as the first M components found by PCA, although they are not exactly the same vectors
The input-to-hidden part of an autoencoder corresponds to a(n) ———
The hidden-to-output part corresponds to a(n) ————-
encoder
decoder
What are the two types of autoencoders?
Undercomplete: the code layer h has smaller dimension than input vector x (more widely used)
overcomplete: the code layer h has larger dimension than input vector x
If the encoder and decoder are allowed too much capacity what is the risk?
the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data
- useful autoencoders should not copy perfectly but ristricted by design to copy only approximately (in doing so learns useful properties of the data to reconstruct the data)
What are some ways to design autoencoders to copy only approximately, so that the system learns useful properties of the data
- sparsity of representation
- smallness of the derivative of the representation
- robustness of noise of missing inputs
What is a denoising autoencoder?
- one that recieves a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output
How does a denoising autoencoder learn the reconstructed distribution?
- choose a training sample from the training data
- obtain corrupted version from curruption process
- use training sample pair to estimate reconstruction