W8-12 Flashcards
New Material Included on the Exam
Def: First-order Markov models vs. higher-order (e.g. second-order) Markov models
A first-order Markov chain of observations {x_t} is one in which the distribution of a particular observation x_t is conditioned on the value of the previous observation x_(t-1)
A second-order Markov chain is one in which the conditional distribution of a particular observation x_t depends on the values of the two previous observations x_(t-1) and x_(t-2)
What is the process of learning Markov models?
You just need to count and calculate the transition probabilities based on training data you have
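That counting step can be sketched in Python; the weather sequence below is a made-up toy training set:

```python
from collections import Counter

# made-up toy training sequence of observed states
seq = ["rain", "rain", "sun", "sun", "sun", "rain", "sun"]

# count each (previous state, next state) pair ...
pair_counts = Counter(zip(seq, seq[1:]))
state_counts = Counter(seq[:-1])

# ... then normalize per previous state to get transition probabilities
trans = {(a, b): c / state_counts[a] for (a, b), c in pair_counts.items()}

print(trans[("sun", "sun")])  # 2 of the 3 transitions out of "sun" stay in "sun"
```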
Markov Models: If you apply a 2nd-order Markov model to predict the next English word, there are many three-word sequences that never appear in the training data but do appear in the test phase
During training you may need to reserve some probability mass for unseen cases, which is called ______.
smoothing
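One common form is add-one (Laplace) smoothing; a sketch with made-up counts:

```python
# made-up counts of words following some two-word history; "mat" was never seen
vocab = ["mat", "rug", "dog"]
counts = {"mat": 0, "rug": 4, "dog": 1}

# add-one (Laplace) smoothing: pretend every word was seen once more, so
# unseen continuations keep a small non-zero probability at test time
total = sum(counts.values()) + len(vocab)
smoothed = {w: (counts[w] + 1) / total for w in vocab}

print(smoothed["mat"])  # 0.125 instead of 0
```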
What are the following Markov model parameters?
Start/ Initial Probability
State transition probability
Emission probability
Start/ Initial - the probability of the chain starting in each hidden state
State transition - the probability of switching from one hidden state to another
Emission - the probability of observing x in a certain state (e.g. for the dice room, the probability of rolling a 1 with the fair vs. loaded die)
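For the dice-room example, the three parameter sets might look like this (all numbers are illustrative assumptions, not values from the cards):

```python
# hypothetical dice-room HMM: the hidden state is which die is in use
states = ["fair", "loaded"]

# start/initial probability
pi = {"fair": 0.5, "loaded": 0.5}

# state transition probability: chance of switching dice between rolls
A = {"fair":   {"fair": 0.95, "loaded": 0.05},
     "loaded": {"fair": 0.10, "loaded": 0.90}}

# emission probability: chance of each roll given the die in use
E = {"fair":   {r: 1 / 6 for r in range(1, 7)},
     "loaded": {r: 0.1 for r in range(1, 6)} | {6: 0.5}}

print(E["loaded"][6])  # 0.5: the loaded die shows a 6 half the time
```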
What is the key assumption of Hidden Markov Models?
The key independence property is that the hidden variables z_(t-1) and z_(t+1) are conditionally independent given z_t
In hidden Markov Models, what happens if hidden states are not given?
Unlike in (non-hidden) Markov models, observation x_t depends on all previous observations x_i (i < t)
What are the three main problems of HMMs (unsupervised)?
- Decoding: GIVEN an HMM specified by θ and a sequence x, FIND the sequence z that maximizes p(z|x,θ), i.e. z* = argmax_z p(z|x,θ)
- Evaluation: GIVEN an HMM specified by θ and a sequence x, FIND p(x|θ) = Σ_z p(x,z|θ)
- Learning: GIVEN a set of observed sequences X, FIND parameters θ that maximize p(X|θ), i.e. θ* = argmax_θ p(X|θ)
HMM: What is the naive method of finding z* and what is the problem with this method?
The naïve method of finding z* is to compute p(x,z|θ) for all values that z can take
Problem: there are exponentially many values of z
What is the main idea of Decoding called? and what kind of programming is this?
“Bugs on a Grid”
Dynamic programming
What are the “Bugs on a Grid” steps for decoding?
1. Put a bug in each state at t=1, holding the value of its initial probability.
2. Move each bug forward to time t+1 by making copies of it at all K nodes at t+1, incrementing the value of each copy by the probability of the transition and output emission.
3. At each node at time t+1, keep only the one of the K incoming bugs that carries the largest probability.
4. Go to step 2 until the bugs have reached time n.
5. Find the bug carrying the largest probability at time n, and read off the path that winning bug followed.
What is the time complexity and space (big O) of the Viterbi Algorithm?
Time complexity: O((K^2)n) Space O(Kn)
Underflows are a significant problem of the Viterbi algorithm. What is the solution?
The joint probabilities p(x_1, …, x_n, z_1, …, z_n) become extremely small, causing underflow.
Solution: take the logs of all values
V_k'(t+1) = log e_(k', x_(t+1)) + max_k [ V_k(t) + log a_(k,k') ]
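Putting the grid steps and the log trick together, a minimal log-space Viterbi sketch; the two-state dice parameters below (F = fair, L = loaded) are assumptions for illustration:

```python
import math

# assumed dice-room parameters, for illustration only
states = ["F", "L"]
pi = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {r: 0.1 for r in range(1, 6)} | {6: 0.5}}

def viterbi(x):
    # V_k'(t+1) = log e_(k', x_(t+1)) + max_k [ V_k(t) + log a_(k,k') ]
    V = {k: math.log(pi[k]) + math.log(E[k][x[0]]) for k in states}
    back = []
    for obs in x[1:]:
        prev, V, ptr = V, {}, {}
        for k2 in states:
            best = max(states, key=lambda k: prev[k] + math.log(A[k][k2]))
            V[k2] = math.log(E[k2][obs]) + prev[best] + math.log(A[best][k2])
            ptr[k2] = best
        back.append(ptr)
    # trace the winning bug's path backward from the best final state
    path = [max(states, key=V.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi([6, 6, 6]))  # ['L', 'L', 'L']: three sixes point to the loaded die
```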
What are the steps that are different when doing the “Bugs on a Grid” algorithm for Evaluation? What is another name for this algorithm?
Steps 3 and 5 change:
1. Put a bug in each state at t=1, holding the value of its initial probability.
2. Move each bug forward to time t+1 by making copies of it at all K nodes at t+1, incrementing the value of each copy by the probability of the transition and output emission.
3. At each node at time t+1, replace all K incoming bugs with a single bug carrying the sum of their values.
4. Go to step 2 until all bugs have reached time n.
5. Sum up the values carried by all bugs.
AKA “forward algorithm”
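The forward algorithm is the same grid sweep with a sum in place of the max; a sketch using the same assumed dice parameters (F = fair, L = loaded):

```python
# assumed dice-room parameters, for illustration only
states = ["F", "L"]
pi = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {r: 0.1 for r in range(1, 6)} | {6: 0.5}}

def forward(x):
    # f_k(1) = pi_k * e_k(x_1); then each node sums over its K incoming bugs
    f = {k: pi[k] * E[k][x[0]] for k in states}
    for obs in x[1:]:
        f = {k2: E[k2][obs] * sum(f[k] * A[k][k2] for k in states)
             for k2 in states}
    return sum(f.values())  # p(x | theta)

print(forward([6, 6, 6]))  # ≈ 0.0578: the probability of seeing three sixes
```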
What is the running time, and space required, for Forward and Backward algorithms?
Time complexity: O((K^2)n) Space O(Kn)
What does the posterior decoding calculation compute?
The most likely state at time t given sequence x:
argmax_k p(z_t = k | x)
(the forward and backward algorithms allow us to calculate this)
p(z_t = k | x) = f_k(t) b_k(t) / p(x)
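A sketch of posterior decoding that runs both passes and applies p(z_t = k | x) = f_k(t) b_k(t) / p(x); the dice parameters are again assumed for illustration:

```python
# assumed dice-room parameters, for illustration only
states = ["F", "L"]
pi = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
E = {"F": {r: 1 / 6 for r in range(1, 7)},
     "L": {r: 0.1 for r in range(1, 6)} | {6: 0.5}}

def posterior(x, t):
    # forward pass: f_k(t) = p(x_1..x_t, z_t = k)
    f = [{k: pi[k] * E[k][x[0]] for k in states}]
    for obs in x[1:]:
        f.append({k2: E[k2][obs] * sum(f[-1][k] * A[k][k2] for k in states)
                  for k2 in states})
    # backward pass: b_k(t) = p(x_(t+1)..x_n | z_t = k)
    b = [{k: 1.0 for k in states}]
    for obs in reversed(x[1:]):
        b.insert(0, {k: sum(A[k][k2] * E[k2][obs] * b[0][k2] for k2 in states)
                     for k in states})
    px = sum(f[-1][k] for k in states)
    return {k: f[t][k] * b[t][k] / px for k in states}

post = posterior([6, 6, 6], t=1)  # middle roll (0-indexed)
print(post)  # the middle six most likely came from the loaded die
```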
Problem 3 of HMM Learning requires finding parameters θ that maximize p(X|θ), even though the hidden states are unknown. We can compute the distribution of hidden states over paths if the HMM's parameters are known, and estimate the parameters if the hidden states are known. How do we resolve this circularity?
Use the EM algorithm
- initialize HMM parameters
- E STEP: Compute useful distributions about hidden states, which will be used to estimate model parameters in the M step
- M step: Compute max likelihood estimates for model parameters, as hidden states distributions are known
- repeat E step and M step until convergence criterion is satisfied (e.g., when the likelihood p(X|θ) no longer increases much)
How do we do the EM algorithm efficiently?
Use the “Bugs on a Grid” dynamic programming trick
- Initialize HMM parameters θ = (π, A, E)
- Given a training sequence x = (x_1, x_2, …, x_n)
E STEP: Compute useful distributions about hidden states that will be used to estimate model parameters later in the M step
1. Compute p(z_t = k | x) for all k = 1, …, K (using the forward-backward algorithm)
2. Compute p(z_t = k, z_(t+1) = k' | x) = f_k(t) a_(k,k') e_(k', x_(t+1)) b_k'(t+1) / p(x)
M STEP: Compute max likelihood estimates for model parameters, as if the hidden state distributions were known
1. Estimate initial probability
2. Estimate Transition probability
3. Estimate Emission Probability
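A sketch of the three estimates; gamma and xi below are made-up stand-ins for the E-step posteriors p(z_t = k | x) and p(z_t = k, z_(t+1) = k' | x):

```python
# made-up E-step posteriors for one short training sequence
states = ["F", "L"]
x = [6, 1, 6]
gamma = [{"F": 0.2, "L": 0.8}, {"F": 0.5, "L": 0.5}, {"F": 0.3, "L": 0.7}]
xi = [{("F", "F"): 0.15, ("F", "L"): 0.05, ("L", "F"): 0.35, ("L", "L"): 0.45},
      {("F", "F"): 0.25, ("F", "L"): 0.25, ("L", "F"): 0.05, ("L", "L"): 0.45}]

# 1. initial probability: pi_k = gamma_1(k)
pi = dict(gamma[0])

# 2. transition probability: expected k->k' transitions / expected visits to k
A = {k: {k2: sum(t[(k, k2)] for t in xi) /
             sum(g[k] for g in gamma[:-1]) for k2 in states} for k in states}

# 3. emission probability: expected times symbol o was emitted from state k
E = {k: {o: sum(g[k] for g, obs in zip(gamma, x) if obs == o) /
            sum(g[k] for g in gamma) for o in set(x)} for k in states}

print(A["F"]["L"])  # (0.05 + 0.25) / (0.2 + 0.5) ≈ 0.4286
```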
Define: Deep Learning
A set of machine learning algorithms that model high-level abstractions in data using artificial neural networks
Why deep learning now?
Computers finally have enough computational power
Neural net algorithms are inspired by ___
the brain
What does a Linear Neuron calculate?
A linear neuron just computes a weighted sum of input features
What is a multi-layer network of linear neurons?
Still a linear model
What does a Binary Threshold Neuron do?
- First compute weighted sum of the inputs
- then send out a fixed size spike of activity if the weighted sum exceeds a threshold
What does a Rectified Linear Neuron calculate?
They compute a weighted sum of their inputs, which is then passed through a rectified linear unit (sometimes called linear threshold neurons)
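The three neuron types above side by side, as a sketch (the weights w, bias b, and input x are arbitrary made-up values):

```python
def linear(w, b, x):
    return b + sum(wi * xi for wi, xi in zip(w, x))   # weighted sum of inputs

def binary_threshold(w, b, x, theta=0.0):
    return 1 if linear(w, b, x) >= theta else 0       # fixed-size spike or nothing

def relu(w, b, x):
    return max(0.0, linear(w, b, x))                  # rectified linear unit

w, b, x = [0.5, -1.0], 0.2, [2.0, 1.0]
print(linear(w, b, x), binary_threshold(w, b, x), relu(w, b, x))
```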
We have shown that the ____________ is relevant to the “intelligence” of systems or living beings.
What is the other factor that is important?
Number of neurons
structure (even if a computer has the same number of neurons as a human, it lacks common sense and can’t reason)
What is a Feed Forward Neural Network?
Most common type of neural network for practical applications
- first layer input
- last layer output
Compute a series of transformations
If a Feed Forward Neural Network has more than one hidden layer, we call it a ____
“deep” neural networks
What are the activities of the neurons in each layer of a feed forward neural network?
a non-linear function of the activities in the layer below
Feed Forward networks are fully connected
True or False?
True
It has been proven that a network with one hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy
True or False?
True
What are Large, Shallow models more prone to?
Overfitting
Shallow convolutional networks (sparsely connected feed-forward networks) start to overfit at around 20 million parameters, while deep ones can still benefit from having over 60 million; for the same number of parameters, deep models achieve better performance than shallow ones
Why do we want a deep network rather than that with only one layer?
Shallow networks may overfit more
Deep nets often achieve better performance with the same number of parameters