06 - Automatic Speech Recognition Flashcards
What makes ASR difficult (name two)?
Vocabulary size, speaker variability, the acoustic environment (noise), speaking style, accent/dialect, and the language itself (amount of available data, etc.)
What are the two concepts ASR is built from (think front end and back end)?
A front end for feature extraction and a back end for classification.
ASR is not a sequence-to-sequence problem. True or false?
False. ASR maps an input sequence (acoustic frames) to an output sequence (words).
What is the main trend in the third generation (3G) of ASR?
Hierarchical modelling of speech.
Acoustic model (AM), language model (LM), and pronunciation model (PM), built using HMMs and GMMs.
What is the main trend in the fourth generation (4G) of ASR?
End-to-end: a direct mapping from acoustics to words/characters.
What is the most important metric in ASR?
WER = Word Error Rate
What is the equation for the Word Error Rate (WER)?
WER = (I + S + D) / N
In short:
I = insertion
S = substitution
D = deletion
I: The number of insertion errors, i.e. extra words in the output that do not appear in the reference (ground truth).
S: The number of substitution errors, i.e. reference words replaced by different words in the output.
D: The number of deletion errors, i.e. reference words missing from the output.
N: The total number of words in the reference (ground truth) transcript.
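A minimal sketch of computing WER with dynamic programming (the function and variable names are illustrative, not from the course material):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution/match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("see" -> "sea") in a 4-word reference: WER = 0.25
print(wer("i can see it", "i can sea it"))
```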
What are the two building blocks of the ‘conventional’ build of an Acoustic Model in the ASR?
The HMM (Hidden Markov Model), which models the temporal structure of speech, and the GMM (Gaussian Mixture Model), which models the emission probability of a feature vector given an HMM state.
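A minimal sketch of the GMM emission score, assuming diagonal covariances (all names are illustrative):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a diagonal-covariance Gaussian mixture."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        # log w_k + log N(x; mu_k, diag(var_k))
        ll = (np.log(w)
              - 0.5 * np.sum(np.log(2 * np.pi * var))
              - 0.5 * np.sum((x - mu) ** 2 / var))
        log_probs.append(ll)
    # log-sum-exp over mixture components for numerical stability
    m = max(log_probs)
    return m + np.log(sum(np.exp(p - m) for p in log_probs))
```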
What are the three basic HMM problems?
- Evaluation/scoring (forward algorithm)
- Decoding (Viterbi algorithm)
- Training (Baum-Welch algorithm)
What does the ‘Forward Algorithm’ do?
It computes the observation likelihood P(X|λ), i.e. the probability of the observation sequence X given the model λ, by dynamic programming in O(N²T) time (N states, T frames).
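A minimal NumPy sketch of the forward recursion, assuming the emission likelihoods B[t, j] = P(x_t | state j) have already been computed (names are mine):

```python
import numpy as np

def forward(A, B, pi):
    """P(X | lambda) for an HMM.

    A:  (N, N) transition matrix, A[i, j] = P(state j | state i)
    B:  (T, N) emission likelihoods, B[t, j] = P(x_t | state j)
    pi: (N,)   initial state distribution
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                      # initialize with the first frame
    for t in range(1, T):
        # alpha[t, j] = sum_i alpha[t-1, i] * A[i, j] * B[t, j]
        alpha[t] = (alpha[t - 1] @ A) * B[t]  # O(N^2) work per frame
    return alpha[-1].sum()                    # sum over final states
```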
What is the ‘Viterbi Algorithm’ used for?
It is used to find the most likely (best) hidden state sequence for a given observation sequence.
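A matching Viterbi sketch: it replaces the forward algorithm's sum with a max and keeps backpointers to recover the best state sequence (same illustrative conventions as above):

```python
import numpy as np

def viterbi(A, B, pi):
    """Most likely hidden state sequence through an HMM."""
    T, N = B.shape
    delta = np.zeros((T, N))            # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # scores[i, j]: best path via i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[t]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```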
What does the ‘Baum-Welch algorithm’ do?
It combines the forward probabilities and the backward probabilities to estimate the state occupation probabilities, which are then used to re-estimate the HMM parameters (an instance of the EM algorithm).
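The state occupation probability is the E-step quantity at the heart of Baum-Welch. A sketch assuming the forward (alpha) and backward (beta) matrices are already computed; a full trainer would then re-estimate A, B, and pi from these quantities:

```python
import numpy as np

def state_occupation(alpha, beta):
    """gamma[t, j] = P(state j at time t | X, lambda).

    alpha: (T, N) forward probabilities
    beta:  (T, N) backward probabilities
    """
    gamma = alpha * beta
    # Each row sums to P(X | lambda), so per-frame normalization suffices
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```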
What probability does the ‘Acoustic Model’ calculate?
P(X|W): the probability of an acoustic feature-vector sequence X given an utterance (word sequence) W.
What does the language model calculate the probability of?
P(W): the prior probability of an utterance (word sequence) W.
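Together, the acoustic model and the language model give the fundamental equation of statistical ASR: by Bayes' rule, the decoder searches for the word sequence with the highest posterior, and P(X) can be dropped because it does not depend on W:

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \operatorname*{arg\,max}_{W} P(X \mid W)\, P(W)
```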
What is CFG?
Context-Free Grammar
When do we use CFG instead of n-gram?
When the possible sentences or phrases are simple and well-constrained, for example a small, fixed set of commands.
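A toy command grammar, written here with NLTK for concreteness (the grammar and vocabulary are illustrative, not from the course):

```python
import nltk

# Every valid utterance is an ACTION followed by an OBJECT
grammar = nltk.CFG.fromstring("""
S -> ACTION OBJECT
ACTION -> 'open' | 'close' | 'play'
OBJECT -> 'door' | 'window' | 'music'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("open door".split()):
    print(tree)  # (S (ACTION open) (OBJECT door))
```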
When do we use n-grams instead of CFG (context free grammar)?
When the sentences or phrases are complex and possibly incoherent. N-grams are used when word-sequence probabilities are important, e.g. in text prediction and speech classification.
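A minimal maximum-likelihood bigram sketch on a toy corpus (no smoothing; real systems smooth the counts and use higher-order n-grams):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, "the cat" twice
```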
What do we use the ‘Pronunciation model’ for?
To map words to their pronunciations (phone sequences), linking the acoustic model's subword units to the word-level language model so the decoder can find the optimal utterance W.
We cannot perform continuous speech recognition with HMMs. True or false?
False. We can and do.
What does ‘LVCSR’ stand for?
Large Vocabulary Continuous Speech Recognition
When we turn our attention to more modern approaches, the observation likelihood used in the ‘conventional’ methods is replaced by?
A (scaled) neural network posterior: the network's state posterior P(s|x) is divided by the state prior P(s) to obtain a scaled likelihood.
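A sketch of this hybrid DNN/HMM trick: since P(x|s) = P(s|x) P(x) / P(s) and P(x) is constant per frame, dividing the softmax posteriors by the state priors yields likelihoods up to a constant (names are illustrative):

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors):
    """Convert DNN state posteriors into scaled likelihoods for decoding.

    posteriors: (T, N) P(state | x_t) from the network's softmax
    priors:     (N,)   state priors, e.g. relative state frequencies in training
    Returns log P(x_t | state) + const, usable in place of GMM likelihoods.
    """
    return np.log(posteriors) - np.log(priors)
```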
Some disadvantages of using neural networks until 2012 were?
Training was computationally expensive (it still is, but more so back then), which kept the networks small and less complex than today's.
What progress has Automatic Speech Recognition made since 2012?
We have deeper and wider networks, and modern computers and GPUs allow for much faster training.
Time-delay neural networks (TDNNs) model richer temporal context. True or false?
True.
Since the GMM/HMM was proposed, the approaches have been more?
Statistical (data-driven).
Hierarchical (3G) systems are often split into three (sometimes more) components. What kinds of models are these commonly?
An acoustic model, a language model (n-grams), and a pronunciation model; sometimes also a separate decoder.
We have seen a huge improvement in Large Vocabulary Continuous Speech Recognition (LVCSR) since 2012 due to what?
The impact of deep learning.
Newer deep learning ASR systems replace the acoustic model of a hierarchical/statistical system with what?
A deep neural network, which gives better accuracy and context modelling.