06 - Automatic Speech Recognition Flashcards
What makes ASR difficult (name two)?
Vocabulary size, speaker variability, the acoustic environment (noise, channel), speaking style, accent/dialect, and the language itself (amount of available training data, etc.)
What two components is an ASR system built from (think front end and back end)?
A front end for feature extraction and a back end for classification.
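A toy sketch of the front-end idea: slice the waveform into overlapping frames and compute one feature per frame. Real front ends use MFCC or filterbank features; the frame/hop sizes and log-energy feature below are illustrative assumptions only.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Toy front end: split a waveform into overlapping frames and
    compute log-energy per frame (real systems use MFCC/filterbanks)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    feats = np.empty(n)
    for i in range(n):
        frame = signal[i * hop : i * hop + frame_len]
        feats[i] = np.log(np.sum(frame ** 2) + 1e-10)  # avoid log(0)
    return feats
```

The back end would then classify this sequence of feature vectors into words.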
ASR is not a sequence-to-sequence problem. True or false?
False
What is the main trend in the third generation (3G) of ASR?
Hierarchical modelling of speech.
AM -> PM -> LM (acoustic, pronunciation, and language models, conventionally built with HMMs and GMMs)
What is the main trend in the fourth generation (4G) of ASR?
End-to-end: a direct mapping from acoustics to words/characters.
What is the most important metric in ASR?
WER = Word Error Rate
What is the equation for the Word Error Rate (WER)?
WER = (I + S + D) / N
Simple:
I = Insertion
S = Substitution
D = Deletion
I: The number of insertion errors, which represents the additional or extra words inserted in the output compared to the reference (ground truth).
S: The number of substitution errors, which indicates the words that are substituted or replaced in the output compared to the reference.
D: The number of deletion errors, which represents the words that are missing or deleted in the output compared to the reference.
N: The total number of words in the reference (ground truth) transcript.
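The I, S, and D counts come from a minimum edit distance (Levenshtein) alignment between the hypothesis and the reference. A minimal sketch (the `wer` function name is mine):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j]: edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.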
What are the two building blocks of the ‘conventional’ build of an Acoustic Model in the ASR?
It is the HMM (Hidden Markov Models) and the GMM (Gaussian Mixture Models)
What are the three basic HMM problems?
- Evaluation/scoring (forward algorithm)
- Decoding (viterbi algorithm)
- Training (baum-welch algorithm)
What does the ‘Forward Algorithm’ do?
It computes the probability P(X|λ) of an observation sequence X given the model parameters λ, using dynamic programming in O(N²T) time (N states, T observations).
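A minimal forward algorithm for a discrete-emission HMM, assuming NumPy arrays for the parameters (initial probabilities `pi`, transitions `A`, emissions `B`):

```python
import numpy as np

def forward(pi, A, B, obs):
    """P(X|lambda) via the forward algorithm.
    pi: initial state probs (N,), A: transitions (N, N),
    B: emission probs (N, M), obs: observation indices, length T.
    The recursion costs O(N^2 * T)."""
    alpha = pi * B[:, obs[0]]               # alpha_1(i) = pi_i * b_i(x_1)
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]  # sum over predecessor states
    return alpha.sum()                      # P(X|lambda)
```

Summing the final alphas gives the total probability of the observation sequence over all state paths.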
What is the ‘Viterbi Algorithm’ used for?
It is used to find the single best (most likely) state sequence given the observation sequence.
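The Viterbi recursion is the forward algorithm with the sum replaced by a max, plus backpointers to recover the best path. A sketch in log space, using the same (pi, A, B) parameterization assumed above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for discrete observations obs."""
    N, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)           # backpointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)      # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                 # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Working in log probabilities avoids numerical underflow on long observation sequences.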
What does the ‘Baum-Welch algorithm’ do?
It combines the forward probabilities and the backward probabilities to estimate the state occupation probabilities, which are then used to re-estimate the HMM parameters (an instance of the EM algorithm).
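The state occupation probability gamma[t, i] = P(state i at time t | X, lambda) is the normalized product of forward (alpha) and backward (beta) probabilities. A sketch of that step (full Baum-Welch re-estimation omitted), with the `state_occupancy` name being mine:

```python
import numpy as np

def state_occupancy(pi, A, B, obs):
    """gamma[t, i] = P(state i at time t | X, lambda),
    computed from forward (alpha) and backward (beta) passes."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                                     # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                               # each row sums to P(X|lambda)
    return gamma / gamma.sum(axis=1, keepdims=True)    # normalize per time step
```

Each row of gamma sums to one; Baum-Welch uses these occupancies to update pi, A, and B.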
What probability does the ‘Acoustic Model’ calculate?
P(X|W). That is, the probability of an acoustic feature vector sequence X given a word sequence (utterance) W.
What does the Language Model calculate the probability of?
P(W), that is, the prior probability of a word sequence (utterance) W.
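The acoustic model and language model are combined via Bayes' rule in the standard decoding objective, maximized over candidate word sequences W (the constant P(X) drops out of the argmax):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} P(X \mid W)\, P(W)
```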
What is CFG?
Context-Free Grammar: a rule-based grammar formalism that can serve as a constrained language model in small-vocabulary ASR tasks (e.g., command-and-control).
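A toy illustration of a CFG as rewrite rules, here encoded as a Python dict with a random expander; the grammar itself (a made-up voice-command task) is purely illustrative.

```python
import random

# Toy context-free grammar: nonterminal -> list of possible productions.
GRAMMAR = {
    "S":      [["VERB", "OBJECT"]],
    "VERB":   [["call"], ["open"], ["play"]],
    "OBJECT": [["music"], ["contacts"], ["settings"]],
}

def generate(symbol="S"):
    """Recursively expand a nonterminal into a terminal string."""
    if symbol not in GRAMMAR:
        return symbol                          # terminal: emit as-is
    production = random.choice(GRAMMAR[symbol])
    return " ".join(generate(s) for s in production)
```

In a CFG-constrained recognizer, only sentences derivable from S are valid hypotheses.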