07 - End-2-End Speech Recognition Flashcards
Let us start with a reminder. What is an end-2-end system or model?
A system or model that takes an input and directly produces the output without relying on intermediate stages or processes.
HMM-based systems are normally split into 3 separate models, which are?
The Acoustic Model (AM), the Language Model (LM) and the Pronunciation Model (PM).
HMM-based systems are quick and efficient to build and do not require expert knowledge. True or false?
False. HMM-based systems require expert knowledge and are very time-consuming to build, because the 3 underlying models have to be designed and trained separately.
For the acoustic model (whose input is the sound signal), what are the main issues (there are 3)?
- The input is of variable length.
- The input is often much larger than the output.
- We do not know how the input audio features align with the output characters.
What are the two main architectures used to solve the ‘alignment problem’ in end-2-end models?
CTC (Connectionist Temporal Classification) and ‘seq2seq-attention’.
CTC was proposed in 2006 as a method for training an acoustic model without requiring segmentation and alignment. True or false?
True.
An early approach in speech recognition output phonemes and NOT words. Is it still an end-2-end model when the output is phonemes?
No. It is not the complete process. You would have to convert the phonemes into words afterwards, thus relying on an intermediate process.
Because repeated labels are merged, CTC on its own cannot represent consecutive character repetitions, for example the 2 l’s in the word ‘hello’. What special token does CTC introduce to combat this?
The blank (ϵ) token.
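What does the blank token (ϵ) allow the CTC mapping to represent?
A word with two consecutive identical letters: a blank between the two characters stops them from being merged into one. It also marks time steps that emit no character, which is needed because the input and output can have different lengths.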
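The mapping from an alignment to its output can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual two-step rule (merge consecutive repeats, then drop blanks); the function name `ctc_collapse` and the example alignments are made up for this card.

```python
from itertools import groupby

BLANK = "ϵ"  # symbol used here for the blank token (an assumption of this sketch)

def ctc_collapse(alignment, blank=BLANK):
    """Map a per-time-step alignment to its output string."""
    # Step 1: merge runs of identical consecutive symbols.
    merged = [symbol for symbol, _ in groupby(alignment)]
    # Step 2: drop the blank tokens.
    return "".join(symbol for symbol in merged if symbol != blank)

# A blank between the two l's keeps them apart.
print(ctc_collapse("hheϵllϵlloo"))  # hello
# Without the blank, the run of l's is merged into one.
print(ctc_collapse("hheϵllllloo"))  # helo
```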
The probability of an alignment in the CTC is what?
The product of the per-time-step probabilities of the symbols in that alignment.
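As a small worked example, the probability of one alignment is obtained by multiplying one probability per time step. The three-frame distribution below is invented for illustration, and `alignment_probability` is a hypothetical helper name.

```python
import math

BLANK = "ϵ"

# Hypothetical network outputs: one distribution over {ϵ, a, b} per time step.
probs = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.6, "a": 0.3, "b": 0.1},
    {BLANK: 0.2, "a": 0.1, "b": 0.7},
]

def alignment_probability(alignment, probs):
    """Multiply the probability of the chosen symbol at each time step."""
    return math.prod(frame[symbol] for frame, symbol in zip(probs, alignment))

# Alignment a, ϵ, b: 0.8 * 0.6 * 0.7
print(round(alignment_probability(["a", BLANK, "b"], probs), 3))  # 0.336
```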
CTC can perform inference, and uses two search mechanisms. These are?
The greedy search and the beam search.
What is the ‘Greedy Search’?
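It takes the most likely output at each time step. This gives the single alignment with the highest probability, which is not necessarily the most probable output, because several alignments can map to the same output.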
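A minimal sketch of greedy decoding, assuming the network outputs are available as one dictionary of symbol probabilities per time step (the values below are invented) and that the best path is then collapsed by merging repeats and dropping blanks:

```python
from itertools import groupby

BLANK = "ϵ"

def greedy_decode(probs, blank=BLANK):
    """Pick the most likely symbol per time step, then collapse the path."""
    best_path = [max(frame, key=frame.get) for frame in probs]
    merged = [symbol for symbol, _ in groupby(best_path)]
    return "".join(symbol for symbol in merged if symbol != blank)

# Hypothetical outputs for three time steps over the vocabulary {ϵ, a, b}.
probs = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.6, "a": 0.3, "b": 0.1},
    {BLANK: 0.2, "a": 0.1, "b": 0.7},
]
print(greedy_decode(probs))  # ab
```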
What is the ‘Beam Search’?
At each input step it extends every current hypothesis with every possible output symbol, then keeps only the top-scoring candidates (the beam) for the next step.
In CTC, the network outputs at different time steps are conditionally dependent. True or false?
False. They are conditionally independent given the input.
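One way to sketch this is the prefix variant of beam search, which merges alignments that collapse to the same output while it searches. This is an illustrative version under that assumption, not necessarily the exact variant meant on this card; the function name, the beam width and the two-frame example are made up.

```python
from collections import defaultdict

BLANK = "ϵ"

def ctc_beam_search(probs, beam_width=3, blank=BLANK):
    """Prefix beam search over per-time-step symbol probabilities."""
    # prefix -> (prob. of alignments ending in blank, prob. ending in non-blank)
    beams = {"": (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            for symbol, p in frame.items():
                if symbol == blank:
                    # A blank never changes the prefix.
                    candidates[prefix][0] += (p_blank + p_non_blank) * p
                elif prefix and symbol == prefix[-1]:
                    # Repeated symbol: a new character only if a blank came before,
                    candidates[prefix + symbol][1] += p_blank * p
                    # otherwise it is merged into the existing last character.
                    candidates[prefix][1] += p_non_blank * p
                else:
                    candidates[prefix + symbol][1] += (p_blank + p_non_blank) * p
        # Keep only the top candidates for the next time step.
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best = max(beams, key=lambda prefix: sum(beams[prefix]))
    return best, sum(beams[best])

# Two time steps where blank is the most likely symbol at each step.
probs = [{BLANK: 0.6, "a": 0.4}, {BLANK: 0.6, "a": 0.4}]
output, score = ctc_beam_search(probs)
print(output, round(score, 2))  # a 0.64
```

In this example the most likely symbol at each step is the blank, which collapses to the empty output with probability 0.36, while the alignments aϵ, ϵa and aa together give the output "a" a probability of 0.64.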
Is CTC (used alone) a real end-2-end approach?
No, because it relies on external language models to obtain good performance.