Lecture 8 - From HMMs to End-to-End Systems Flashcards
In Large Vocabulary Continuous Speech Recognition (LVCSR), how many words are in the vocabulary?
80,000 - 100,000
When designing an ASR system, some questions to ask are: is the speech constrained or natural, and is the vocabulary small or large?
Explain the difference between a small and large vocabulary
Small vocabulary
- Isolated words; each word has its own dedicated acoustic model
Large vocabulary
- Model at sub-word level
- Acoustic models for each phoneme
- Words recognised as concatenated sequences of phoneme models.
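As a rough illustration of the large-vocabulary case above, the sketch below builds word models by concatenating phoneme models. The lexicon, its pronunciations, and the names `LEXICON`, `word_model` and `phoneme_inventory` are all hypothetical and chosen only for this example; they do not come from the lecture.

```python
# Hypothetical toy pronunciation lexicon (illustrative entries only).
LEXICON = {
    "cat": ["k", "ae", "t"],
    "act": ["ae", "k", "t"],
    "tack": ["t", "ae", "k"],
}


def word_model(word, lexicon=LEXICON):
    """Return the names of the phoneme models to concatenate for `word`."""
    return [f"hmm_{phone}" for phone in lexicon[word]]


def phoneme_inventory(lexicon=LEXICON):
    """Return the sorted set of phonemes that need an acoustic model."""
    return sorted({phone for phones in lexicon.values() for phone in phones})


cat_model = word_model("cat")    # phoneme models chained for "cat"
inventory = phoneme_inventory()  # phonemes shared by all three words
```

In this toy lexicon, three words are covered by only three phoneme models, which is why sub-word modelling scales to large vocabularies where one model per word would not.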
What is the disadvantage of MFCCs?
MFCCs are not noise robust.
When HMMs are used for ASR, what role do they play?
HMMs serve as the acoustic model.
Explain the difference between small and large vocabulary when using HMMs as the acoustic model
Small vocabulary
- Word-level HMM
Large vocabulary
- Phone-level HMM (40 monophones)
- A 2-state HMM is used to model each phoneme.
- Words are built from phonemes.
A problem for HMMs in ASR: given an observation sequence, how do we compute the most likely state sequence to have produced it?
What is the solution to this?
Using the Viterbi algorithm.
The Viterbi algorithm defines the best score along a single path, at time t, that accounts for the first t observations and ends in state Si.
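The definition above can be turned into a short dynamic-programming routine. The following is a minimal sketch, not the lecture's implementation: the two-state model, its probabilities, and the observation symbols "a" and "b" are made-up values used only to show the recursion and the backtrace.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_score, best_path) for the observation sequence `obs`."""
    # delta[t][s]: best score of any single path that accounts for the
    # observations up to index t and ends in state s.
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    backptr = [{}]  # backptr[t][s]: best predecessor of state s at index t

    for t in range(1, len(obs)):
        delta.append({})
        backptr.append({})
        for s in states:
            # Pick the predecessor r that maximises the score of reaching s.
            prev, score = max(
                ((r, delta[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            delta[t][s] = score * emit_p[s][obs[t]]
            backptr[t][s] = prev

    # Backtrace from the best final state to recover the state sequence.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return delta[-1][last], path


# Hypothetical toy model (all numbers are illustrative assumptions).
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

score, path = viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p)
print(path)  # ['S1', 'S2', 'S2']
```

Practical systems usually work with log probabilities and sums rather than products, to avoid numerical underflow on long observation sequences.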
What are the challenges in an ASR system?
- Atypical speakers (e.g. children, speakers with speech impediments)
- Colloquialisms and disfluencies (um, er, coughs)
- Noise -> incorporate visual information
- Emotion and intent
- Limits of current approaches -> use of deep learning (DL)
What is the McGurk effect?
We don’t perceive speech from sound alone; speech perception is audio-visual.
The shape that the mouth makes also has an influence on the type of sound being perceived.
Audio ‘ba’ + video ‘fa’ -> perceive ‘fa’
Audio ‘ba’ + video ‘ba’ -> perceive ‘ba’