ASR Flashcards
What is Librispeech?
LibriSpeech is a large corpus of read speech sampled at 16 kHz, with ~1000 hours of audiobooks. It is split into "clean" and "other" subsets based on the speakers' WER.
What is data segmentation in this context?
What are phonemes and characters in this context?
The basic output units an ASR model can predict: phonemes (sound units) and characters (graphemes).
What are the 2 main categories and types of ASR models?
HMM-based (hybrid) models and end-to-end models.
What is MuST-C?
Introduced by Gangi et al. in MuST-C: a Multilingual Speech Translation Corpus
MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split
What is an alternative in speech recognition to the encoder decoder architecture?
CTC, the connectionist temporal classification (it is also the name of the loss function).
Do we use all the input frames? Why?
No, we usually skip frames. This is important to keep up with the speech in online dictation. We also assume the signal does not change within a frame; this is the criterion for choosing how much to skip.
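As a sketch of frame skipping, a common trick is to stack a few consecutive frames and then subsample the result (the stack/skip factors below are made up for illustration):

```python
import numpy as np

def stack_and_skip(frames, stack=3, skip=3):
    """Stack `stack` consecutive frames into one vector, then keep
    every `skip`-th stacked frame (hypothetical factors)."""
    stacked = [
        np.concatenate(frames[i:i + stack])
        for i in range(0, len(frames) - stack + 1)
    ]
    return stacked[::skip]

# 12 frames of 4-dim features -> 10 stacked frames -> keep every 3rd
frames = [np.ones(4) * i for i in range(12)]
out = stack_and_skip(frames)
print(len(out), out[0].shape)  # 4 (12,)
```

The encoder then runs over 3x fewer (but wider) frames, which is what keeps online decoding fast.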
What is the typical word error rate (WER) on different types of datasets for speech recognition?
On read speech it is ~2%.
On conversations it is between 5.8% and 11%.
It is even higher with accents, noise, etc.
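WER itself is just word-level edit distance (substitutions + insertions + deletions) divided by the reference length; a minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: edit distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)

# one deleted word out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```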
What is a disadvantage of end to end techniques?
They require more data to achieve the same performance as hybrid models. But they are usually not phoneme-based, so they are less expensive in that sense: they do not require a phonetic lexicon.
On what examples are graphemes thought to be weaker than phonemes?
Proper nouns and rare words, but grapheme models are now pretty good at those too.
What is a phoneme?
A unique, discrete unit of language that can be used to differentiate words.
You can also see it as something that, if you change it, can change the meaning of a word.
What is TIMIT?
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English, each reading 10 phonetically rich sentences. It also comes with the word and phone-level transcriptions of the speech.
Phone boundaries are hand marked.
What are the parts of an HMM-based model and what do they do?
An HMM-based model is divided into three parts: acoustic, pronunciation and language model. Each module is independent of the others and plays a different role. The acoustic model maps the speech input to a feature sequence, the pronunciation model maps phonemes (or sub-phonemes) to graphemes, and the language model maps the character sequence to a fluent final transcription.
What are typical datasets used in the team?
Accents: non-native German speakers with accents.
Apttek: colloquial phone conversations.
Multidistances: native German speakers telling their stories, recorded at different distances.
What is commonvoice?
Common Voice is an audio dataset consisting of unique MP3 files and corresponding text files. There are 9,283 recorded hours in the dataset, of which 7,335 hours are validated, across 60 languages. The dataset also includes demographic metadata like age, sex, and accent.
What is likely the harder punctuation to model?
Commas
What does LVCSR stand for?
Large Vocabulary Continuous Speech Recognition (LVCSR).
LVCSR can be divided into two categories: HMM-based model and the end-to-end model.
What are the two main deficiencies of CTC models?
- CTC cannot model interdependencies within the output sequence because it assumes that output elements are independent of each other. Therefore, CTC cannot learn the language model. The speech recognition network trained by CTC should be treated as only an acoustic model.
- CTC can only map input sequences to output sequences that are shorter than the input. Thus, it is powerless for scenarios where the output sequence is longer.
What is Switchboard?
A corpus of telephone conversations among strangers from the early '90s: 2,430 conversations, about 6 minutes each on average, roughly 240 hours in total, sampled at 8 kHz.
It has extensive linguistic labelling.
Is the FFT spectrogram output small enough?
No, it is still too big, so we apply a weighted average to shrink it: we sum the frequency bins, weighted on the Mel scale.
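A minimal sketch of such a Mel filterbank, using the standard Mel formula and triangular filters (the `n_mels`, `n_fft` and sample-rate defaults below are just illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Triangular filters, evenly spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                  # rising slope
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                  # falling slope
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

# power spectrum with n_fft//2 + 1 bins -> 40 Mel-weighted energies
spec = np.random.rand(257)
mel_energies = mel_filterbank() @ spec
print(mel_energies.shape)  # (40,)
```

Each row of the matrix is one weighted average over neighbouring FFT bins, so the 257-bin spectrum shrinks to 40 Mel energies.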
What is a strong conditional assumption that a CTC model makes (especially during inference)?
That the output at time t is independent of the outputs at all other time steps, given the input. So to get P(Y|X) you just need prod_t p(a_t|X).
With an argmax over each frame you can do (greedy) inference.
When you do this properly you have to sum over all the possible alignments that collapse to the same final utterance.
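A toy illustration of the independence assumption and greedy argmax inference (the per-frame posteriors below are made up):

```python
import numpy as np

# Toy per-frame posteriors p(a_t | X) for T=4 frames over
# the labels {blank, 'a', 'b'} (hypothetical numbers).
probs = np.array([
    [0.1, 0.8, 0.1],   # frame 0: 'a' most likely
    [0.1, 0.7, 0.2],   # frame 1: 'a'
    [0.7, 0.1, 0.2],   # frame 2: blank
    [0.1, 0.1, 0.8],   # frame 3: 'b'
])
labels = ["-", "a", "b"]

# Independence assumption: P(alignment | X) = prod_t p(a_t | X)
best = probs.argmax(axis=1)            # greedy (argmax) inference
p_best = probs.max(axis=1).prod()      # 0.8 * 0.7 * 0.7 * 0.8
alignment = [labels[i] for i in best]
print(alignment, round(p_best, 4))     # ['a', 'a', '-', 'b'] 0.3136
```

Note this scores only the single best alignment; the true P(Y|X) would sum this product over every alignment that collapses to Y.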
Is the collapsing function of CTC many to one?
Yes, different long alignments can be collapsed into the same final utterance. Indeed, you have to sum over all of them in several places, such as the loss calculation.
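The collapsing rule itself (merge repeated symbols, then drop blanks) can be sketched as:

```python
def ctc_collapse(alignment, blank="-"):
    """Collapse a frame-level alignment: merge repeats, then drop blanks."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many different alignments map to the same final utterance:
print(ctc_collapse("hh-eee-l-ll-oo"))  # hello
print(ctc_collapse("h-el-l--o-----"))  # hello
```

The blank is what lets the same symbol appear twice in a row in the output ("ll" needs a blank between the two l runs), which is why it is part of the label set.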
What is the main disadvantage of phonetic-based models?
You need a phonetic lexicon created by experts and linguists, which is very expensive and hard to scale.
What are the component of an ASR system?
Feature Extraction: It converts the speech signal into a sequence of acoustic feature vectors. These observations should be compact and carry sufficient information for recognition in the later stage.
Acoustic Model: It contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
Language Model: It contains a massive list of words and their probability of occurrence in a given sequence.
Decoder: It is a software program that takes the sounds spoken by a user and searches the acoustic Model for the equivalent sounds. When a match is made, the decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user’s speech. It then searches the language model for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.
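For the language-model component above, a minimal bigram model over a toy corpus (the sentences are illustrative data only) shows what "probability of a word in a given sequence" means:

```python
from collections import Counter

# Tiny toy corpus (made up for illustration).
corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()   # <s> marks sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(prev, word):
    """P(word | prev) by maximum likelihood (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))  # 2/3: "cat" follows "the" in 2 of 3 cases
```

A real LM adds smoothing for unseen bigrams and longer contexts (n-grams or neural LMs), but the decoder uses it exactly this way: to score candidate word sequences.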
Why is ASR important?
It allows you to be hands-free, it is a more natural way to communicate, and it improves accessibility.
What are senones?
The concept was invented for ASR.
It means grouping phones into triples: a leading part, a stable part, and a trailing part.
Note that this is why senones depend on context: the leading and trailing parts depend on the phones before and after.
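A sketch of deriving context-dependent triphone labels from a phone sequence, using the common "left-center+right" notation (the phone symbols and the `sil` padding are illustrative):

```python
def triphones(phones):
    """Context-dependent labels 'left-center+right' for each phone,
    padding the sequence with silence ('sil') at both ends."""
    padded = ["sil"] + phones + ["sil"]
    return [
        f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
        for i in range(1, len(padded) - 1)
    ]

# "cat" as /k ae t/
print(triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Senone systems then cluster these context-dependent states (e.g. with decision trees), since modeling every possible triphone separately would need far too much data.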
What do we mean by alignment in the ASR domain?
The mapping between the input audio frames and the output symbols, i.e. which frames correspond to which phoneme or character.
Why is punctuation important in the ASR system’s output?
For readability, but not only: it is also crucial for better understanding in downstream NLP.
Limitations of HMM-models
- Complex training: the training process is complex and difficult to optimize globally. HMM-based models often use different training methods and data sets to train different modules. Each module is optimized independently with its own objective function, which is generally different from the true LVCSR performance evaluation criteria, so the optimality of each module does not necessarily bring global optimality.
- Conditional independence assumptions: to simplify the model's construction and training, the HMM-based model uses conditional independence assumptions within the HMM and between different modules. This does not match the actual situation of LVCSR.
What is the intuition behind CTC?
The idea is to have an output for every input, i.e. every audio frame. Then we collapse the outputs into the actual, shorter final sentence.
What is speech commands?
Introduced by Warden in Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
What is a big issue in terms of loss function for DNN LSTM based models?
Although HMM-DNN still provides state-of-the-art results, the role played by the DNN is limited: it mainly models the posterior probability of the HMM's hidden states, while the time-domain structure is still modeled by the HMM. When attempting to model time-domain features using an RNN or CNN instead of an HMM, we face a data alignment problem: both RNN and CNN loss functions are defined at each point in the sequence, so in order to train, we need to know the alignment between the RNN output sequence and the target sequence.