ASR Flashcards
What is Librispeech?
LibriSpeech is a large corpus of read speech sampled at 16 kHz, with ~1000 hours of audiobooks. It is split into "clean" and "other" subsets based on the speakers' WER.
What is data segmentation in this context?
What are phonemes and characters in this context?
The basic output units an ASR model can predict: phonemes (sound units) and characters (graphemes).
What are the 2 main categories and types of ASR models?
HMM-based (hybrid) models and end-to-end models.
What is MuST-C?
Introduced by Gangi et al. in MuST-C: a Multilingual Speech Translation Corpus
MuST-C currently represents the largest publicly available multilingual corpus (one-to-many) for speech translation. It covers eight language directions, from English to German, Spanish, French, Italian, Dutch, Portuguese, Romanian and Russian. The corpus consists of audio, transcriptions and translations of English TED talks, and it comes with a predefined training, validation and test split
What is an alternative in speech recognition to the encoder decoder architecture?
CTC, the connectionist temporal classification (it is also the name of the loss function).
Do we use all the input frames? Why?
No, we usually skip frames. This is important to keep up with the speech in online dictation. We also assume the signal does not change within a frame; this is the criterion for choosing how much to skip.
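As a sketch of frame skipping, a common trick is to stack a few consecutive frames and then subsample the result (the stack/skip factors below are made up for illustration):

```python
import numpy as np

def stack_and_skip(frames, stack=3, skip=3):
    """Stack `stack` consecutive frames into one vector, then keep
    every `skip`-th stacked frame (hypothetical factors)."""
    stacked = [
        np.concatenate(frames[i:i + stack])
        for i in range(0, len(frames) - stack + 1)
    ]
    return stacked[::skip]

# 12 frames of 4-dim features -> 10 stacked frames -> keep every 3rd
frames = [np.ones(4) * i for i in range(12)]
out = stack_and_skip(frames)
print(len(out), out[0].shape)  # 4 (12,)
```

The encoder then runs over 3x fewer (but wider) frames, which is what keeps online decoding fast.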
What is the typical word error rate (WER) on different types of datasets for speech recognition?
On read speech it is ~2%.
On conversations it is between 5.8% and 11%.
It is even higher with accents, noise, etc.
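WER itself is just word-level edit distance (substitutions + insertions + deletions) divided by the reference length; a minimal implementation:

```python
def wer(ref, hyp):
    """Word error rate: edit distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(r)][len(h)] / len(r)

# one deleted word out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```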
What is a disadvantage of end to end techniques?
They require more data to achieve the same performance as hybrid models. But they are usually not phoneme-based, so they are less expensive in that sense: they do not require a phonetic lexicon.
On what examples are graphemes thought to be weaker than phonemes?
Proper nouns and rare words, but grapheme models are now pretty good at those too.
What is a phoneme?
A unique, discrete unit of language that can be used to differentiate words.
You can also see it as something that, if you change it, can change the meaning of a word.
What is TIMIT?
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is a standard dataset used for evaluation of automatic speech recognition systems. It consists of recordings of 630 speakers of 8 dialects of American English, each reading 10 phonetically rich sentences. It also comes with the word and phone-level transcriptions of the speech.
Phone boundaries are hand marked.
What are the parts of an HMM-based model and what do they do?
An HMM-based model is divided into three parts: acoustic, pronunciation and language model. Each module is independent of the others and plays a different role. The acoustic model maps the speech input to a feature sequence, the pronunciation model maps phonemes (or sub-phonemes) to graphemes, and the language model maps the character sequence to a fluent final transcription.
What are typical datasets used in the team?
Accents: non-native German speakers with accents.
Apttek: colloquial phone conversations.
Multidistances: native German speakers telling their stories, recorded at different distances.
What is commonvoice?
Common Voice is an audio dataset consisting of unique MP3 files and corresponding text files. There are 9,283 recorded hours in the dataset, of which 7,335 hours are validated, across 60 languages. The dataset also includes demographic metadata like age, sex, and accent.
What is likely the harder punctuation to model?
Commas
What does LVCSR stand for?
Large Vocabulary Continuous Speech Recognition (LVCSR).
LVCSR can be divided into two categories: HMM-based model and the end-to-end model.
What are the two main deficiencies of CTC models?
- CTC cannot model interdependencies within the output sequence because it assumes that output elements are independent of each other. Therefore, CTC cannot learn the language model. The speech recognition network trained by CTC should be treated as only an acoustic model.
- CTC can only map input sequences to output sequences that are shorter than the input. Thus, it is powerless for scenarios where the output sequence is longer.
What is Switchboard?
A corpus of telephone conversations among strangers from the early '90s: 2,430 conversations, about 6 minutes each on average, roughly 240 hours in total, sampled at 8 kHz.
It has extensive linguistic labelling.
Is the FFT spectrogram output small enough?
No, it is still too big, so we apply a weighted average to shrink it: we sum the frequency bins, weighted on the Mel scale.
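A minimal sketch of such a Mel filterbank, using the standard Mel formula and triangular filters (the `n_mels`, `n_fft` and sample-rate defaults below are just illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Triangular filters, evenly spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                  # rising slope
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                  # falling slope
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

# power spectrum with n_fft//2 + 1 bins -> 40 Mel-weighted energies
spec = np.random.rand(257)
mel_energies = mel_filterbank() @ spec
print(mel_energies.shape)  # (40,)
```

Each row of the matrix is one weighted average over neighbouring FFT bins, so the 257-bin spectrum shrinks to 40 Mel energies.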
What is a strong conditional assumption that a CTC model makes (especially during inference)?
That the output at time t is independent of the outputs at all other time steps, given the input. So to get P(Y|X) you just need prod_t p(a_t|X).
With an argmax over each frame you can do (greedy) inference.
When you do this properly you have to sum over all the possible alignments that collapse to the same final utterance.
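A toy illustration of the independence assumption and greedy argmax inference (the per-frame posteriors below are made up):

```python
import numpy as np

# Toy per-frame posteriors p(a_t | X) for T=4 frames over
# the labels {blank, 'a', 'b'} (hypothetical numbers).
probs = np.array([
    [0.1, 0.8, 0.1],   # frame 0: 'a' most likely
    [0.1, 0.7, 0.2],   # frame 1: 'a'
    [0.7, 0.1, 0.2],   # frame 2: blank
    [0.1, 0.1, 0.8],   # frame 3: 'b'
])
labels = ["-", "a", "b"]

# Independence assumption: P(alignment | X) = prod_t p(a_t | X)
best = probs.argmax(axis=1)            # greedy (argmax) inference
p_best = probs.max(axis=1).prod()      # 0.8 * 0.7 * 0.7 * 0.8
alignment = [labels[i] for i in best]
print(alignment, round(p_best, 4))     # ['a', 'a', '-', 'b'] 0.3136
```

Note this scores only the single best alignment; the true P(Y|X) would sum this product over every alignment that collapses to Y.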
Is the collapsing function of CTC many to one?
Yes, different long alignments can be collapsed into the same final utterance. Indeed, you have to sum over all of them in several places, such as the loss calculation.
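The collapsing rule itself (merge repeated symbols, then drop blanks) can be sketched as:

```python
def ctc_collapse(alignment, blank="-"):
    """Collapse a frame-level alignment: merge repeats, then drop blanks."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many different alignments map to the same final utterance:
print(ctc_collapse("hh-eee-l-ll-oo"))  # hello
print(ctc_collapse("h-el-l--o-----"))  # hello
```

The blank is what lets the same symbol appear twice in a row in the output ("ll" needs a blank between the two l runs), which is why it is part of the label set.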
What is the main disadvantage of phonetic-based models?
You need a phonetic lexicon created by experts and linguists, which is very expensive and hard to scale.
What are the component of an ASR system?
Feature Extraction: It converts the speech signal into a sequence of acoustic feature vectors. These observations should be compact and carry sufficient information for recognition in the later stage.
Acoustic Model: It contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar. Each distinct sound corresponds to a phoneme.
Language Model: It contains a massive list of words and their probability of occurrence in a given sequence.
Decoder: It is a software program that takes the sounds spoken by a user and searches the acoustic Model for the equivalent sounds. When a match is made, the decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user’s speech. It then searches the language model for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.
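For the language-model component above, a minimal bigram model over a toy corpus (the sentences are illustrative data only) shows what "probability of a word in a given sequence" means:

```python
from collections import Counter

# Tiny toy corpus (made up for illustration).
corpus = ["the cat sat", "the cat ran", "the dog sat"]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()   # <s> marks sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(prev, word):
    """P(word | prev) by maximum likelihood (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))  # 2/3: "cat" follows "the" in 2 of 3 cases
```

A real LM adds smoothing for unseen bigrams and longer contexts (n-grams or neural LMs), but the decoder uses it exactly this way: to score candidate word sequences.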
Why is ASR important?
It allows you to be hands-free, it is a more natural way to communicate, and it improves accessibility.
What are senones?
The concept was invented for ASR.
It means grouping phones into triples: a leading part, a stable part, and a trailing part.
Note that this is why senones depend on context: the leading and trailing parts depend on the phones before and after.
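A sketch of deriving context-dependent triphone labels from a phone sequence, using the common "left-center+right" notation (the phone symbols and the `sil` padding are illustrative):

```python
def triphones(phones):
    """Context-dependent labels 'left-center+right' for each phone,
    padding the sequence with silence ('sil') at both ends."""
    padded = ["sil"] + phones + ["sil"]
    return [
        f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
        for i in range(1, len(padded) - 1)
    ]

# "cat" as /k ae t/
print(triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Senone systems then cluster these context-dependent states (e.g. with decision trees), since modeling every possible triphone separately would need far too much data.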
What do we mean by alignment in the ASR domain?
The mapping between the input audio frames and the output symbols, i.e. which frames correspond to which phoneme or character.
Why is punctuation important in the ASR system’s output?
For readability, but not only: it is also crucial for better understanding in downstream NLP.
Limitations of HMM-models
- Complex training: the training process is complex and difficult to optimize globally. HMM-based models often use different training methods and data sets to train different modules. Each module is optimized independently with its own objective function, which is generally different from the true LVCSR performance evaluation criteria, so the optimality of each module does not necessarily bring global optimality.
- Conditional independence assumptions: to simplify the model's construction and training, the HMM-based model uses conditional independence assumptions within the HMM and between different modules. This does not match the actual situation of LVCSR.
What is the intuition behind CTC?
The idea is to have an output for every input, i.e. every audio frame. Then we collapse the outputs into the actual, shorter final sentence.
What is speech commands?
Introduced by Warden in Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition
Speech Commands is an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.
What is a big issue in terms of loss function for DNN LSTM based models?
Although HMM-DNN still provides state-of-the-art results, the role played by the DNN is limited: it mainly models the posterior probability of the HMM's hidden states, while the time-domain structure is still modeled by the HMM. When attempting to model time-domain features using an RNN or CNN instead of an HMM, we face a data alignment problem: both RNN and CNN loss functions are defined at each point in the sequence, so in order to train, we need to know the alignment between the RNN output sequence and the target sequence.