06 - Automatic Speech Recognition Flashcards

1
Q

What makes ASR difficult (name two)?

A

Vocabulary size, speaker variability, acoustic environment, speaking style, accent/dialect, and language (amount of available data, etc.)

2
Q

What two components is ASR built from (think front end and back end)?

A

Front-end feature extraction and back-end classification.

3
Q

ASR is not a sequence-to-sequence problem. True or false?

A

False. ASR maps a sequence of acoustic features to a sequence of words, so it is a sequence-to-sequence problem.

4
Q

What is the main trend in the third generation (3G) of ASR?

A

Hierarchical modelling of speech:
AM -> LM -> PM (acoustic model -> language model -> pronunciation model, using HMMs and GMMs)

5
Q

What is the main trend in the fourth generation (4G) of ASR?

A

End-to-end modelling: a direct mapping from acoustics to words/characters.

6
Q

What is the most important metric in ASR?

A

WER = Word Error Rate

7
Q

What is the equation for the Word Error Rate (WER)?

A

WER = (I + S + D) / N

where:

I = the number of insertion errors: extra words that appear in the output but not in the reference (ground truth)
S = the number of substitution errors: words in the output that replace different words in the reference
D = the number of deletion errors: reference words that are missing from the output
N = the total number of words in the reference (ground-truth) transcript
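
A minimal sketch of how WER can be computed with a word-level edit distance (the function name and example sentences are illustrative, not from the card):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])        # substitution or match
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # deletion, insertion
    return d[len(ref)][len(hyp)] / len(ref)                   # (I + S + D) / N

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.17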

8
Q

What are the two building blocks of a ‘conventional’ acoustic model in ASR?

A

The HMM (Hidden Markov Model) and the GMM (Gaussian Mixture Model).

9
Q

What are the three basic HMM problems?

A
  1. Evaluation/scoring (forward algorithm)
  2. Decoding (Viterbi algorithm)
  3. Training (Baum-Welch algorithm)

10
Q

What does the ‘Forward Algorithm’ do?

A

It computes the observation probability P(X|λ), i.e. the likelihood of the observation sequence X given the model λ, using dynamic programming in O(N²T) time (N states, T observations) instead of summing over all possible state sequences.
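
A minimal sketch of the forward recursion for a discrete-observation HMM; the toy transition matrix A, emission matrix B, initial distribution pi, and observation sequence are illustrative assumptions:

import numpy as np

def forward(A, B, pi, obs):
    # alpha[t, j] = P(obs[0..t], state at time t is j | model)
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]  # induction over time
    return alpha[-1].sum()                          # termination: P(X | model)

A = np.array([[0.7, 0.3], [0.4, 0.6]])  # state-transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])  # emission probabilities per state
pi = np.array([0.5, 0.5])               # initial state distribution
print(forward(A, B, pi, [0, 1, 0]))     # likelihood of the observation sequence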

11
Q

What is the ‘Viterbi Algorithm’ used for?

A

It is used to find the single best (most likely) state sequence for a given observation sequence.
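
A minimal sketch of Viterbi decoding for the same kind of toy HMM (again, the matrices and observation sequence are illustrative assumptions):

import numpy as np

def viterbi(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointer to the best predecessor
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] * A   # score of every i -> j transition
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]       # best final state
    for t in range(T - 1, 0, -1):          # trace backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 1, 0]))  # most likely hidden-state sequence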

12
Q

What does the ‘Baum-Welch algorithm’ do?

A

It uses the forward probabilities and the backward probabilities to estimate the state occupation probability, which is then used to re-estimate (train) the HMM parameters.
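
In the usual notation (an assumption here; the card does not spell it out), the state occupation probability for state i at time t is

gamma_t(i) = alpha_t(i) * beta_t(i) / P(X | model)

where alpha comes from the forward pass, beta from the backward pass, and P(X | model) = sum_i alpha_T(i).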

13
Q

What probability does the ‘Acoustic Model’ calculate?

A

P(X|W): the probability of an acoustic feature vector sequence X given an utterance (word sequence) W.

14
Q

What does the Language Model calculate the probability of?

A

P(W): the prior probability of a given utterance (word sequence) W.
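
For context, the standard decomposition (implied by these cards rather than stated) combines the two models via Bayes' rule, so the decoder searches for

W* = argmax over W of P(X|W) * P(W)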

15
Q

What is CFG?

A

Context-Free Grammar

16
Q

When do we use CFG instead of n-gram?

A

When the sentences or phrases are complete and non-complex; very simple, fixed-structure phrases, for example.

17
Q

When do we use n-grams instead of CFG (context free grammar)?

A

When the sentences or phrases are complex and possibly incoherent. N-grams are used when word-sequence probabilities are important, e.g. in text prediction and speech classification (see the sketch below).
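
A minimal sketch of a maximum-likelihood bigram (2-gram) model estimated by counting; the toy corpus is an illustrative assumption:

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)                  # count(w)
bigrams = Counter(zip(corpus, corpus[1:]))  # count(w1 w2)

def p_next(w1, w2):
    # maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences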

18
Q

What do we use the ‘Pronunciation model’ for?

A

To help find the optimal utterance W; the pronunciation model maps each word to its pronunciation (phone sequence).

19
Q

We cannot perform continuous speech recognition with HMMs. True or false?

A

False. We can and do.

20
Q

What does ‘LVCSR’ stand for?

A

Large Vocabulary Continuous Speech Recognition

21
Q

When we turn our attention to more modern approaches, the observation likelihood used in the ‘conventional’ methods is replaced by?

A

A (scaled) neural network posterior.
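
In one common hybrid NN/HMM formulation (standard notation, not spelled out on the card), the network posterior P(s|x) over HMM states s is divided by the state prior P(s) to give the scaled "pseudo-likelihood" that replaces the GMM observation likelihood:

p(x | s) ≈ P(s | x) / P(s)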

22
Q

Some disadvantages of using neural networks until 2012 were?

A

They were computationally expensive (they still are, but more so back then), and the network architectures were less complex than today's.

23
Q

What progress has been made in Automatic Speech Recognition since 2012?

A

We have deeper and wider networks, and our computers and GPUs allow much faster training.

24
Q

Time-delay Neural Networks (TDNNs) model richer context. True or false?

A

True. TDNNs see a wider temporal context of the input than a standard feed-forward network.

25

Q

Since GMM/HMM was proposed, the approaches became more?

A

Statistical.

26

Q

Hierarchical (3G) systems are often split into 3 (sometimes more) concepts. These are types of models; commonly what kinds of models?

A

An acoustic model, a language model (n-grams), and a pronunciation model. Sometimes also a decoder.

27

Q

We have seen a huge improvement in Large Vocabulary Continuous Speech Recognition (LVCSR) since 2012 due to what?

A

The impact of deep learning.

28

Q

Newer deep-learning ASR systems replace the acoustic model of a hierarchical/statistical system with what?

A

A deep neural network. This gives better accuracy and context modelling.