06 - Automatic Speech Recognition Flashcards

1
Q

What makes ASR difficult (name two)?

A

Vocabulary size, speaker variability, acoustic environment, speaking style, accent/dialect, and language (amount of available training data, etc.).

2
Q

What two components is an ASR system built from (think front end and back end)?

A

A front end that performs feature extraction and a back end that performs classification.

3
Q

ASR is not a sequence-to-sequence problem. True or false?

A

False. ASR maps a sequence of acoustic features to a sequence of words/characters, so it is a sequence-to-sequence problem.

4
Q

What is the main trend in the third generation (3G) of ASR?

A

Hierarchical modelling of speech: an acoustic model (AM), a language model (LM) and a pronunciation model (PM), with the AM conventionally built from HMMs and GMMs.

5
Q

What is the main trend in the fourth generation (4G) of ASR?

A

End-to-end: a direct mapping from acoustics to words/characters.

6
Q

What is the most important metric in ASR?

A

WER = Word Error Rate

7
Q

What is the equation for the Word Error Rate (WER)?

A

WER = (I + S + D) / N

where:
I = number of insertions: extra words in the output that do not appear in the reference (ground truth)
S = number of substitutions: reference words replaced by different words in the output
D = number of deletions: reference words missing from the output
N = total number of words in the reference transcript
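
As a worked example, here is a minimal Python sketch that computes WER via word-level Levenshtein alignment (the function name and the test sentences are illustrative, not from the course material):

    def wer(reference, hypothesis):
        """Word Error Rate via word-level edit distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edits turning the first i reference words
        # into the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                              d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1)        # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.33

Note that the alignment finds the three error types jointly, and that WER can exceed 1.0 when the output contains many insertions.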

8
Q

What are the two building blocks of the ‘conventional’ acoustic model in ASR?

A

The HMM (Hidden Markov Model) and the GMM (Gaussian Mixture Model).

9
Q

What are the three basic HMM problems?

A
  1. Evaluation/scoring (Forward algorithm)
  2. Decoding (Viterbi algorithm)
  3. Training (Baum-Welch algorithm)

10
Q

What does the ‘Forward Algorithm’ do?

A

It computes the observation likelihood P(X|θ) by dynamic programming in O(N²T) time (for N states and T observation frames), instead of the exponential cost of summing over every possible state sequence.
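
A minimal sketch in Python/NumPy, assuming a discrete-observation HMM with transition matrix A, emission matrix B and initial distribution pi (all variable names are illustrative):

    import numpy as np

    def forward(obs, A, B, pi):
        """P(obs | model) for a discrete HMM, in O(N^2 T) time.
        A:  (N, N) transitions, A[i, j] = P(state j | state i)
        B:  (N, M) emissions,   B[i, k] = P(symbol k | state i)
        pi: (N,)   initial state distribution
        """
        T, N = len(obs), A.shape[0]
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                      # initialisation
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
        return alpha[-1].sum()                            # termination: P(X | theta)

    # Example: 2-state toy model
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    pi = np.array([0.6, 0.4])
    print(forward([0, 1, 0], A, B, pi))

(In practice log-probabilities or per-frame scaling are used to avoid numerical underflow on long sequences.)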

11
Q

What is the ‘Viterbi Algorithm’ used for?

A

It is used to find the single best (most probable) state sequence for a given observation sequence.
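
A minimal Python/NumPy sketch, using the same illustrative HMM conventions as the forward-algorithm example above; it differs from the forward pass only in replacing the sum over predecessor states with a max plus a backpointer:

    import numpy as np

    def viterbi(obs, A, B, pi):
        """Most probable state sequence for a discrete HMM."""
        T, N = len(obs), A.shape[0]
        delta = np.zeros((T, N))            # best path score ending in each state
        psi = np.zeros((T, N), dtype=int)   # backpointers
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            trans = delta[t - 1][:, None] * A     # score of every i -> j move
            psi[t] = trans.argmax(axis=0)         # best predecessor per state
            delta[t] = trans.max(axis=0) * B[:, obs[t]]
        path = [int(delta[-1].argmax())]          # backtrack from best end state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t][path[-1]]))
        return path[::-1]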

12
Q

What does the ‘Baum-Welch algorithm’ do?

A

It makes use of the forward probabilities and the backward probabilities to estimate the state occupation probabilities, from which the HMM parameters are re-estimated.
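
In standard HMM notation (conventional, not spelled out on the card), the state occupation probability at time t is

    γ_t(i) = α_t(i) · β_t(i) / Σ_j α_t(j) · β_t(j)

where the denominator equals P(X|θ); these occupation counts drive the re-estimation of the transition and emission parameters (an instance of Expectation-Maximisation).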

13
Q

What probability does the ‘Acoustic Model’ calculate?

A

P(X|W), i.e. the probability of the feature-vector sequence X given an utterance (word sequence) W.

14
Q

What does the language model calculate the probability of?

A

P(W), that is, the prior probability of a given utterance (word sequence) W.
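
Together the two models give the standard decoding rule of statistical ASR (Bayes' rule; P(X) is constant over W and drops out):

    W* = argmax_W P(W|X) = argmax_W P(X|W) · P(W)

i.e. the recogniser picks the word sequence that best explains the acoustics (acoustic model) while also being a plausible sentence (language model).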

15
Q

What is CFG?

A

Context Free Grammar

16
Q

When do we use CFG instead of n-gram?

A

When the sentences or phrases are well-formed and non-complex, for example very simple, constrained phrases.

17
Q

When do we use n-grams instead of CFG (context free grammar)?

A

When there are complex and possibly incoherent sentences or phrases. N-grams are used when word-sequence probabilities are important, e.g. in text prediction and speech classification.
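
For reference, a bigram (n = 2) model factorises the utterance probability as (standard formulation, not spelled out on the card):

    P(w_1, ..., w_n) ≈ Π_i P(w_i | w_(i-1))

so, for example, P("the cat sat") ≈ P(the | <s>) · P(cat | the) · P(sat | cat), where <s> marks the sentence start.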

18
Q

What do we use the ‘Pronunciation model’ for?

A

To help find the optimal utterance W: the pronunciation model maps words to their phone sequences, linking the acoustic model to the language model.

19
Q

We cannot perform continuous speech recognition with HMMs. True or false?

A

False. We can and do.

20
Q

What does ‘LVCSR’ stand for?

A

Large Vocabulary Continuous Speech Recognition

21
Q

When we turn our attention to more modern approaches, the observation likelihood used in the ‘conventional’ methods is replaced by?

A

A (scaled) neural-network posterior.
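
Concretely, in the hybrid NN/HMM setup the network's output P(s|x_t) for state s given frame x_t is turned into a "scaled likelihood" by dividing by the state prior (Bayes' rule, with p(x_t) dropped as constant over states):

    p(x_t | s) ∝ P(s | x_t) / P(s)

which is why the posterior is described as "scaled".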

22
Q

Some disadvantages of using neural networks until 2012 were?

A

They were computationally expensive (they still are, but even more so back then), and the neural-network systems of the time were less complex.

23
Q

What are the differences (progress) in Automatic Speech Recognition after 2012?

A

We have deeper and wider networks, and modern computers and GPUs allow for much faster training.

24
Q

Time-delay Neural Networks (TDNNs) model richer context. True or false?

A

True.

25
Q

Since GMM/HMM was proposed, the approaches have become more?

A

Statistical.

26
Q

Hierarchical (3G) systems are often split into three (sometimes more) components. These are commonly what kinds of models?

A

An acoustic model, a language model (n-grams) and a pronunciation model; sometimes a decoder is counted as well.

27
Q

We see a huge improvement in Large Vocabulary Continuous Speech Recognition (LVCSR) since 2012 due to what?

A

The impact of deep learning.

28
Q

Newer deep learning ASR systems replace the acoustic model of a hierarchical/statistical system by what?

A

A deep neural network. This gives better accuracy and context modelling.