07 - End-2-End Speech Recognition Flashcards

Question 1

Q

Let us start with a reminder. What is an end-2-end system or model?

Answer

A

A system or model that takes an input and directly produces the output, without relying on intermediate stages or processes.

Question 2

Q

HMM-based systems are normally split into 3 separate models. Which are they?

Answer

A

The Acoustic Model (AM), the Language Model (LM) and the Pronunciation Model (PM).

Question 3

Q

HMM is quick and efficient and does not require expert knowledge. True or false?

Answer

A

False. HMMs require expert knowledge and are very time-consuming to build because of the 3 models that underlie the structure.

Question 4

Q

For the acoustic model (that is, the model of the sound signal input), what are the 3 main issues?

Answer

A

The input is of variable length.
The input is often much larger than the output.
We do not know how the input audio features align with the output characters.

Question 5

Q

What are the two main architectures used to solve the ‘alignment problem’ in end-2-end models?

Answer

A

CTC (Connectionist Temporal Classification) and ‘seq2seq-attention’.

Question 6

Q

The CTC was proposed in 2006 as a model that can train an acoustic model without requiring segmentation and alignment. True or false?

Question 7

Q

An original approach in speech recognition outputted phonemes and NOT words. Is it still an end-2-end model when the output is phonemes?

Answer

A

No. It is not the complete process: you would have to convert the phonemes into words afterwards, thus relying on an intermediate process.

Question 8

Q

Because CTC merges repeated outputs, it cannot on its own represent consecutive character repetitions, for example the two l’s in the word ‘hello’. What special token does CTC introduce to handle this?

Answer

A

The blank (ϵ) token.

Question 9

Q

The CTC mapping makes use of the blank token (ϵ) to handle what?

Answer

A

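An instance where a word has two consecutive identical letters. The input and output can have different lengths, so one character may span several input frames; the blank token separates a genuine repetition from the same character carried over several frames.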
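The CTC mapping can be sketched in a few lines: merge consecutive repeats first, then drop the blank tokens. This is a minimal illustration, not code from the course; the function name `ctc_collapse` and the example alignments are assumptions made for the sketch.

```python
from itertools import groupby

BLANK = "ϵ"  # blank token, written as in the flashcards


def ctc_collapse(alignment):
    """Map a frame-level alignment to its output string (hypothetical helper)."""
    # Step 1: merge consecutive repeated symbols.
    merged = [symbol for symbol, _ in groupby(alignment)]
    # Step 2: remove the blank tokens.
    return "".join(symbol for symbol in merged if symbol != BLANK)


# With a blank between the two l's, the repetition survives the merge.
print(ctc_collapse("hheϵllϵlloo"))  # hello
# Without the blank, the two l's are merged into one.
print(ctc_collapse("hheelllloo"))   # helo
```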

Question 10

Q

The probability of an alignment in the CTC is what?

Answer

A

The product of the output probabilities at each time step.
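As a small worked example, the probability of one alignment is obtained by multiplying, frame by frame, the probability the network assigns to the symbol chosen at that time step. The per-frame distributions below are made-up numbers and `alignment_probability` is a hypothetical helper name.

```python
import math


def alignment_probability(probs, alignment):
    """probs[t][symbol] is the network output for symbol at time step t."""
    return math.prod(probs[t][symbol] for t, symbol in enumerate(alignment))


# Made-up output distributions for 3 time steps over the symbols {a, ϵ}.
probs = [
    {"a": 0.6, "ϵ": 0.4},
    {"a": 0.5, "ϵ": 0.5},
    {"a": 0.1, "ϵ": 0.9},
]

# Alignment "a ϵ ϵ": 0.6 * 0.5 * 0.9
p = alignment_probability(probs, ["a", "ϵ", "ϵ"])
print(round(p, 4))  # 0.27
```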

Question 11

Q

CTC can perform inference, and uses two search mechanisms. These are?

Answer

A

The greedy search and the beam search.

Question 12

Q

What is the ‘Greedy Search’?

Answer

A

It takes the most likely output at each time step, which gives us the alignment with the highest probability.
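A minimal sketch of greedy decoding for CTC, assuming the network outputs are given as one dictionary of symbol probabilities per time step (the data layout, the numbers and the name `greedy_decode` are assumptions for illustration). The best symbol of each frame is picked, then the usual CTC mapping is applied.

```python
from itertools import groupby


def greedy_decode(probs, blank="ϵ"):
    # Pick the most likely symbol independently at every time step.
    best_path = [max(frame, key=frame.get) for frame in probs]
    # Apply the CTC mapping: merge repeats, then drop blanks.
    merged = [symbol for symbol, _ in groupby(best_path)]
    return "".join(symbol for symbol in merged if symbol != blank)


# Made-up outputs for 4 time steps.
frames = [
    {"h": 0.7, "i": 0.1, "ϵ": 0.2},
    {"h": 0.6, "i": 0.1, "ϵ": 0.3},
    {"h": 0.1, "i": 0.2, "ϵ": 0.7},
    {"h": 0.1, "i": 0.8, "ϵ": 0.1},
]
print(greedy_decode(frames))  # hi
```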

Question 13

Q

What is the ‘Beam Search’?

Answer

A

It computes a new set of hypotheses at each input step with all possible combinations, but keeps only the top candidates.
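One common way to realise this idea for CTC is prefix beam search, sketched below. It is only one possible variant and not necessarily the one the flashcards have in mind; the function name, data layout and numbers are assumptions. Each hypothesis is an output prefix, and its score is split into the probability of alignments ending in a blank and those ending in a non-blank, so that repeated characters are handled correctly.

```python
from collections import defaultdict


def ctc_beam_search(probs, beam_width=3, blank="ϵ"):
    # prefix -> (prob. of alignments ending in blank, ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for symbol, p in frame.items():
                if symbol == blank:
                    # A blank keeps the prefix unchanged.
                    candidates[prefix][0] += (p_b + p_nb) * p
                elif prefix and symbol == prefix[-1]:
                    # Repeat without a blank in between: merged into the prefix.
                    candidates[prefix][1] += p_nb * p
                    # Repeat after a blank: a genuinely new character.
                    candidates[prefix + (symbol,)][1] += p_b * p
                else:
                    candidates[prefix + (symbol,)][1] += (p_b + p_nb) * p
        # Keep only the top candidates.
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best_prefix = max(beams.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(best_prefix)


# Made-up outputs: the single best alignment is "ϵϵ" (0.36), but the output
# "a" collects 0.64 over its three alignments (aa, aϵ, ϵa).
frames = [{"a": 0.4, "ϵ": 0.6}, {"a": 0.4, "ϵ": 0.6}]
print(ctc_beam_search(frames))  # a
```

In this toy example the result differs from picking the best symbol per frame, which would return the empty string.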

Question 14

Q

All network outputs of the CTC are conditionally dependent. True or false?

Answer

A

False. They are conditionally independent.

Question 15

Q

Is CTC (used alone) a real end-2-end approach?

Answer

A

No, because it relies on external language models to obtain good performance.

Question 16

Q

The ‘seq-2-seq-attention’ architecture is made of 3 blocks. These are?

Answer

A

An encoder
An attention mechanism
A decoder

Question 17

Q

In ‘seq-2-seq-attention’ what is the role of the encoder?

Answer

A

The encoder is analogous to an Acoustic Model (AM) and transforms speech features into a higher level representation.

Question 18

Q

In ‘seq-2-seq-attention’ what is the role of the attention mechanism?

Answer

A

It is analogous to an alignment model: it chooses the encoded frames that are relevant for producing each output.
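A minimal sketch of how an attention mechanism can choose relevant frames, assuming simple dot-product scoring between the current decoder state and each encoded frame. The flashcards do not say which scoring function is used, so the scoring choice, the name `attend` and the toy vectors are all assumptions.

```python
import math


def attend(decoder_state, encoded_frames):
    # Score each encoded frame against the decoder state (dot product).
    scores = [sum(h * s for h, s in zip(frame, decoder_state))
              for frame in encoded_frames]
    # Softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of the encoded frames.
    dim = len(encoded_frames[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, encoded_frames))
               for d in range(dim)]
    return weights, context


# Toy example: 3 encoded frames of dimension 2.
encoded = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, context = attend([0.0, 5.0], encoded)
# The second frame matches the decoder state best, so it gets the largest weight.
print(weights.index(max(weights)))  # 1
```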

Question 19

Q

In ‘seq-2-seq-attention’ what is the role of the decoder?

Answer

A

It is analogous to a Language Model (LM) that predicts each token as a function of the previous predictions and outputs it (after having applied softmax).

Question 20

Q

Dimensionality can often be a problem in the encoder part of ‘seq-2-seq-attention’. What type of neural network is commonly used in the first layers?

Answer

A

A Convolutional Neural Network (CNN). After that you might apply BLSTMs (bi-directional LSTMs) to reduce the encoded features.

Question 21

Q

The first example of a ‘seq-2-seq-attention’ model was Listen, Attend and Spell. Its outputs were characters and not words. Was it still an end-2-end model if the desired output was characters?

Answer

A

Yes, but if the desired output was words, then no.

Question 22

Q

In ‘seq-2-seq-attention’ a problem is that it is easily affected by noise. True or false?

Question 23

Q

In ‘seq-2-seq-attention’ it is better to train with longer input sequences first and then shorter input sequences. True or false?

Answer

A

False. You want to train with the short input sequences first.

Question 24

Q

An encoder-decoder approach can be ‘streamed’. True or false?

Answer

A

False. The encoder needs to process the complete input before the decoder can start working.

Question 25

Q

In CTC, seq2seq and the hybrid CTC/Attention, which is more common: the ‘beam search’ or the ‘greedy search’?

Answer

A

The beam search.

Question 26

Q

Why is ‘hybrid CTC/attention decoding’ more robust?

Answer

A

It is better at finding appropriate alignments in noisy environments, and it uses joint decoding during recognition.
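Joint decoding is usually described as a weighted combination of the CTC and attention log-probabilities of each hypothesis. The sketch below assumes that form; the weight of 0.3, the name `joint_score` and the hypothesis probabilities are made-up values for illustration only.

```python
import math


def joint_score(p_ctc, p_att, ctc_weight=0.3):
    # Weighted sum of log-probabilities (the weight value is an assumption).
    return ctc_weight * math.log(p_ctc) + (1 - ctc_weight) * math.log(p_att)


# Made-up hypotheses: (CTC probability, attention probability).
hypotheses = {
    "hello": (0.60, 0.30),
    "hallo": (0.05, 0.40),
}

best = max(hypotheses, key=lambda h: joint_score(*hypotheses[h]))
print(best)  # hello
```

Here the attention score alone would prefer ‘hallo’, but the CTC score penalises it, so the joint score selects ‘hello’.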

Question 27

Q

What is the ESPnet Toolkit used for?

Answer

A

It is dedicated to end-2-end speech processing!

Question 28

Q

What can the RNN-T do that the CTC and the other approaches cannot?

Answer

A

It can stream due to its use of a joint network.

Question 29

Q

Summary question: End-2-end is the current trend in the ASR field. True or false?

Question 30

Q

End-2-end systems are better when there are limited resources compared to the ‘conventional’ HMM-DNN systems. True or false?