06 - Automatic Speech Recognition Flashcards

1
Q

What makes ASR difficult (name two)?

A

Vocabulary size, speaker variability, acoustic environment, speaking style, accent/dialect, and language (amount of available data, etc.)

2
Q

What two components is ASR built from (think front end and back end)?

A

Front-end feature extraction and back-end classification.

3
Q

ASR is not a sequence-to-sequence problem. True or false?

A

False. ASR maps a sequence of acoustic features to a sequence of words, so it is a sequence-to-sequence problem.

4
Q

What is the main trend in the third generation (3G) of ASR?

A

Hierarchical modelling of speech:
AM -> LM -> PM (acoustic model -> language model -> pronunciation model, using HMMs and GMMs)

5
Q

What is the main trend in the fourth generation (4G) of ASR?

A

End-to-end modelling: a direct mapping from acoustics to words/characters.

6
Q

What is the most important metric in ASR?

A

WER = Word Error Rate

7
Q

What is the equation for the Word Error Rate (WER)?

A

WER = (I + S + D) / N

where:

I = the number of insertion errors: extra words that appear in the output but not in the reference (ground truth)
S = the number of substitution errors: words in the output that replace different words in the reference
D = the number of deletion errors: reference words that are missing from the output
N = the total number of words in the reference (ground-truth) transcript
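
A minimal sketch of how WER can be computed with a word-level edit distance (the function name and example sentences are illustrative, not from the card):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])        # substitution or match
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # deletion, insertion
    return d[len(ref)][len(hyp)] / len(ref)                   # (I + S + D) / N

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.17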

8
Q

What are the two building blocks of a ‘conventional’ acoustic model in ASR?

A

The HMM (Hidden Markov Model) and the GMM (Gaussian Mixture Model).

9
Q

What are the three basic HMM problems?

A
  1. Evaluation/scoring (forward algorithm)
  2. Decoding (Viterbi algorithm)
  3. Training (Baum-Welch algorithm)

10
Q

What does the ‘Forward Algorithm’ do?

A

It computes the observation probability P(X|λ), i.e. the likelihood of the observation sequence X given the model λ, using dynamic programming in O(N²T) time (N states, T observations) instead of summing over all possible state sequences.
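
A minimal sketch of the forward recursion for a discrete-observation HMM; the toy transition matrix A, emission matrix B, initial distribution pi, and observation sequence are illustrative assumptions:

import numpy as np

def forward(A, B, pi, obs):
    # alpha[t, j] = P(obs[0..t], state at time t is j | model)
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]  # induction over time
    return alpha[-1].sum()                          # termination: P(X | model)

A = np.array([[0.7, 0.3], [0.4, 0.6]])  # state-transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])  # emission probabilities per state
pi = np.array([0.5, 0.5])               # initial state distribution
print(forward(A, B, pi, [0, 1, 0]))     # likelihood of the observation sequence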

11
Q

What is the ‘Viterbi Algorithm’ used for?

A

It is used to find the single best (most likely) state sequence for a given observation sequence.
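
A minimal sketch of Viterbi decoding for the same kind of toy HMM (again, the matrices and observation sequence are illustrative assumptions):

import numpy as np

def viterbi(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointer to the best predecessor
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] * A   # score of every i -> j transition
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]       # best final state
    for t in range(T - 1, 0, -1):          # trace backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 1, 0]))  # most likely hidden-state sequence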

12
Q

What does the ‘Baum-Welch algorithm’ do?

A

It uses the forward probabilities and the backward probabilities to estimate the state occupation probability, which is then used to re-estimate (train) the HMM parameters.
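
In the usual notation (an assumption here; the card does not spell it out), the state occupation probability for state i at time t is

gamma_t(i) = alpha_t(i) * beta_t(i) / P(X | model)

where alpha comes from the forward pass, beta from the backward pass, and P(X | model) = sum_i alpha_T(i).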

13
Q

What probability does the ‘Acoustic Model’ calculate?

A

P(X|W): the probability of an acoustic feature vector sequence X given an utterance (word sequence) W.

14
Q

What does the Language Model calculate the probability of?

A

P(W): the prior probability of a given utterance (word sequence) W.
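
For context, the standard decomposition (implied by these cards rather than stated) combines the two models via Bayes' rule, so the decoder searches for

W* = argmax over W of P(X|W) * P(W)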

15
Q

What is CFG?

A

Context-Free Grammar

16
Q

When do we use CFG instead of n-gram?

A

When the sentences or phrases are complete and non-complex; very simple, fixed-structure phrases, for example.

17
Q

When do we use n-grams instead of CFG (context free grammar)?

A

When the sentences or phrases are complex and possibly incoherent. N-grams are used when word-sequence probabilities are important, e.g. in text prediction and speech classification (see the sketch below).
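
A minimal sketch of a maximum-likelihood bigram (2-gram) model estimated by counting; the toy corpus is an illustrative assumption:

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)                  # count(w)
bigrams = Counter(zip(corpus, corpus[1:]))  # count(w1 w2)

def p_next(w1, w2):
    # maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences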

18
Q

What do we use the ‘Pronunciation model’ for?

A

To help find the optimal utterance W; the pronunciation model maps each word to its pronunciation (phone sequence).

19
Q

We cannot perform continuous speech recognition with HMMs. True or false?

A

False. We can and do.

20
Q

What does ‘LVCSR’ stand for?

A

Large Vocabulary Continuous Speech Recognition

21
Q

When we turn our attention to more modern approaches, the observation likelihood used in the ‘conventional’ methods is replaced by?

A

A (scaled) neural network posterior.
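
In one common hybrid NN/HMM formulation (standard notation, not spelled out on the card), the network posterior P(s|x) over HMM states s is divided by the state prior P(s) to give the scaled "pseudo-likelihood" that replaces the GMM observation likelihood:

p(x | s) ≈ P(s | x) / P(s)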

22
Q

Some disadvantages of using neural networks until 2012 were?

A

They were computationally expensive (they still are, but more so back then), and the network architectures were less complex than today's.

23
Q

What progress has been made in Automatic Speech Recognition since 2012?

A

We have deeper and wider networks, and our computers and GPUs allow much faster training.

24
Q

Time-delay Neural Networks (TDNNs) model richer context. True or false?

A

True. TDNNs see a wider temporal context of the input than a standard feed-forward network.

25

Q

Since GMM/HMM was proposed, the approaches became more?

A

Statistical.

26

Q

Hierarchical (3G) systems are often split into 3 (sometimes more) concepts. These are types of models; commonly what kinds of models?

A

An acoustic model, a language model (n-grams), and a pronunciation model. Sometimes also a decoder.

27

Q

We have seen a huge improvement in Large Vocabulary Continuous Speech Recognition (LVCSR) since 2012 due to what?

A

The impact of deep learning.

28

Q

Newer deep-learning ASR systems replace the acoustic model of a hierarchical/statistical system with what?

A

A deep neural network. This gives better accuracy and context modelling.