08 - Speech Synthesis Flashcards

1
Q

Articulatory Speech Synthesis is what in essence?

A

Articulatory synthesis is a technique used to generate speech by replicating the movements of human articulators such as lips, tongue, glottis, and the vocal tract.

2
Q

Examples of ‘Articulatory Speech Synthesis’ are what?

A

The Von Kempelen speaking machine (late 1700s) and the Voder from 1939.

3
Q

What is ‘Formant Speech Synthesis’?

A

The formant synthesis methodology follows the simplified source-filter model. The rules to control the model are typically created by linguists in an effort to closely replicate the evolution of the formant structure and other spectral features of normal speech.

4
Q

What is ‘Concatenative Speech Synthesis’?

A

It is an approach that pieces together recorded speech units from a database, choosing the units that minimize the target (selection) and concatenation costs.
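
The cost minimization can be sketched as a Viterbi search over candidate units. This is a minimal illustration, not a production unit-selection system: the toy units (a phone label plus a pitch value) and the two cost functions `target_cost` and `join_cost` are assumptions made up for the example.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (best_units, total_cost) using a Viterbi search.

    candidates[i] is the list of database units that could realise targets[i].
    """
    # best[i][k] = (cost of the cheapest path ending in candidates[i][k], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            # Cheapest predecessor, including the cost of joining it to this unit.
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)

    # Backtrack from the cheapest final unit.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    path.reverse()
    return path, total


# Hypothetical toy data: (phone, pitch in Hz).
targets = [("h", 100), ("e", 110), ("y", 120)]
candidates = [
    [("h", 90), ("h", 130)],
    [("e", 95), ("e", 125)],
    [("y", 100), ("y", 140)],
]


def target_cost(target, unit):
    # How far the unit's pitch is from the pitch we want.
    return abs(target[1] - unit[1])


def join_cost(left, right):
    # How audible the pitch jump at the join would be.
    return abs(left[1] - right[1])


units, cost = select_units(targets, candidates, target_cost, join_cost)
print(units, cost)
```

Real systems use far richer costs (spectral mismatch at the join, phonetic context, duration), but the search has the same shape.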

5
Q

‘Concatenative Speech Synthesis’ requires many recordings from many speakers. True or false?

A

False. It requires many recordings from a single speaker.

6
Q

‘Concatenative Speech Synthesis’ is very intelligible, but does not sound natural and lacks emotional expressiveness. True or false?

A

True

7
Q

‘Statistical Parametric Speech Synthesis’ (SPSS) is what?

A

The basic idea is to generate acoustic parameters first (traditionally with an HMM), and then recover speech from these parameters with a vocoder.

8
Q

Why is Statistical Parametric Speech Synthesis (SPSS) somewhat more effective than Concatenative Speech Synthesis (CSS)?

A

It requires less data and is more flexible.

9
Q

Of the speech synthesis models (ignoring neural speech synthesis) which is the most intelligible?

A

Concatenative Speech Synthesis.

10
Q

Neural speech synthesis is similar to SPSS, but it replaces the HMM and the vocoder components with what?

A

With deep neural network (DNN) components.

11
Q

Can neural speech synthesis become an end-to-end system?

A

Yes

12
Q

In text-to-speech (TTS), what are the two main types of features?

A

Acoustic features and linguistic features.

13
Q

What is the TTS pipeline in simple terms?

A
  1. Extract features (acoustic & linguistic).
  2. Learn the mapping from linguistic to acoustic features.
  3. Predict the features (acoustic & linguistic).
  4. Synthesize the waveform (sound signal/speech).
14
Q

What are ‘Non-standard words’?

A

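These are tokens that cannot be read out as written and must be expanded into ordinary words during text normalization, such as:
13th (ordinal)
13 (number)
IV (Roman numeral)
% (percentage)
12:10 (time)
+- (symbols)
Av. / Ltd. (abbreviations)
NY / CPH (acronyms)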
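
A rough sketch of how a few of these token types could be expanded is shown below. It is deliberately tiny: the abbreviation table, the function names, and the choice to handle only numbers up to 99 are assumptions for illustration, and ordinals, Roman numerals and symbols are left out.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical expansion table; a real front-end needs context to pick expansions.
ABBREVIATIONS = {"Ltd.": "limited", "Av.": "avenue", "NY": "New York"}


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def time_to_words(hours, minutes):
    """Read a clock time such as 12:10 as 'twelve ten'."""
    if minutes == 0:
        return number_to_words(hours) + " o'clock"
    if minutes < 10:
        return number_to_words(hours) + " oh " + number_to_words(minutes)
    return number_to_words(hours) + " " + number_to_words(minutes)


def normalize(text):
    """Expand the non-standard words this sketch knows about, token by token."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,2}:\d{2}", token):
            hours, minutes = token.split(":")
            out.append(time_to_words(int(hours), int(minutes)))
        elif re.fullmatch(r"\d{1,2}%", token):
            out.append(number_to_words(int(token[:-1])) + " percent")
        elif re.fullmatch(r"\d{1,2}", token):
            out.append(number_to_words(int(token)))
        else:
            out.append(token)  # already a standard word
    return " ".join(out)


print(normalize("Meet at 12:10 on 5 Av. with 13% off"))
```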

15
Q

What is the role of Grapheme-to-Phoneme (G2P)?

A

It maps graphemes (letters or groups of letters) to their corresponding phonemes (speech sounds).
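
One simple way to picture G2P is a pronunciation lexicon with letter-to-sound rules as a fallback. The sketch below assumes a made-up three-word lexicon and a handful of naive rules with ARPAbet-style symbols; modern systems typically learn this mapping with a sequence model instead.

```python
# Hypothetical toy lexicon: word -> phoneme sequence.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "phone": ["F", "OW", "N"],
    "cat": ["K", "AE", "T"],
}

# Naive fallback rules, multi-letter graphemes first so they win over single letters.
RULES = [
    ("ch", "CH"), ("sh", "SH"), ("ph", "F"), ("th", "TH"), ("ee", "IY"),
    ("a", "AE"), ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH"),
    ("p", "P"), ("t", "T"), ("s", "S"), ("n", "N"), ("k", "K"),
]


def g2p(word):
    """Map a written word to a list of phoneme symbols."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes, i = [], 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule covers this letter in the sketch, so skip it
    return phonemes


print(g2p("speech"))  # found in the lexicon
print(g2p("ship"))    # falls back to the rules
```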

16
Q

What is a ‘phone’ in relation to linguistics?

A

A phone is any distinct speech sound as actually produced by a speaker, regardless of whether it distinguishes meaning in a particular language (unlike a phoneme).

17
Q

What is the RNN-based Tacotron?

A

It is an encoder-decoder with attention. It predicts mel spectrograms.

18
Q

What does transformer-based ‘FastSpeech’ predict?

A

It is a transformer that predicts mel spectrograms in parallel!

19
Q

The ‘FastSpeech’ has an attention mechanism. True or false?

A

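False. It has no encoder-decoder attention for aligning text and speech; a duration predictor takes over that role. (Its Transformer blocks still use self-attention.)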

20
Q

‘FastSpeech’ explicitly predicts duration, f0 and energy and is really fast. True or false?

A

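True for the FastSpeech family as a whole. The original FastSpeech explicitly predicts only duration; FastSpeech 2 adds explicit pitch (F0) and energy predictors.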
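
Explicit durations are what let the model generate all frames in parallel: each phoneme-level hidden state is simply repeated for as many frames as its predicted duration (the step FastSpeech calls the length regulator). A minimal sketch, with plain lists standing in for hidden-state vectors and hand-picked durations standing in for the duration predictor's output:

```python
def length_regulate(phoneme_states, durations):
    """Expand phoneme-level states to frame level by repeating each one.

    durations[i] is the number of spectrogram frames assigned to phoneme i.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, duration in zip(phoneme_states, durations):
        frames.extend([state] * duration)
    return frames


# Hypothetical 2-dimensional hidden states for three phonemes.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(states, [2, 1, 3])
print(len(frames))  # one entry per output frame
```

Scaling every duration by a constant before the expansion is a simple way to speed up or slow down the resulting speech.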

21
Q

‘FastSpeech2’ trains on the ground-truth mel spectrogram instead of the one predicted by an autoregressive (AR) teacher model. Thus the voice quality is better. True or false?

A

True

22
Q

Attention-based models have 3 benefits that are?

A
  1. No alignments are needed
  2. They are adaptable to diverse or noisy datasets
  3. They are capable of more natural prosody (rhythm, intonation etc.)
23
Q

Duration-based models have 4 benefits that are?

A
  1. Fast parallel inference
  2. Less chance of alignment problems
  3. Easier to train if alignments are available
  4. More robust to silence in training data (which you can get rid of using for example a VAD filter)
24
Q

Why was the ‘Vocoder’ initially made?

A

It was conceived to reduce the bandwidth necessary to transmit an intelligible voice.

25
Q

What does the ‘Vocoder’ do in 2 steps?

A
  1. It splits speech into source and frequency bands (acoustic features!)
  2. It generates a waveform from the acoustic features (in an attempt to reconstruct the phase information)
26
Q

What is the Griffin-Lim algorithm used for?

A

It is used for reconstructing an audio signal from its magnitude spectrogram. In other words, it is used to generate waveforms.

27
Q

The WaveNet Vocoder is an autoregressive model that is extremely slow. True or false?

A

True

28
Q

What is a ‘Latent Space’?

A

It is a hidden state or bottleneck that contains the compressed knowledge representations.

29
Q

What do ‘Global Style Tokens’ capture?

A

Stylistic attributes or characteristics of speech regardless of language.

30
Q

How are ‘Global Style Tokens’ learned?

A

Jointly with the TTS model and without explicit style labels, from large datasets with diverse speech styles.

31
Q

The ‘Global Style Tokens’ compress the latent space and learn from the mel spectrogram. True or false?

A

True

32
Q

The ‘Global Style Tokens’ return interpretable ‘labels’ that can be used to modify speaking style. True or false?

A

True

33
Q

What characterizes a flow-based model?

A

Flow-based models use a sequence of invertible transformations to map a simple distribution (such as a Gaussian) to the complex target distribution (such as speech). Each transformation can be computed easily both forward and backward, so the same model gives an exact likelihood for training and fast sampling for generation.
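
A single invertible step can be sketched as an affine coupling layer, a building block commonly used in flow models. In this illustration `scale_and_shift` is a fixed toy function standing in for the neural network that a real model would learn, and the function names are assumptions for the example.

```python
import math


def scale_and_shift(xa):
    """Stand-in for the network predicting a log-scale s and a shift t from xa."""
    s = [math.tanh(v) for v in xa]
    t = [0.5 * v for v in xa]
    return s, t


def coupling_forward(x):
    """Data -> latent. The first half passes through; the second half is rescaled."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    s, t = scale_and_shift(xa)
    zb = [b * math.exp(si) + ti for b, si, ti in zip(xb, s, t)]
    log_det = sum(s)  # log |det Jacobian| of this triangular map
    return xa + zb, log_det


def coupling_inverse(z):
    """Latent -> data. Exact inverse, because xa is available unchanged."""
    half = len(z) // 2
    za, zb = z[:half], z[half:]
    s, t = scale_and_shift(za)
    xb = [(b - ti) * math.exp(-si) for b, si, ti in zip(zb, s, t)]
    return za + xb


def log_likelihood(x):
    """Exact log-density of x under a standard-normal latent prior."""
    z, log_det = coupling_forward(x)
    log_prior = sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in z)
    return log_prior + log_det


x = [0.3, -1.2, 2.0, 0.7]
z, log_det = coupling_forward(x)
x_back = coupling_inverse(z)
print(max(abs(a - b) for a, b in zip(x, x_back)))  # reconstruction error
```

Stacking several such layers, with the halves swapped or mixed between them, gives a transformation that is expressive yet still cheap to invert.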

34
Q

‘GlowTTS’ enables fast, diverse and controllable speech synthesis because of flow (a flow-based model). True or false?

A

True

35
Q

What is the ‘Mean Opinion Score’?

A

It is a subjective test with ratings on a scale from 1-5. It needs many native listeners and a carefully thought out design.
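
As a sketch of how the ratings are usually summarised, the snippet below reports the mean together with a 95% confidence interval. The ratings are invented, and the interval uses a normal approximation, which is only rough for a panel this small.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of the confidence interval) for 1-5 ratings."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one system.
ratings = [4, 5, 3, 4, 4, 5, 2, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The width of the interval is one reason many listeners are needed: with few ratings, two systems can rarely be told apart.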

36
Q

What is an ‘AB Test’?

A

It is a subjective preference test in which each subject indicates a preference for either A or B. Again, it needs many native listeners and a careful design.
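
Whether an observed preference is more than chance can be checked with a two-sided exact binomial (sign) test, sketched below. The vote counts are invented, and the test assumes independent judgments with ties already discarded.

```python
from math import comb


def ab_preference_p_value(prefer_a, prefer_b):
    """Two-sided exact binomial test of 'A and B are equally preferred'."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcome: 18 of 20 listeners preferred system A.
p = ab_preference_p_value(18, 2)
print(f"p = {p:.5f}")
```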

37
Q

What is the ‘Perceptual Evaluation of Speech Quality’ (PESQ) used to assess?

A

It is an objective evaluation method designed to assess voice communication systems, and it needs a reference signal. The scores range between -0.5 and 4.5.

38
Q

What is the evaluation method ‘Mel Cepstral Distortion’?

A

It is an evaluation method that requires a reference speech signal.
It extracts the MFCCs from the synthesized and the reference speech signal and calculates the Euclidean distance between the two.

Simple: It is the Euclidean distance between the mel-cepstral features of the synthesized speech and those of the reference speech.
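
In its commonly used form the distance is scaled to decibels and averaged over frames: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2). The sketch below assumes the two feature sequences are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient has been dropped; the feature values are invented.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over aligned frames of mel-cepstral coefficients."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need two aligned, non-empty frame sequences")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Euclidean distance between the two coefficient vectors of this frame.
        total += math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
    return scale * total / len(ref_frames)


# Hypothetical 2-coefficient features for two frames.
reference = [[1.0, 2.0], [0.0, 0.0]]
synthesized = [[1.0, 2.0], [3.0, 4.0]]
print(mel_cepstral_distortion(reference, synthesized))
```

Lower is better, and identical feature sequences give exactly 0 dB.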

39
Q

What does the ‘Mel Cepstral Distortion’ not capture?

A

Things like prosody, intonation or pronunciation.

40
Q

Recall the WER. What does it stand for?

A

The Word Error Rate

41
Q

What is the equation for the WER?

A

WER = (S + D + I) / N
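
Here S, D and I are the numbers of substituted, deleted and inserted words, and N is the number of words in the reference. Their sum is the word-level edit distance, so the formula can be sketched with a standard dynamic program. The example sentences are made up; in a TTS evaluation the hypothesis would be a speech recognizer's transcript of the synthesized audio.

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[-1][-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) against six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the WER can exceed 1 when the hypothesis is much longer than the reference.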

42
Q

What does the WER measure?

A

The percentage of words that are incorrectly recognized when the synthesized output is transcribed (typically by a speech recognizer) and compared to the reference text. It is a good measure of intelligibility.

43
Q

Summary question: Text Analysis refers to?

A

The normalization of text (non-standard words etc.), POS tagging (grammatical tagging), prosody prediction and G2P conversion.

44
Q

Summary question: Acoustic model is the?

A

Model that generates the intermediate spectrogram with SPSS, RNN or transformers.

45
Q

Summary question: Waveform generation is made with?

A

Griffin-Lim, WaveNet or HiFiGAN vocoders.

46
Q

Summary question: Speaker and style embeddings refer to the?

A

The latent space and global style tokens

47
Q

Summary question: End-to-end models in speech synthesis are for example?

A

Flow-based models, GlowTTS and VITS.

48
Q

Summary question: Speech synthesis evaluation tools are both objective and subjective. True or false?

A

True

49
Q
A
50
Q

VITS is a combination of GlowTTS and the HiFiGAN vocoder and uses monotonic alignment search. True or false?

A

True
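
Monotonic alignment search can be pictured as a dynamic program that assigns every spectrogram frame to exactly one text token, never moving backwards and never skipping a token, while maximising the total log-likelihood. The sketch below follows that description only; the log-likelihood matrix is invented and the function is not taken from any library.

```python
def monotonic_alignment_search(log_lik):
    """Return, for each frame, the index of the text token it is aligned to.

    log_lik[i][j] is the log-likelihood of frame j under text token i.
    """
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    # q[i][j] = best total score of an alignment that puts frame j on token i
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]                               # same token as before
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # move to the next token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last frame, which must sit on the last token.
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if i > 0 and (j == i or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return alignment


# Hypothetical scores: token 0 fits the first two frames, token 1 the last two.
scores = [
    [-1.0, -1.0, -5.0, -5.0],
    [-5.0, -5.0, -1.0, -1.0],
]
print(monotonic_alignment_search(scores))
```

Counting how many frames each token receives turns the alignment into per-token durations.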

51
Q

The HiFiGAN Vocoder uses a Generative Adversarial Network (GAN) and the generator is a fully convolutional network. True or false?

A

True