08 - Speech Synthesis Flashcards by Joachim Andreasen

What, in essence, is Articulatory Speech Synthesis?

Articulatory synthesis is a technique used to generate speech by replicating the movements of human articulators such as lips, tongue, glottis, and the vocal tract.

What are examples of ‘Articulatory Speech Synthesis’?

Wolfgang von Kempelen's speaking machine (late 1700s) and the Voder from 1939.

What is ‘Formant Speech Synthesis’?

The formant synthesis methodology follows the simplified source-filter model. The rules to control the model are typically created by linguists in an effort to closely replicate the evolution of the formant structure and other spectral features of normal speech.

What is ‘Concatenative Speech Synthesis’?

It is an approach that pieces together recorded speech units from a database, choosing the sequence of units that minimizes the combined selection (target) and concatenation (join) costs.
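
The cost minimization can be sketched as a small dynamic-programming (Viterbi-style) search. This is a minimal illustration, not a production unit-selection system: the function name `select_units`, the use of a single pitch value per unit, and both cost functions are assumptions made for the example.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per position so that the summed target and
    join costs are minimal (hypothetical helper for illustration)."""
    # best[i][k] = (cheapest cumulative cost ending in candidate k, backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            # Cheapest way to arrive at u from any candidate at position i - 1.
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)

    # Trace the cheapest path back from the last position.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    path.reverse()
    return path, total


# Toy example: each unit is described only by its pitch in Hz (an assumption).
targets = [100, 110, 120]
candidates = [[98, 130], [112, 90], [119, 150]]
path, total = select_units(
    targets,
    candidates,
    target_cost=lambda t, u: abs(t - u),       # distance to the desired pitch
    join_cost=lambda p, u: 0.5 * abs(p - u),   # pitch jump at the join
)
```

In this toy setup the search prefers units that are both close to the target and close to their neighbours, which is the trade-off the two costs express.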

‘Concatenative Speech Synthesis’ requires many recordings from many speakers. True or false?

False. It requires many recordings from a single speaker.

‘Concatenative Speech Synthesis’ is very intelligible, but does not sound natural and lacks emotional expressiveness. True or false?

True

‘Statistical Parametric Speech Synthesis’ (SPSS) is what?

The basic idea is to first generate acoustic parameters with a statistical model (typically an HMM), and then reconstruct speech from these parameters with a vocoder.

Why is Statistical Parametric Speech Synthesis (SPSS) somewhat more effective than Concatenative Speech Synthesis (CSS)?

It requires less data and is more flexible.

Of the speech synthesis models (ignoring neural speech synthesis) which is the most intelligible?

Concatenative Speech Synthesis.

Neural speech synthesis is similar to SPSS, but it replaces the HMM and the vocoder components with what?

With a DNN component.

Can Neural Speech Synthesis become an end-to-end system?

Yes

What are the two main kinds of features in Text-to-speech (TTS)?

Acoustic features and linguistic features.

What is the TTS pipeline in simple terms?

Extract features (acoustic & linguistic).
Learn the mapping
Predict the features (acoustic & linguistic)
Synthesize waveform (sound signal/speech)

What are ‘Non-standard words’?

These are words such as:
13th
13
IV (roman)
% (percentage)
12:10 (time)
+- (symbols)
Av. / Ltd. (abbreviations)
NY / CPH (acronyms)
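
Text normalization turns such tokens into ordinary spoken words before synthesis. The sketch below is a deliberately tiny, lookup-based illustration: the tables, the chosen expansions (for example "Av." as "avenue"), and the function name `normalize` are all assumptions, and a real front end would need far broader rules and context handling.

```python
import re

# Tiny illustrative lookup tables (assumed expansions, not a complete lexicon).
ABBREVIATIONS = {"Ltd.": "limited", "Av.": "avenue"}
ORDINALS = {"13th": "thirteenth"}
CARDINALS = {"10": "ten", "12": "twelve", "13": "thirteen"}


def normalize(text):
    """Expand a few kinds of non-standard words into spoken form."""
    words = []
    for token in text.split():
        time = re.fullmatch(r"(\d{1,2}):(\d{2})", token)
        percent = re.fullmatch(r"(\d+)%", token)
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token in ORDINALS:
            words.append(ORDINALS[token])
        elif token in CARDINALS:
            words.append(CARDINALS[token])
        elif time and time.group(1) in CARDINALS and time.group(2) in CARDINALS:
            # Read a clock time as "hours minutes".
            words.append(CARDINALS[time.group(1)] + " " + CARDINALS[time.group(2)])
        elif percent and percent.group(1) in CARDINALS:
            words.append(CARDINALS[percent.group(1)] + " percent")
        else:
            # Unknown tokens pass through unchanged.
            words.append(token)
    return " ".join(words)
```

Tokens that the tables do not cover are left as they are, which makes the gaps in such a rule set easy to spot.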

What is the role of Grapheme-to-Phoneme (G2P)?

It maps graphemes (letters or groups of letters) to their corresponding phonemes (speech sounds).
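
A common way to picture G2P is a pronunciation lexicon with a fallback for unknown words. The sketch below assumes a two-word lexicon in ARPAbet-style symbols and a naive per-letter fallback; `LEXICON`, `LETTER_FALLBACK`, and `g2p` are hypothetical names, and real systems usually replace the fallback with a trained model.

```python
# Hypothetical mini-lexicon using ARPAbet-style phoneme symbols (no stress marks).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "speech": ["S", "P", "IY", "CH"],
}

# Naive one-letter-one-phoneme fallback, only for illustration.
LETTER_FALLBACK = {"a": "AE", "k": "K", "s": "S", "t": "T"}


def g2p(word):
    """Return a phoneme list: lexicon lookup first, letter rules otherwise."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    # Letters without a fallback rule are skipped in this sketch.
    return [LETTER_FALLBACK[ch] for ch in key if ch in LETTER_FALLBACK]
```

The lexicon entry for "speech" shows why a per-letter fallback is not enough in general: six letters map to four phonemes.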

What is a ‘phone’ in relation to linguistics?

A phone is any distinct speech sound, regardless of whether it distinguishes one word from another (unlike a phoneme).

What is the RNN-based Tacotron?

It is an encoder-decoder with attention. It predicts mel spectrograms.

What does transformer-based ‘FastSpeech’ predict?

It is a transformer that predicts mel spectrograms in parallel!

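‘FastSpeech’ uses an encoder-decoder attention mechanism to align text and speech. True or false?

False. It aligns with explicitly predicted durations instead (self-attention is still used inside its transformer blocks).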

‘FastSpeech’ explicitly predicts duration, f0 and energy and is really fast. True or false?

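True, with one caveat: the original FastSpeech explicitly predicts only duration; the f0 (pitch) and energy predictors were added in FastSpeech 2.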
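
Explicit durations are what allow all frames to be generated in parallel: each phoneme's hidden state is repeated for its predicted number of frames (the FastSpeech paper calls this step the length regulator). A minimal sketch, using strings as stand-ins for hidden vectors; the function name is an assumption.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme state by its predicted duration in frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        # A duration of 0 drops the phoneme; larger values stretch it.
        frames.extend([state] * n_frames)
    return frames


# Three phoneme states expanded to 2 + 1 + 3 = 6 output frames.
expanded = length_regulate(["h", "e", "y"], [2, 1, 3])
```

Scaling the durations before expansion would slow down or speed up the resulting speech.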

‘FastSpeech2’ trains on the ground-truth mel spectrogram instead of the output predicted by an autoregressive (AR) teacher model, which improves voice quality. True or false?

True

What are three benefits of attention-based models?

No alignments are needed
They are adaptable to diverse or noisy datasets
They are capable of more natural prosody (rhythm, intonation etc.)

What are four benefits of duration-based models?

Fast parallel inference
Less chance of alignment problems
Easier to train if alignments are available
More robust to silence in training data (which can be removed with, for example, a VAD filter)

Why was the ‘Vocoder’ initially made?

It was conceived to reduce the bandwidth needed to transmit intelligible speech.

What does the 'Vocoder' do in 2 steps?

1. It splits speech into a source signal and frequency bands (the acoustic features).
2. It generates a waveform from the acoustic features, attempting to reconstruct the phase information.

What is the Griffin-Lim algorithm used for?

It is used for reconstructing an audio signal from its magnitude spectrogram. In other words, it is used to generate waveforms.

The WaveNet Vocoder is an autoregressive model that is extremely slow. True or false?

True

What is a 'Latent Space'?

It is a hidden state or bottleneck that contains the compressed knowledge representations.

What do 'Global Style Tokens' capture?

Stylistic attributes or characteristics of speech regardless of language.

How are 'Global Style Tokens' learned?

With large datasets with diverse speech styles.

The 'Global Style Tokens' compress the latent space and learn from the mel spectrogram. True or false?

True

The 'Global Style Tokens' return interpretable 'labels' that can be used to modify speaking style. True or false?

True

What characterizes a flow-based model?

Flow-based models use a sequence of transformations to break down complex data (like speech waveforms) into simpler components. They learn how to transform a simple distribution into the target distribution, and these transformations can be easily computed both forward and backward. By doing so, flow-based models can understand and generate speech waveforms more effectively.
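
The invertibility can be made concrete with the simplest possible flow: a one-dimensional affine transform of a standard normal variable. This is only a sketch under that assumption; the class name `AffineFlow` is hypothetical, and real speech models stack many learned, higher-dimensional transforms.

```python
import math


def standard_normal_logpdf(z):
    """Log density of the simple base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2 * math.pi))


class AffineFlow:
    """x = scale * z + shift, an invertible map with a known Jacobian."""

    def __init__(self, scale, shift):
        if scale == 0:
            raise ValueError("scale must be non-zero so the map is invertible")
        self.scale = scale
        self.shift = shift

    def forward(self, z):
        # Simple distribution -> target distribution (used for generation).
        return self.scale * z + self.shift

    def inverse(self, x):
        # Target distribution -> simple distribution (used for likelihood).
        return (x - self.shift) / self.scale

    def log_prob(self, x):
        # Change of variables: log p(x) = log p(z) - log |dx/dz|.
        z = self.inverse(x)
        return standard_normal_logpdf(z) - math.log(abs(self.scale))


flow = AffineFlow(scale=2.0, shift=1.0)
sample = flow.forward(0.5)        # generate: push a base sample forward
recovered = flow.inverse(sample)  # invert: recover the base sample
```

Because both directions are cheap, the same model can be trained by exact likelihood (inverse direction) and sampled in one pass (forward direction).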

'GlowTTS' enables fast, diverse and controllable speech synthesis because of flow (it is a flow-based model). True or false?

True

What is the 'Mean Opinion Score'?

It is a subjective test with ratings on a scale from 1-5. It needs many native listeners and a carefully thought out design.
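
A MOS is usually reported as the mean rating together with a confidence interval. The sketch below assumes one rating per listener judgment and uses a normal approximation (factor 1.96) for the 95% interval; the function name and the example ratings are made up for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    # Standard error of the mean from the sample standard deviation.
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * stderr


mos, half_width = mean_opinion_score([4, 5, 3, 4, 4])
```

With only a handful of ratings the interval is wide, which is one reason the test needs many listeners.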

What is an 'AB Test'?

It is a subjective preference test in which each subject states a preference for either A or B. Again, it needs many native listeners and a careful design.
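
Whether an observed preference is more than chance can be checked with an exact two-sided binomial (sign) test, assuming every judgment is independent and ties have been discarded. This choice of test and the function name are assumptions for illustration, not something the cards prescribe.

```python
from math import comb


def ab_preference_p_value(prefer_a, prefer_b):
    """Two-sided p-value under the null hypothesis of no preference (p = 0.5)."""
    n = prefer_a + prefer_b
    if n == 0:
        raise ValueError("need at least one judgment")
    k = max(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null distribution is symmetric, so doubling gives the two-sided value.
    return min(1.0, 2 * tail)
```

A small p-value suggests listeners really do prefer one system; an even split gives a p-value of 1.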

What is the 'Perceptual Evaluation of Speech Quality' (PESQ) used to assess?

It is an evaluation method designed to assess voice communication systems and needs a reference signal. The scores range between -0.5 and 4.5.

What is the evaluation method 'Mel Cepstral Distortion'?

It is an evaluation method that requires a reference speech signal. It extracts the MFCCs from the synthesized and the reference speech signals and calculates the Euclidean distance between the two. Simply put: it is the Euclidean distance between the synthesized speech and the reference speech in the cepstral domain.
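
In practice the distance is usually scaled to decibels and averaged over frames. The sketch below uses the commonly quoted form, MCD = (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), and assumes that the two frame sequences are already time-aligned (for example with DTW) and that coefficient 0 (overall energy) is skipped; the function name is hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over time-aligned frames of cepstral coefficients."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Skip coefficient 0, which mostly reflects loudness.
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)
```

Lower values mean the synthesized spectrum is closer to the reference; identical inputs give 0 dB.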

What does the 'Mel Cepstral Distortion' not capture?

Things like prosody, intonation or pronunciation.

What does WER stand for?

The Word Error Rate.

What is the equation for the WER?

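WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words, and N is the number of words in the reference.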
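
The sum S + D + I is the word-level edit (Levenshtein) distance between the reference and the hypothesis, so the formula can be sketched as follows. Whitespace tokenization and the function name are assumptions; real evaluations usually also normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: ref word missing from hyp
                curr[j - 1] + 1,     # insertion: extra word in hyp
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, the value can exceed 1 when the hypothesis is much longer than the reference.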

What does the WER measure?

The proportion of words that a speech recognizer transcribes incorrectly from the synthesized output, compared to the reference text. It is a good measure of intelligibility.

Summary question: Text Analysis refers to?

The normalization of text (non-standard words etc.), POS tagging (grammatical tagging), Prosody prediction and G2P conversion.

Summary question: Acoustic model is the?

Model that generates the intermediate spectrogram with SPSS, RNN or transformers.

Summary question: Waveform generation is made with?

Griffin-Lim, WaveNet or HiFiGAN vocoders.

Summary question: Speaker and style embeddings refer to?

The latent space and global style tokens.

Summary question: End-to-end models in speech synthesis are for example?

Flow-based models, GlowTTS and VITS.

Summary question: Speech synthesis evaluation tools are both objective and subjective. True or false?

True

VITS is a combination of GlowTTS and the HiFiGAN vocoder and uses monotonic alignment search. True or false?

True

The HiFiGAN Vocoder uses a Generative Adversarial Network (GAN) and the generator is a fully convolutional network. True or false?

True
