08 - Speech Synthesis Flashcards
What, in essence, is ‘Articulatory Speech Synthesis’?
Articulatory synthesis is a technique used to generate speech by replicating the movements of human articulators such as lips, tongue, glottis, and the vocal tract.
What are examples of ‘Articulatory Speech Synthesis’?
Von Kempelen's speaking machine (late 1700s) and the Voder (1939).
What is ‘Formant Speech Synthesis’?
The formant synthesis methodology follows the simplified source-filter model. The rules to control the model are typically created by linguists in an effort to closely replicate the evolution of the formant structure and other spectral features of normal speech.
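The source-filter idea can be illustrated with a minimal sketch: a pulse train stands in for the glottal source, and a cascade of second-order resonators stands in for the vocal-tract filter, one resonator per formant. The function names, the fixed pitch, and the formant values for an /a/-like vowel are illustrative assumptions, not rules from any particular synthesizer.

```python
import math

def impulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit pulse per pitch period (a crude glottal excitation)."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter: second-order IIR resonator producing one formant peak."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    gain = 1.0 - a1 - a2  # normalises the gain at 0 Hz to 1
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # scale to the range [-1, 1]

# Assumed, roughly textbook-like formants for an /a/-like vowel: (Hz, bandwidth Hz)
VOWEL_A = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
samples = synthesize_vowel(VOWEL_A)
```

In a rule-based formant synthesizer the formant frequencies would change over time according to hand-written rules; here they are held constant to keep the sketch short.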
What is ‘Concatenative Speech Synthesis’?
It is an approach that pieces together recorded speech units from a database, choosing the sequence of units that minimizes the selection (target) cost and the concatenation (join) cost.
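A minimal sketch of that cost minimization, assuming a toy database in which each recorded unit is described only by its phone label and its pitch at the start and end. The `Unit` fields, both cost functions, and the example values are hypothetical; real systems use much richer features. The search is a standard dynamic-programming (Viterbi-style) pass over the candidate units.

```python
from collections import namedtuple

# Hypothetical unit description: phone label plus pitch (Hz) at both edges.
Unit = namedtuple("Unit", "name phone start_pitch end_pitch")

def target_cost(unit, target_pitch):
    """Selection cost: how far the unit's mean pitch is from the requested pitch."""
    return abs((unit.start_pitch + unit.end_pitch) / 2.0 - target_pitch)

def concat_cost(prev_unit, unit):
    """Concatenation cost: pitch jump at the join between two units."""
    return abs(prev_unit.end_pitch - unit.start_pitch)

def select_units(targets, database):
    """targets: list of (phone, desired_pitch). Returns (total_cost, units)."""
    layers = []        # one dict per target: unit -> (cumulative cost, back-pointer)
    prev_layer = None
    for phone, pitch in targets:
        candidates = [u for u in database if u.phone == phone]
        if not candidates:
            raise ValueError(f"no recorded unit for phone {phone!r}")
        layer = {}
        for u in candidates:
            tc = target_cost(u, pitch)
            if prev_layer is None:
                layer[u] = (tc, None)
            else:
                best_prev = min(
                    prev_layer,
                    key=lambda p: prev_layer[p][0] + concat_cost(p, u),
                )
                cost = prev_layer[best_prev][0] + concat_cost(best_prev, u) + tc
                layer[u] = (cost, best_prev)
        layers.append(layer)
        prev_layer = layer
    # Backtrack from the cheapest final unit.
    last = min(prev_layer, key=lambda u: prev_layer[u][0])
    total = prev_layer[last][0]
    path = [last]
    for layer in reversed(layers[1:]):
        path.append(layer[path[-1]][1])
    path.reverse()
    return total, path

database = [
    Unit("h_1", "h", 110, 115),
    Unit("h_2", "h", 140, 150),
    Unit("ai_1", "ai", 150, 120),
    Unit("ai_2", "ai", 118, 112),
]
total, path = select_units([("h", 120), ("ai", 115)], database)
print(total, [u.name for u in path])
```

The cheapest sequence is not always the one with the best unit at each position; a unit that matches its target slightly worse can still win if it joins more smoothly to its neighbours.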
‘Concatenative Speech Synthesis’ requires many recordings from many speakers. True or false?
False. It requires many recordings from a single speaker.
‘Concatenative Speech Synthesis’ is very intelligible, but does not sound natural and lacks emotional expressiveness. True or false?
True
What is ‘Statistical Parametric Speech Synthesis’ (SPSS)?
The basic idea is to first generate acoustic parameters with a statistical model (typically an HMM), and then recover the speech waveform from these parameters using a vocoder.
Why is Statistical Parametric Speech Synthesis (SPSS) somewhat more effective than Concatenative Speech Synthesis (CSS)?
It requires less data and is more flexible.
Of the speech synthesis models (ignoring neural speech synthesis), which is the most intelligible?
Concatenative Speech Synthesis.
Neural speech synthesis is similar to SPSS, but it replaces the HMM and the vocoder components with what?
With deep neural network (DNN) components.
Can neural speech synthesis be built as an end-to-end system?
Yes
What are the two main types of features in text-to-speech (TTS)?
Acoustic features and linguistic features.
What is the TTS pipeline in simple terms?
- Extract features (acoustic & linguistic).
- Learn the mapping.
- Predict the features (acoustic & linguistic).
- Synthesize the waveform (sound signal/speech).
What are ‘Non-standard words’?
These are words such as:
- 13th (ordinal number)
- 13 (cardinal number)
- IV (Roman numeral)
- % (percentage)
- 12:10 (time)
- +- (symbols)
- Av. / Ltd. (abbreviations)
- NY / CPH (acronyms)
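Non-standard words like these have to be expanded into ordinary words before pronunciation can be predicted. The sketch below is a toy normalizer: the lookup table, the function names, and the choice to read "IV" as a Roman numeral are assumptions for illustration, and it only handles whitespace-separated tokens with numbers below 100.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}
# Hypothetical expansion table; a real system needs context to disambiguate.
LEXICON = {"%": "percent", "Ltd.": "limited", "Av.": "avenue",
           "NY": "New York", "CPH": "Copenhagen", "IV": "four"}

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def ordinal_to_words(n):
    """Spell out an ordinal in the range 0..99, e.g. 13 -> thirteenth."""
    head, sep, last = number_to_words(n).rpartition("-")
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + sep + last

def normalize_token(token):
    if token in LEXICON:
        return LEXICON[token]
    match = re.fullmatch(r"(\d{1,2})(st|nd|rd|th)", token)
    if match:
        return ordinal_to_words(int(match.group(1)))
    match = re.fullmatch(r"(\d{1,2}):(\d{2})", token)
    if match:
        hours, minutes = int(match.group(1)), int(match.group(2))
        if minutes == 0:
            return number_to_words(hours) + " o'clock"
        spoken = number_to_words(minutes)
        return number_to_words(hours) + (" oh " if minutes < 10 else " ") + spoken
    if re.fullmatch(r"\d{1,2}", token):
        return number_to_words(int(token))
    return token  # ordinary word: leave unchanged

def normalize(text):
    return " ".join(normalize_token(t) for t in text.split())

print(normalize("The 13th train leaves at 12:10 from NY"))
```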
What is the role of Grapheme-to-Phoneme (G2P)?
Its role is to map graphemes (letters or groups of letters) to their corresponding phonemes (speech sounds).
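A minimal G2P sketch, assuming a tiny pronunciation lexicon with a greedy letter-to-sound fallback for unknown words. The lexicon entries, the rule table, and the ARPAbet-style phoneme labels are illustrative assumptions; practical systems use large dictionaries plus a trained model.

```python
# Hypothetical lexicon: known words map directly to phoneme sequences.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "cat": ["K", "AE", "T"],
}

# Hypothetical fallback rules: two-letter graphemes are tried before single letters.
RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"],
    "a": ["AE"], "i": ["IH"], "c": ["K"],
    "p": ["P"], "s": ["S"], "t": ["T"],
}

def g2p(word):
    """Return a phoneme list for `word`: lexicon lookup first, then rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes = []
    i = 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            phonemes.append("<unk>")  # no rule covers this letter
            i += 1
    return phonemes

print(g2p("speech"), g2p("ship"), g2p("chat"))
```

Trying the longer grapheme first is what lets "sh" map to a single phoneme instead of being read as "s" followed by "h".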
What is a ‘phone’ in linguistics?
A phone is any distinct speech sound or unit of sound, as physically produced by a speaker.
What is the RNN-based ‘Tacotron’?
It is an encoder-decoder model with attention that predicts mel spectrograms frame by frame (autoregressively).
What does transformer-based ‘FastSpeech’ predict?
It predicts mel spectrograms, generating all frames in parallel (non-autoregressively).
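‘FastSpeech’ uses an encoder-decoder attention mechanism to align text with speech. True or false?
False. It replaces attention-based alignment with an explicit duration predictor (its transformer blocks still use self-attention).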
‘FastSpeech’ (in its FastSpeech 2 variant) explicitly predicts duration, F0 (pitch) and energy, and is very fast at inference. True or false?
True
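Explicit duration prediction is what makes parallel generation possible: once each phoneme has a predicted length in frames, the phoneme sequence can be expanded to the length of the mel spectrogram in one step. The sketch below shows that expansion (the "length regulator" idea) on plain Python lists; the function name and the string placeholders standing in for hidden-state vectors are assumptions for illustration.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme state for its predicted number of frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames

# Two phonemes lasting 2 and 3 frames give a 5-frame sequence.
print(length_regulate(["h", "ai"], [2, 3]))
```

Scaling the predicted durations before the expansion is a simple way to speed speech up or slow it down.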