Topic 11: Speech Synthesis Flashcards
Recap on speech synthesis block diagram
text analysis -> phonetic analysis -> prosodic analysis -> speech synthesis
Text analysis
document structure detection - this will determine how TTS will be implemented
text normalization
linguistic analysis
Document structure detection
flat file or file content
Text normalization: transliteration
convert the text into a standardized format
example:
transliteration
how to map hangeul to a standard form
hangeul consonant
hangeul vowel
example of transliteration
this is not translation; it maps the written form (and so its pronunciation) into a standard script, e.g. Latin letters
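A minimal sketch of hangeul transliteration, assuming precomposed Unicode syllables (U+AC00 to U+D7A3). The decomposition arithmetic follows the Unicode layout of 19 initials x 21 vowels x 28 finals; the romanization tables are a simplified assumption loosely based on Revised Romanization and ignore sound changes between syllables. The function name `romanize` is hypothetical.

```python
# Simplified romanization tables (assumption: loosely based on Revised
# Romanization; sound-change rules across syllables are ignored).
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
MEDIALS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa",
           "wae", "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]

HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def romanize(text):
    """Map each hangeul syllable to Latin letters; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            # Each initial consonant covers 21 * 28 syllables,
            # each vowel covers 28 (one per possible final).
            initial, rest = divmod(index, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIALS[initial] + MEDIALS[medial] + FINALS[final])
        else:
            out.append(ch)
    return "".join(out)


print(romanize("한글"))  # → hangeul
```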
Text normalization: dealing with different formats
symbols
number format
combination
abbreviation and acronym
normalizing numbers
- phone number
- dates
- times
- money and currency
- account number
- ordinal number
- cardinal number
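A small sketch of why the number types above need different handling: a cardinal is read as a quantity, while a phone or account number is usually read digit by digit. The function names and the 0 to 999 range are assumptions made for illustration.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n):
    """Spell out an integer from 0 to 999 as a cardinal number."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + cardinal(rest) if rest else "")


def digit_by_digit(s):
    """Read a phone/account number one digit at a time, skipping separators."""
    return " ".join(ONES[int(c)] for c in s if c.isdigit())


print(cardinal(305))               # → three hundred and five
print(digit_by_digit("555-0199"))  # → five five five zero one nine nine
```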
Text normalization
different problems need different approaches
can use regular expressions (RE)
- test pattern
- replace
- search substring
example
- extract substring
- replace
- test
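The three RE operations above (test, replace, search/extract) can be sketched with Python's standard `re` module. The patterns and function names are illustrative assumptions, not a complete normalizer.

```python
import re

TIME_RE = re.compile(r"\d{1,2}:\d{2}")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_time(token):
    """Test: does the whole token look like a clock time such as 10:30?"""
    return TIME_RE.fullmatch(token) is not None


def expand_percent(text):
    """Replace: spell out the percent sign that follows a digit."""
    return re.sub(r"(\d)%", r"\1 percent", text)


def find_date(text):
    """Search/extract: return the first YYYY-MM-DD substring, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


print(is_time("10:30"))                  # → True
print(expand_percent("up 50% today"))    # → up 50 percent today
print(find_date("due 2024-03-15, noon")) # → 2024-03-15
```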
Linguistic Analysis
processing text based on linguistic feature of the language
support phonetic and prosodic generation
modular functions required for TTS:
- Sentence breaking/tokenizer
- POS (give example)
- homograph disambiguation (example)
- noun phrase & clause detection
- sentence type disambiguation
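A toy sketch of homograph disambiguation: the same spelling gets a different pronunciation depending on its POS tag. The mini-lexicon, the ARPAbet-style transcriptions, and the function name are illustrative assumptions; the POS tag is assumed to come from an upstream tagger that is not shown.

```python
# Hypothetical mini-lexicon: (word, POS tag) -> ARPAbet-style pronunciation.
HOMOGRAPHS = {
    ("record", "NOUN"): "R EH1 K ER0 D",
    ("record", "VERB"): "R IH0 K AO1 R D",
    ("present", "NOUN"): "P R EH1 Z AH0 N T",
    ("present", "VERB"): "P R IY0 Z EH1 N T",
}


def pronounce(word, pos):
    """Return the pronunciation for this (word, POS) pair, or None if unknown."""
    return HOMOGRAPHS.get((word.lower(), pos))


# "They record a record": verb first, noun second.
print(pronounce("record", "VERB"))  # → R IH0 K AO1 R D
print(pronounce("record", "NOUN"))  # → R EH1 K ER0 D
```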
Phonetic Analysis
conversion of grapheme to phoneme
written word to pronunciation form
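One common way to organise grapheme-to-phoneme conversion is a lexicon lookup with a rule-based fallback for unknown words. The sketch below assumes a tiny hand-made lexicon and a deliberately crude one-letter-one-sound fallback table; both are hypothetical and far simpler than a real letter-to-sound module.

```python
# Hypothetical pronunciation lexicon (ARPAbet-style symbols).
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "text": ["T", "EH1", "K", "S", "T"],
}

# Crude fallback: one rough sound per letter (an assumption for illustration).
LETTER_TO_SOUND = {
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"],
    "x": ["K", "S"], "y": ["Y"], "z": ["Z"],
}


def g2p(word):
    """Look the word up in the lexicon; otherwise fall back to letter rules."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes = []
    for letter in word:
        # Letters outside a-z are skipped in this sketch.
        phonemes.extend(LETTER_TO_SOUND.get(letter, []))
    return phonemes


print(g2p("text"))  # lexicon hit
print(g2p("cat"))   # fallback rules
```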
Prosodic analysis
prosody is the melody of speech
style, rhythm, timbre
intonation
stylisation of sound: how you tune the sound
the controllable acoustic features are limited
- pitch / f0
- duration
- intensity
the problem with modifying intensity is that the frequency changes as well
pitch and duration are best not constructed from scratch
the Klatt formant synthesizer is a voice filter model that creates speech by manipulating pitch, noise and formant information
its speech quality is not good
so speech segment concatenation is used instead
Klatt Duration Model
Klatt studied a duration model for English phonemes
based on thousands of samples; the basic duration obtained is called the inherent duration
final duration depends on
- categories of neighbouring units
- position of the phoneme in the syllable
- other constituents of the syllable
- position of the syllable in the word
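A sketch of how the inherent duration can be adjusted by context. Klatt's rule is commonly written as DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100, where each contextual rule scales PRCNT. The minimum duration is not mentioned in these notes and is taken from that formulation as an assumption; all numbers below are made up for illustration.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_percents):
    """Apply contextual rules to an inherent duration.

    rule_percents: one percentage per rule that fires (e.g. 140 lengthens,
    60 shortens). Each rule scales PRCNT multiplicatively, starting from 100.
    """
    prcnt = 100.0
    for p in rule_percents:
        prcnt *= p / 100.0
    # Only the part above the minimum duration is stretched or compressed.
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0


# Hypothetical phoneme: inherent 200 ms, minimum 100 ms.
print(klatt_duration(200, 100, []))    # no rule fires → 200.0
print(klatt_duration(200, 100, [50]))  # one shortening rule → 150.0
```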
Prosodic analysis : MBROLA
a speech synthesiser engine based on concatenation of diphones
concatenation of diphone waveforms using TD-PSOLA
can manipulate duration and pitch
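Duration and pitch are passed to MBROLA in a plain-text phoneme file: each line gives a phoneme, its duration in ms, and optional pairs of (position in % of the duration, pitch in Hz). The sketch below only builds such lines; the helper name `pho_line` is hypothetical, and the phoneme symbols are placeholders because the valid set depends on the diphone database in use.

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one line: phoneme, duration, then (position %, pitch Hz) pairs."""
    fields = [phoneme, str(duration_ms)]
    for position_pct, pitch_hz in pitch_points:
        fields += [str(position_pct), str(pitch_hz)]
    return " ".join(fields)


# Placeholder phoneme symbols; "_" is assumed to be the silence symbol.
lines = [
    pho_line("_", 100),
    pho_line("h", 80),
    pho_line("a", 120, [(0, 110), (100, 130)]),  # pitch rises from 110 to 130 Hz
    pho_line("_", 100),
]
print("\n".join(lines))
```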
Prosodic analysis : MOMEL
Modelling of melody (MOdélisation de MELodie)
INTSINT - INternational Transcription System for INTonation
objective
- how to model the melody of speech
- how to represent intonation independently of the language being analyzed
MOMEL follows the existing pitch contour using a quadratic spline function
voiceless parts are interpolated as well, so there are no discontinuities
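A sketch of the interpolation step only, assuming the pitch target points are already known (target detection is not shown). Between two targets the curve is built from two parabolas that meet at the midpoint with zero slope at each target, which is one common way to describe a quadratic spline transition. Because the curve is defined at every time, voiceless stretches get a pitch value too. The function name and the target values are hypothetical.

```python
def quadratic_spline(targets, t):
    """Evaluate a smooth f0 curve at time t.

    targets: list of (time_s, f0_hz) pairs sorted by time.
    """
    if t <= targets[0][0]:
        return targets[0][1]
    if t >= targets[-1][0]:
        return targets[-1][1]
    for (t1, f1), (t2, f2) in zip(targets, targets[1:]):
        if t1 <= t <= t2:
            span = t2 - t1
            rise = f2 - f1
            if t <= (t1 + t2) / 2:
                # First half: parabola leaving f1 with zero slope.
                return f1 + 2 * rise * ((t - t1) / span) ** 2
            # Second half: parabola arriving at f2 with zero slope.
            return f2 - 2 * rise * ((t2 - t) / span) ** 2


targets = [(0.0, 100.0), (1.0, 200.0)]  # made-up target points
print(quadratic_spline(targets, 0.25))  # → 112.5
print(quadratic_spline(targets, 0.5))   # → 150.0
print(quadratic_spline(targets, 0.75))  # → 187.5
```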