Week 8 Flashcards
State the audio features that have been used for emotion recognition
1) pitch, intensity & duration of speech
2) spectral energy distribution
3) Mel Frequency Cepstral Coefficients (MFCCs) & ∆MFCCs
4) average Zero Crossings Density (ZCD)
5) filter-bank energies (FBE) parameters
Given a speech signal s, how can its pitch and intensity (or loudness) be measured?
1) Pitch: via the DFT (Eq. 17.1)
2) Intensity: via the syllable peak (Eq. 17.3); see the sketch below
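A minimal sketch of both measurements on one analysis frame, assuming s is a NumPy array of samples at rate sr; the DFT-peak pitch estimate and the dB energy formula are generic stand-ins for Eqs. 17.1 & 17.3, not reconstructions of them:

```python
import numpy as np

def frame_pitch_intensity(s, sr):
    """Estimate pitch (Hz) from the DFT magnitude peak and
    intensity (dB) from the mean-square frame energy."""
    spectrum = np.abs(np.fft.rfft(s * np.hanning(len(s))))  # window first
    freqs = np.fft.rfftfreq(len(s), d=1.0 / sr)
    band = (freqs >= 60) & (freqs <= 400)        # plausible F0 search range
    pitch_hz = freqs[band][np.argmax(spectrum[band])]
    intensity_db = 10.0 * np.log10(np.mean(s ** 2) + 1e-12)  # avoid log(0)
    return pitch_hz, intensity_db
```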
What is meant by harmonics-to-noise ratio?
- quantifies voice roughness: the ratio of the periodic (harmonic) energy of the signal to its noise energy
- ≈20 dB for young speakers (i.e., a 100:1 energy ratio, so ~99% of the signal's energy is periodic & the rest is noise)
What is the cepstrum of a speech signal?
1) Eq. 17.10 (the inverse DFT of the log magnitude spectrum of the signal)
2) Cepstrum of the sound source:
- voiced: pulses around n = 0, T, 2T, …, where T is the pitch period (the reciprocal of the fundamental frequency, or pitch)
- unvoiced: no distinct features
3) Cepstrum of the vocal tract impulse response (IR):
- non-negligible only over a small number of samples
- fewer samples than the number of samples in a pitch period
- so a low-time cepstrum window extracts it
4) Cepstral coefficients (CCs) separate the source from the vocal tract IR, e.g., use the low (low-quefrency) coefficients to analyse the vocal tract (a sketch follows)
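A minimal sketch of this source/filter separation, assuming a voiced NumPy frame s at rate sr; the real-cepstrum computation (inverse FFT of the log magnitude spectrum) is standard, while the 60-400 Hz search range is an assumption:

```python
import numpy as np

def real_cepstrum_pitch(s, sr):
    """Compute the real cepstrum and estimate the pitch period from
    the dominant peak in the high-time (excitation) region."""
    spectrum = np.abs(np.fft.fft(s * np.hanning(len(s))))
    cepstrum = np.fft.ifft(np.log(spectrum + 1e-12)).real
    # Voiced excitation appears as peaks at n = T, 2T, ...;
    # search quefrencies corresponding to 60-400 Hz pitch
    lo, hi = int(sr / 400), int(sr / 60)
    pitch_period = lo + np.argmax(cepstrum[lo:hi])
    # Low-time lifter: keep only the first coefficients, which
    # describe the slowly varying vocal tract envelope
    vocal_tract_cc = cepstrum[:lo]
    return sr / pitch_period, vocal_tract_cc  # pitch in Hz, envelope CCs
```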
Discuss the usefulness of audio features for emotion recognition from speech.
1) There is no single best feature set or most efficient recognition model ⇒ data-dependent
2) Select set of relevant features:
- prosodic features, e.g., pitch, energy, speaking rate
- lexical, disfluency, & discourse cues
- acoustic features, e.g., zero-crossing, spectrum, formants, MFCCs
3) Data-dependent ⇒
- use as many features as possible
- optimise the choice of features together with the recognition algorithm (a feature-extraction sketch follows)
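A minimal sketch of extracting such a mixed feature set with librosa (the choice of library is an assumption, not part of the source):

```python
import librosa

def emotion_features(path):
    """Extract prosodic and acoustic frame-level features from an audio file."""
    y, sr = librosa.load(path, sr=None)
    # Prosodic: pitch contour (pYIN; NaN on unvoiced frames) & short-term energy
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    energy = librosa.feature.rms(y=y)[0]
    # Acoustic: zero-crossing rate & MFCCs with their deltas
    zcr = librosa.feature.zero_crossing_rate(y)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    d_mfcc = librosa.feature.delta(mfcc)
    return f0, energy, zcr, mfcc, d_mfcc
```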
What are the characteristics of intonation groups of a pitch contour that are selected for feature extraction for emotion recognition?
1) complete IGs with largest pitch range or duration
2) monotonically decreasing or increasing IGs with largest pitch range or duration
3) monotonically decreasing or increasing IGs at start or end of a sentence
What are the three types of intonation groups that are labelled in Figure 2? Which are selected for feature extraction?
1) Type 1: A complete pitch segment that starts from the point of a pitch rise to the point of the next pitch rise
2) Type 2: A monotonically decreasing pitch segment
3) Type 3: A monotonically increasing pitch segment
4) Types 2 and 3 are selected for feature extraction (a labelling sketch follows)
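A minimal sketch of this labelling on a voiced pitch contour, assuming f0 is a NumPy array with one value per frame; the rise-to-rise boundary rule follows the Type 1 definition above, the rest is illustrative:

```python
import numpy as np

def label_intonation_groups(f0):
    """Split a pitch contour at pitch rises and label monotonic segments."""
    diff = np.diff(f0)
    # A pitch rise = the contour turns from falling/flat to rising
    rises = [i for i in range(1, len(diff)) if diff[i] > 0 >= diff[i - 1]]
    bounds = [0] + rises + [len(f0) - 1]
    groups = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = np.diff(f0[a:b + 1])
        if np.all(seg <= 0):
            kind = 2          # Type 2: monotonically decreasing
        elif np.all(seg >= 0):
            kind = 3          # Type 3: monotonically increasing
        else:
            kind = 1          # Type 1: complete rise-to-rise segment
        groups.append((a, b, kind))
    # Only Types 2 and 3 are kept for feature extraction
    return [g for g in groups if g[2] in (2, 3)]
```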
What is meant by jitter and shimmer in a speech signal?
1) Jitter:
- cycle-to-cycle variation of the pitch period
- reflects a lack of control over the vibration of the vocal cords
- patients with pathologies show higher jitter values
- 2 measures: absolute & relative
2) Shimmer:
- cycle-to-cycle variation of the amplitude of the pitch period
- changes with a reduction of glottal resistance & with mass lesions on the vocal cords
- patients with pathologies show higher shimmer values
- 2 measures: absolute & relative (a computation sketch follows)
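A minimal sketch of the two measures for each, assuming periods (seconds per glottal cycle) and amps (peak amplitude per cycle) are NumPy arrays; these are the common local jitter/shimmer formulas, which may differ in detail from the source's definitions:

```python
import numpy as np

def jitter(periods):
    """Absolute jitter: mean cycle-to-cycle period difference.
    Relative jitter: absolute jitter divided by the mean period."""
    absolute = np.mean(np.abs(np.diff(periods)))
    return absolute, absolute / np.mean(periods)

def shimmer(amps):
    """The same two measures applied to per-cycle peak amplitudes."""
    absolute = np.mean(np.abs(np.diff(amps)))
    return absolute, absolute / np.mean(amps)
```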
Motivation for IG feature extraction:
1) emotional state can be characterised by many speech features
2) trembling speech, unvoiced speech, speech duration & hesitation are useful characteristics for emotion detection
What articulatory mechanisms are involved in emotionally expressive speech?
1) emotional speech is associated with more peripheral articulatory motions than Neutral speech
2) tongue tip (TT), jaw & lip positioning are more advanced (extreme) in emotional speech than in Neutral speech
3) classification recall using articulatory features is higher than with acoustic features
4) ⇒ articulatory features carry valuable emotion-dependent information
Methods for collecting articulatory data
ultrasound, x-ray microbeam, electromagnetic articulography (EMA), & magnetic resonance imaging (MRI)
How does EMA work?
1) measures position of parts of the mouth
2) uses sensor coils placed on tongue & other parts of mouth
3) induction coils around the head produce an electromagnetic field that induces a current in the sensors in the mouth
4) the induced current is inversely proportional to the cube of the distance ⇒ gives the sensor coil's location (see the sketch below)
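A minimal sketch of the distance step, assuming the inverse-cube law above with a hypothetical calibration constant k; a real EMA system combines several transmitters to triangulate the 3-D position:

```python
def ema_distance(current, k):
    """Invert I = k / d**3 to recover the sensor-transmitter distance d."""
    return (k / current) ** (1.0 / 3.0)

# With distances to three or more fixed transmitter coils, the sensor's
# 3-D position follows by trilateration (not shown here).
```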
What are the current challenges with emotion recognition via speech?
1) Variability in speakers:
- inter-speaker variability: heterogeneous display of emotion & differences in individuals' vocal tract structures
- intra-speaker variability: a speaker can express an emotion in a number of ways & can be influenced by the context
2) Varying nature of the controls of the speech production components
State the three types of acoustic low-level descriptors (LLDs) for emotion recognition, and their characteristics.
1) Prosody: related to the rhythm, stress & intonation of speech; includes fundamental frequency (F0, or pitch), short-term energy & speech rate
2) Spectral characteristics: related to harmonic/resonant structures; include Mel-frequency cepstral coefficients (MFCCs) & Mel-filter bank energy coefficients (MFBs)
3) Voice quality-related measures: related to characteristics of vocal cord vibrations; include jitter, shimmer & the harmonics-to-noise ratio (HNR)
Processing steps for acoustic LLDs
1) Compute various statistical functionals (e.g., mean, standard deviation) on these LLDs at different time-scale granularities (e.g., at 0.1, 0.5, 1 & 10 sec, etc.)
2) Compute the dynamics at multi-level time scales using statistical functional operators
3) Select features to reduce the dimension, using a stand-alone method or mutual information (a sketch follows)
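A minimal sketch of steps 1 & 3, assuming lld is a NumPy array of shape (n_features, n_frames) and using scikit-learn's mutual-information scorer as a stand-in for whichever selection method the source intends:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def functionals(lld, win):
    """Mean & std of each LLD over non-overlapping windows of `win` frames."""
    n_feat, n_frames = lld.shape
    chunks = [lld[:, i:i + win] for i in range(0, n_frames - win + 1, win)]
    # One row per window: [means..., stds...]
    return np.array([np.concatenate([c.mean(axis=1), c.std(axis=1)])
                     for c in chunks])

def select_by_mi(X, y, k=20):
    """Keep the k functionals sharing the most mutual information
    with the emotion labels y (one label per row of X)."""
    mi = mutual_info_classif(X, y)
    return X[:, np.argsort(mi)[::-1][:k]]
```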