Week 8 Flashcards
State the audio features that have been used for emotion recognition
1) pitch, intensity & duration of speech
2) spectral energy distribution
3) Mel Frequency Cepstral Coefficients (MFCCs) & ∆MFCCs
4) average Zero Crossings Density (ZCD)
5) filter-bank energies (FBE) parameters
Given a speech signal s, how can its pitch and intensity (or loudness) be measured?
1) Pitch: via the DFT of the signal (equ. 17.1)
2) Intensity: at the syllable peak (equ. 17.3)
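The referenced equations are not reproduced here; as a rough sketch (not the textbook formulation), pitch can be read off the strongest DFT magnitude peak in a plausible fo range, and intensity from the mean squared amplitude of a frame:
```python
import numpy as np

def estimate_pitch_dft(frame, fs, fmin=50.0, fmax=500.0):
    """Rough pitch estimate: strongest DFT magnitude peak in a plausible fo range."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(spectrum[band])]

def intensity_db(frame, eps=1e-12):
    """Short-term intensity in dB from the mean squared amplitude of the frame."""
    return 10.0 * np.log10(np.mean(frame ** 2) + eps)
```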
What is meant by harmonics-to-noise ratio?
- a measure of voice roughness (ratio of periodic to noise energy)
- ≈20 dB for young speakers (i.e., ~99% of the signal energy is periodic & the rest is noise)
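A minimal sketch of the dB relation, assuming the periodic and noise energies have already been separated:
```python
import numpy as np

def hnr_db(periodic_energy, noise_energy):
    """Harmonics-to-noise ratio in dB."""
    return 10.0 * np.log10(periodic_energy / noise_energy)

# ~99% periodic energy vs ~1% noise gives roughly 20 dB
print(hnr_db(0.99, 0.01))  # ≈ 19.96
```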
What is the cepstrum of a speech signal?
1) c[n] = IDFT{ log |DFT{s[n]}| } (equ. 17.10)
2) cepstrum of sound source:
- voiced: pulses around n = 0, T, 2T, . . ., where T is the pitch period (reciprocal of the fundamental frequency, or pitch)
- unvoiced: no distinct features
3) Cepstrum of vocal tract IR:
- non-negligible only over a small number of samples
- i.e., fewer samples than in one pitch period
- a low-time cepstrum window will extract it
4) Cepstral coefficients (CC) separate the source signal from the vocal tract IR, e.g., use the low-time coefficients to analyse the vocal tract (see the sketch below)
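A minimal sketch (assuming NumPy) of the real cepstrum and a low-time lifter; the cutoff should be shorter than the pitch period in samples:
```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum of the frame."""
    spectrum = np.fft.fft(frame)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))

def low_time_lifter(cepstrum, cutoff):
    """Keep only the first `cutoff` cepstral samples (vocal tract IR);
    samples near the pitch period and beyond belong to the source."""
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff] = cepstrum[:cutoff]
    return liftered
```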
Discuss the usefulness of audio features for emotion recognition from speech.
1) There is no single best feature set or most efficient recognition model ⇒ data-dependent
2) Select set of relevant features:
- prosodic features, e.g., pitch, energy, speaking rate
- lexical, disfluency, & discourse cues
- acoustic features, e.g., zero-crossing, spectrum, formants, MFCCs
3) Data-dependent ⇒
- use as many features as possible
- optimise choice of features with the recognition algorithm
What are the characteristics of intonation groups of a pitch contour that are selected for feature extraction for emotion recognition?
1) complete IGs with largest pitch range or duration
2) monotonically decreasing or increasing IGs with largest pitch range or duration
3) monotonically decreasing or increasing IGs at start or end of a sentence
What are the three types of intonation groups that are labelled in Figure 2? Which are selected for feature extraction?
1) Type 1: a complete pitch segment that starts from the point of a pitch rise to the point of the next pitch rise
2) Type 2: A monotonically decreasing pitch segment
3) Type 3: A monotonically increasing pitch segment
4) Types 2 & 3 are selected for feature extraction
What is meant by jitter and shimmer in a speech signal?
1) Jitter:
- cycle-to-cycle variations of pitch period
- affected by lack of control of vibration of the vocal cords
- patients with pathologies exhibit higher jitter values
- 2 measures: absolute and relative
2) Shimmer:
- cycle-to-cycle variations of the amplitude of the pitch period
- changes with reduction of glottal resistance & mass lesions on the vocal cords
- patients with pathologies exhibit higher shimmer values
- 2 measures: absolute and relative
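A minimal sketch of the two measures, assuming the per-cycle pitch periods and peak amplitudes have already been extracted (exact formulations vary; shimmer is also often quoted in dB):
```python
import numpy as np

def jitter(periods):
    """Absolute (seconds) and relative jitter from consecutive pitch periods."""
    periods = np.asarray(periods, dtype=float)
    absolute = np.mean(np.abs(np.diff(periods)))
    return absolute, absolute / np.mean(periods)

def shimmer(amplitudes):
    """Absolute and relative shimmer from per-cycle peak amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    absolute = np.mean(np.abs(np.diff(amplitudes)))
    return absolute, absolute / np.mean(amplitudes)
```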
Motivation for IG feature extraction:
1) emotional state can be characterised by many speech features
2) trembling speech, unvoiced speech, speech duration & hesitation are useful characteristics for emotion detection
What articulatory mechanisms are involved in emotionally expressive speech?
1) emotional speech is associated with more peripheral articulatory motions than Neutral speech
2) tongue tip (TT), jaw & lip positioning are more advanced (extreme) in emotional speech than in Neutral speech
3) classification recall using articulatory features is higher than that using acoustic features
4) ⇒ articulatory features carry valuable emotion-dependent information
Methods for collecting articulatory data
ultrasound, x-ray microbeam, electromagnetic articulography (EMA), & magnetic resonance imaging (MRI)
How does EMA work?
1) measures position of parts of the mouth
2) uses sensor coils placed on tongue & other parts of mouth
3) induction coils around the head produce an electromagnetic field that induces a current in the sensors in the mouth
4) the induced current is inversely proportional to the cube of the distance ⇒ gives the sensor coil's location
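A toy illustration of the inverse-cube relation, with a hypothetical calibration constant k:
```python
def sensor_distance(induced_current, k=1.0):
    """Distance of a sensor coil from a transmitter coil, assuming
    the induced current falls off as I = k / d**3 (k is a setup-specific
    calibration constant, here hypothetical)."""
    return (k / induced_current) ** (1.0 / 3.0)
```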
What are the current challenges with emotion recognition via speech?
1) Variability in speakers:
- Inter-speaker variability: heterogeneous display of emotion & differences in individuals' vocal tract structures
- Intra-speaker variability: a speaker can express an emotion in a number of ways & can be influenced by the context
2) Variant nature of controls of speech production components
State the three types of acoustic low-level descriptors for emotion recognition, and their characteristics.
1) Prosody - related to rhythm, stress & intonation of speech, and include fundamental frequency (fo) or pitch, short-term energy, speech rate
2) Spectral characteristics - related to harmonic/resonant structures, & include Mel-frequency cepstral coefficients (MFCCs) & Mel-filter bank energy coefficients (MFBs)
3) Voice quality-related measures - related to characteristics of vocal cord vibrations, & include jitter, shimmer & harmonics-to-noise ratio
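An illustrative extraction of a few such LLDs, assuming the librosa library and a hypothetical file name:
```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical utterance

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral LLDs
d_mfcc = librosa.feature.delta(mfcc)                 # ΔMFCCs
rms = librosa.feature.rms(y=y)                       # short-term energy (prosodic)
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)        # fundamental frequency contour
```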
Processing steps for acoustic LLDs
1) Compute various statistical functionals (e.g., mean, standard deviation) on these LLDs at different time-scale granularities (e.g., at 0.1, 0.5, 1, & 10 sec, etc.)
2) Compute the dynamics at multilevel time scales using statistical functional operators
3) Select features to reduce dimensionality, using a stand-alone method or mutual information
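A minimal sketch of step 1: mean and standard-deviation functionals of one frame-level LLD over a chosen window length (names and window lengths are illustrative):
```python
import numpy as np

def functionals(lld, frame_rate, window_sec):
    """One (mean, std) pair per non-overlapping window of `window_sec` seconds."""
    hop = max(1, int(window_sec * frame_rate))
    feats = []
    for start in range(0, len(lld) - hop + 1, hop):
        chunk = lld[start:start + hop]
        feats.append([np.mean(chunk), np.std(chunk)])
    return np.array(feats)

# e.g., functionals(f0_contour, frame_rate=100, window_sec=0.5)
```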
Affect recognition and modelling using speech requires
1) annotation scheme of emotion labels
2) feature normalisation technique
3) context-aware machine learning frameworks
Emotion labelling: self-reports vs perceived ratings
1) self-reports - ask subjects to recall how they have felt during a particular interaction
2) perceived rating - ask trained observers to assign emotion labels as they watch a given audio-video recording
What are the three approaches to global normalisation of acoustic features? Why are they not always effective?
1) z-normalisation - transforms features by subtracting their mean & dividing by their standard deviation ⇒ each feature has zero mean & unit variance across all data
2) min-max - scales each feature to a predefined range
3) nonlinear normalisation - converts features' distributions into normal distributions
4) not always effective, as a single normalisation scheme across the entire corpus can adversely affect the emotional discrimination of the feature
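Minimal sketches of the first two schemes, applied per feature over a (samples × features) matrix:
```python
import numpy as np

def z_normalise(x):
    """Zero mean, unit variance per feature (column)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-12)

def min_max(x, lo=0.0, hi=1.0):
    """Scale each feature to the predefined range [lo, hi]."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return lo + (x - x_min) * (hi - lo) / (x_max - x_min + 1e-12)
```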
With the aid of a diagram, outline the operation of the Iterative Feature Normalisation.
1) Normalise features by estimating the parameters of an affine transformation (e.g., z-normalisation) using only neutral (non-emotional) samples
2) Iterative because neutral samples may not be available for every individual
3) Front-end scheme estimates the neutral subset of the data iteratively
4) ⇒ this neutral subset is used to estimate the normalisation parameters
5) Robust against different recording conditions
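A rough sketch of the iterative front end, where `detect_neutral` stands in for whatever neutral-vs-emotional detector is used (hypothetical interface returning a boolean mask):
```python
import numpy as np

def iterative_feature_normalisation(features, detect_neutral, max_iter=10):
    """Estimate the neutral subset iteratively and fit z-normalisation
    parameters (mean, std) on that subset only."""
    neutral = np.ones(len(features), dtype=bool)        # start: assume all neutral
    for _ in range(max_iter):
        mu = features[neutral].mean(axis=0)
        sigma = features[neutral].std(axis=0) + 1e-12
        normalised = (features - mu) / sigma
        new_neutral = detect_neutral(normalised)
        if np.array_equal(new_neutral, neutral):        # labels stable -> stop
            break
        neutral = new_neutral
    return normalised
```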
Motivation for static emotion recognition for single utterance via hierarchical tree
1) to map an utterance to a predefined set of categorical emotion classes given acoustic features
2) appraisal theory of emotion (i.e., emotion is theorised to arise in stages of appraisal of a stimulus)
3) ⇒ the clear perceptual differences in the emotion information in the acoustic features are processed first, at the root of the tree, & highly ambiguous emotions are recognised at the leaves
Hierarchical tree for static emotion recognition for a single utterance
1) Levels in the tree solve the easiest classification tasks first, which mitigates error propagation
2) Each node of the tree is a binary classifier
3) Leaves of the tree identify the most ambiguous emotion class, which is often the neutral class
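An illustrative two-level tree with scikit-learn-style classifiers; the root split and class names are assumptions, not the published tree:
```python
def classify_utterance(x, clf_root, clf_high, clf_low):
    """Root solves the easiest binary task first; each child resolves the
    remaining, more ambiguous classes (labels here are illustrative)."""
    if clf_root.predict([x])[0] == "high_arousal":
        return clf_high.predict([x])[0]   # e.g., angry vs happy
    return clf_low.predict([x])[0]        # e.g., sad vs neutral
```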
What is recall?
Recall indicates what proportion of actual positives (i.e., the correct emotion class) is identified correctly. Calculated as TP/(TP+FN).
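For example:
```python
def recall(tp, fn):
    """Proportion of actual positives identified correctly."""
    return tp / (tp + fn)

print(recall(tp=40, fn=10))  # 0.8
```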
How could the physiological theory of emotion be exploited in affective computing?
1) Physical response is primary to (precedes) the feeling of an emotion
2) Stimulus ⇒ activity in autonomic nervous system (ANS) ⇒ emotional response in brain
3) Specific patterns of response correspond to specific emotions:
- anger ⇒ increased blood flow to the hands, increased heart rate, snarling, & increased involuntary nervous system arousal
- fear ⇒ high arousal state, decreased voluntary muscle activity, greater involuntary muscular contractions, decreased circulation in the peripheral blood vessels
What bodily changes have been exploited for emotion recognition?
All bodily changes that can be sensed from the surface of the skin & reflect ANS activity:
1) cardiac activity (heart rate, heart-rate variability, & blood volume pulse)
2) galvanic skin response (skin conductivity)
3) surface electromyography (EMG)
4) respiration through expansion of chest cavity