Week 8 Flashcards

1
Q

State the audio features that have been used for emotion recognition

A

1) pitch, intensity & duration of speech
2) spectral energy distribution
3) Mel Frequency Cepstral Coefficients (MFCCs) & ∆MFCCs
4) average Zero Crossings Density (ZCD)
5) filter-bank energy (FBE) parameters

2
Q

Given a speech signal s, how can its pitch and intensity (or loudness) be measured?

A

1) Pitch via the DFT (equ. 17.1)

2) Intensity via the syllable peak (equ. 17.3); both are sketched below
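The equation references are to the course text and are not reproduced on the card; the following is a minimal Python sketch of the same two ideas, assuming pitch is taken as the dominant DFT peak in a plausible voiced band and intensity as the frame's RMS energy:

import numpy as np

def pitch_via_dft(frame, fs):
    # Dominant spectral peak in a plausible voice band (80-400 Hz, assumed)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= 80) & (freqs <= 400)
    return freqs[band][np.argmax(spectrum[band])]

def intensity(frame):
    # Short-term intensity as the root-mean-square energy of the frame
    return float(np.sqrt(np.mean(frame ** 2)))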

3
Q

What is meant by harmonics-to-noise ratio?

A
- represents the roughness of the voice (ratio of periodic/harmonic energy to noise energy)

- ≈20 dB for young speakers (i.e., 99% of the signal energy is periodic & the rest is noise; see the arithmetic below)
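A sketch of the arithmetic behind the 20 dB figure (the formula itself is not on the card): HNR = 10·log10(E_periodic / E_noise), so 20 dB ⇒ E_periodic/E_noise = 100 ⇒ the periodic part carries 100/101 ≈ 99% of the total energy.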

4
Q

What is the cepstrum of a speech signal?

A

1) equ. 17.10
2) cepstrum of the sound source:
- voiced: pulses around n = 0, T, 2T, . . ., where T is the pitch period (the reciprocal of the fundamental frequency, or pitch)
- unvoiced: no distinct features
3) cepstrum of the vocal tract IR:
- non-negligible only for a small number of samples
- fewer samples than in a pitch period
- a low-time cepstrum window will extract it
4) Cepstrum Coefficients (CC) separate the source signal from the vocal tract IR, e.g., use the low coefficients to analyse the vocal tract (sketched below)
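A minimal Python sketch of the cepstrum computation and the low-time window described above (equ. 17.10 itself is not reproduced here; the cutoff of 30 samples is an illustrative assumption):

import numpy as np

def real_cepstrum(frame):
    # c[n] = IDFT( log |DFT(s)| )
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # guard against log(0)
    return np.real(np.fft.ifft(log_mag))

def low_time_window(cep, cutoff=30):
    # Keep only the low-quefrency samples (fewer than a pitch period),
    # which capture the vocal tract IR; pitch pulses live at n = 0, T, 2T, ...
    out = np.zeros_like(cep)
    out[:cutoff] = cep[:cutoff]
    out[-(cutoff - 1):] = cep[-(cutoff - 1):]  # mirror half of the symmetric cepstrum
    return out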

5
Q

Discuss the usefulness of audio features for emotion recognition from speech.

A

1) there is no single best feature set or most efficient recognition model ⇒ performance is data-dependent
2) select a set of relevant features:
- prosodic features, e.g., pitch, energy, speaking rate
- lexical, disfluency, & discourse cues
- acoustic features, e.g., zero-crossings, spectrum, formants, MFCCs
3) because performance is data-dependent ⇒
- use as many features as possible
- optimise the choice of features together with the recognition algorithm

6
Q

What are the characteristics of intonation groups of a pitch contour that are selected for feature extraction for emotion recognition?

A

1) complete IGs with the largest pitch range or duration
2) monotonically decreasing or increasing IGs with the largest pitch range or duration
3) monotonically decreasing or increasing IGs at the start or end of a sentence

7
Q

What are the three types of intonation groups that are labelled in Figure 2? Which are selected for feature extraction?

A

1) Type 1: a complete pitch segment from the point of one pitch rise to the point of the next pitch rise
2) Type 2: a monotonically decreasing pitch segment
3) Type 3: a monotonically increasing pitch segment
4) Types 2 & 3 are selected for feature extraction

8
Q

What is meant by jitter and shimmer in a speech signal?

A

1) Jitter:
- cycle-to-cycle variation of the pitch period
- affected by lack of control of the vibration of the vocal cords
- patients with pathologies show higher jitter values
- 2 measures: absolute & relative
2) Shimmer:
- cycle-to-cycle variation of the amplitude of the pitch period
- changes with reduction of glottal resistance & with mass lesions on the vocal cords
- patients with pathologies show higher shimmer values
- 2 measures: absolute & relative (the relative measures are sketched below)
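A minimal Python sketch of the two relative measures, assuming the per-cycle pitch periods and peak amplitudes have already been extracted (the absolute variants simply drop the division by the mean):

import numpy as np

def relative_jitter(periods):
    # Mean absolute cycle-to-cycle change in pitch period, / mean period
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def relative_shimmer(amplitudes):
    # Same measure applied to the per-cycle peak amplitudes
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)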

9
Q

Motivation for IG feature extraction:

A

1) an emotional state can be characterised by many speech features
2) trembling speech, unvoiced speech, speech duration & hesitation are useful characteristics for emotion detection

10
Q

What articulatory mechanisms are involved in emotionally expressive speech?

A

1) emotional speech is associated with more peripheral articulatory motions than neutral speech
2) tongue tip (TT), jaw & lip positioning are more advanced (extreme) in emotional speech than in neutral speech
3) classification recall using articulatory features is higher than using acoustic features
4) ⇒ articulatory features carry valuable emotion-dependent information

11
Q

Methods for collecting articulatory data

A

ultrasound, x-ray microbeam, electromagnetic articulography (EMA), & magnetic resonance imaging (MRI)

12
Q

How does EMA work?

A

1) measures the position of parts of the mouth
2) uses sensor coils placed on the tongue & other parts of the mouth
3) induction coils around the head produce an electromagnetic field that induces a current in the sensors in the mouth
4) the induced current is inversely proportional to the cube of the distance ⇒ gives the sensor coil's location (see the geometry sketch below)
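Sketch of the geometry (the calibration constant k is assumed, not given on the card): if the induced current is I = k / d³ for sensor-transmitter distance d, then d = (k / I)^(1/3); combining the distances to several transmitter coils locates the sensor.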

13
Q

What are the current challenges with emotion recognition via speech?

A

1) Variability in speakers:
- inter-speaker variability: heterogeneous display of emotion & differences in individuals' vocal tract structures
- intra-speaker variability: a speaker can express an emotion in a number of ways & can be influenced by the context
2) The variant nature of the controls of the speech-production components

14
Q

State the three types of acoustic low-level descriptors for emotion recognition, and their characteristics.

A
1) Prosody - related to the rhythm, stress & intonation of speech; includes fundamental frequency (f0, or pitch), short-term energy, & speech rate

2) Spectral characteristics - related to harmonic/resonant structures; include Mel-frequency cepstral coefficients (MFCCs) & Mel-filter-bank energy coefficients (MFBs)

3) Voice-quality-related measures - related to the characteristics of vocal-cord vibration; include jitter, shimmer, & the harmonics-to-noise ratio

15
Q

Processing steps for acoustic LLDs

A

1) Compute various statistical functionals (e.g., mean, standard deviation) on the LLDs at different time-scale granularities (e.g., at 0.1, 0.5, 1, & 10 sec); see the sketch below

2) Compute the dynamics at multi-level time scales using statistical functional operators

3) Select features to reduce dimensionality, using a stand-alone method or mutual information
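A minimal Python sketch of step 1, assuming a 1-D LLD track sampled at a fixed frame rate and non-overlapping windows:

import numpy as np

def functionals(lld, frame_rate, scales=(0.1, 0.5, 1.0, 10.0)):
    # Mean & standard deviation of the LLD over windows at each time scale (sec)
    feats = {}
    for scale in scales:
        win = max(1, int(round(scale * frame_rate)))  # frames per window
        n = len(lld) // win
        if n == 0:
            continue  # track shorter than this scale
        blocks = np.asarray(lld[:n * win]).reshape(n, win)
        feats[scale] = (blocks.mean(axis=1), blocks.std(axis=1))
    return feats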

16
Q

Affect recognition and modelling using speech requires

A

1) an annotation scheme for emotion labels
2) a feature-normalisation technique
3) context-aware machine-learning frameworks

17
Q

Emotion labelling, self-reports vs perceived rating

A

1) self-reports - ask subjects to recall how they felt during a particular interaction

2) perceived rating - ask trained observers to assign emotion labels as they watch a given audio-visual recording

18
Q

What are the three approaches to global normalisation of acoustic features? Why are they not always effective?

A

1) z-normalisation - transforms features by subtracting their mean & dividing by their standard deviation ⇒ each feature has zero mean & unit variance across all data
2) min-max - scales each feature to a predefined range
3) nonlinear normalisation - converts the features' distributions into normal distributions (all three are sketched below)

4) not always effective because a single normalisation scheme applied across the entire corpus can adversely affect the emotional discrimination of the features
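Minimal Python sketches of the three schemes (the rank-based Gaussianisation is one common choice of nonlinear normalisation, assumed here):

import numpy as np
from scipy.stats import norm, rankdata

def z_normalise(x):
    # Zero mean & unit variance across all data
    return (x - x.mean()) / x.std()

def min_max(x, lo=0.0, hi=1.0):
    # Scale into the predefined range [lo, hi]
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())

def gaussianise(x):
    # Map empirical ranks through the inverse normal CDF -> approx. N(0, 1)
    return norm.ppf(rankdata(x) / (len(x) + 1))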

19
Q

With the aid of a diagram, outline the operation of the Iterative Feature Normalisation.

A

1) Normalise the features by estimating the parameters of an affine transformation (e.g., z-normalisation) using only neutral (non-emotional) samples

2) Iterative because labelled neutral samples may not be available for every individual
3) A front-end scheme estimates the neutral subset of the data iteratively
4) ⇒ this subset is used to estimate the normalisation parameters (see the sketch below)
5) Robust against different recording conditions
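In place of the diagram, a minimal Python sketch of the loop; detect_neutral stands for any neutral-vs-emotional classifier returning a boolean mask and is hypothetical:

import numpy as np

def iterative_feature_normalisation(feats, detect_neutral, n_iter=10):
    # feats: (n_samples, n_features) array for one speaker / recording condition
    neutral = np.ones(len(feats), dtype=bool)   # initially treat all samples as neutral
    normed = feats
    for _ in range(n_iter):
        mu = feats[neutral].mean(axis=0)         # affine (z) parameters from
        sd = feats[neutral].std(axis=0)          # the current neutral subset only
        normed = (feats - mu) / sd
        new_neutral = detect_neutral(normed)     # re-estimate the neutral subset
        if np.array_equal(new_neutral, neutral):
            break                                # labels stable -> converged
        neutral = new_neutral
    return normed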

20
Q

Motivation for static emotion recognition for a single utterance via a hierarchical tree

A

1) to map an utterance to a predefined set of categorical emotion classes, given its acoustic features
2) appraisal theory of emotion (i.e., an emotion is theorised to arise through staged appraisals of a stimulus)
3) ⇒ first process the clear perceptual differences in the emotion information of the acoustic features at the root of the tree; highly ambiguous emotions are recognised at the leaves of the tree

21
Q

Hierarchical tree for static emotion recognition for a single utterance

A

1) the levels of the tree solve the easiest classification tasks first, which mitigates error propagation
2) each node of the tree is a binary classifier
3) the leaves of the tree identify the most ambiguous emotion class, which is often the neutral class

22
Q

What is recall?

A

Recall indicates what proportion of actual positives (i.e., the correct emotion class) is identified correctly. Calculated as TP/(TP + FN); a sketch follows.
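As a sketch with made-up numbers:

def recall(tp, fn):
    # Proportion of actual positives identified correctly
    return tp / (tp + fn)

# e.g., 45 utterances of an emotion detected, 5 missed: recall = 45 / 50 = 0.9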

23
Q

How could the physiological theory of emotion be exploited in affective computing?

A

1) the physical response is primary to (precedes) the feeling of an emotion
2) stimulus ⇒ activity in the autonomic nervous system (ANS) ⇒ emotional response in the brain
3) specific patterns of response correspond to specific emotions:
- anger ⇒ increased blood flow to the hands, increased heart rate, snarling, increased involuntary nervous-system arousal
- fear ⇒ a high-arousal state, decreased voluntary muscle activity, greater involuntary muscular contractions, decreased circulation in the peripheral blood vessels

24
Q

What bodily changes have been exploited for emotion recognition?

A

All bodily changes that can be sensed from the surface of the skin & reflect ANS activity:
1) cardiac activity (heart rate, heart-rate variability, & blood volume pulse)
2) galvanic skin response (skin conductivity)
3) surface electromyography (EMG)
4) respiration, through expansion of the chest cavity

25
Q

Explain ECG

A

1) the ventricles depolarise, pumping blood to the rest of the body, while the atria polarise, expanding in volume & receiving blood from the body ⇒ the ventricles then polarise, & the process repeats
2) the electrical changes due to the polarisation of the chambers are detectable (the ECG)
3) inflection points P, Q, R, S, T:
- P: atrial depolarisation
- QRS: ventricular depolarisation
- T: ventricular repolarisation

26
Q

Drawbacks of ECG

A
Requires:
1) contact of an electrode adhesive patch with the subject's skin
2) excess hair removed
3) skin cleaned at the adhesion sites

27
Q

Outline the operation of a wrist pulse oximeter. How can its output be exploited for emotion recognition?

A

1) the device emits light
2) the amount of blood in the vessel is measured by the amount of light reflected by the vessel over time
3) more blood ⇒ a higher reflectance reading
4) vasoconstriction (can be measured if the subject is stationary):
- a defensive reaction in which the peripheral blood vessels constrict
- ⇒ increases in response to pain, hunger, fear & rage
- ⇒ decreases in response to quiet relaxation

28
Q

What is a photoplethysmograph (PPG) sensor?

A

1) Measures the blood volume pulse:
- every heartbeat pumps blood through the blood vessels
- most pronounced in the peripheral vessels, e.g., the fingers & earlobe

29
Q

PPG sensor placement and reading

A

1) Place sensor where the capillaries are close to the surface of the skin, e.g., finger
2) No gels or adhesives
3) Reading is very sensitive to variations in placement & to motion artefacts

30
Q

How do the SNS (sympathetic) and PNS (parasympathetic) affect heart rate?

A

1) SNS accelerates heart rate - related to stress or activation
2) PNS decelerates heart rate - responsible for relaxation, or rest & healing

31
Q

HRV metrics

A

1) standard deviation of the time between successive heartbeats within a certain window (i.e., the recording epoch); best for short time windows
2) difference between the maximum & minimum normal R-R interval lengths within the window
3) percentage of differences between successive normal R-R intervals that exceed 50 ms (pNN50)
4) root mean square of successive differences (RMSSD); the first four metrics are sketched below
5) sympathovagal balance: the ratio of (influence by the PNS) to (influence by the SNS) on HRV; the PNS modulates the heart rate at frequencies between 0.04 & 0.5 Hz, while the SNS modulates the heart rate with significant gain only below 0.1 Hz
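A minimal Python sketch of the first four metrics, assuming a sequence of normal R-R intervals in milliseconds has already been extracted:

import numpy as np

def hrv_metrics(rr_ms):
    rr = np.asarray(rr_ms, dtype=float)
    d = np.diff(rr)                                   # successive differences
    return {
        "sdnn": rr.std(),                             # std of intervals in the epoch
        "range": rr.max() - rr.min(),                 # max - min interval length
        "pnn50": 100.0 * np.mean(np.abs(d) > 50.0),   # % successive diffs > 50 ms
        "rmssd": np.sqrt(np.mean(d ** 2)),            # root mean square successive diff
    }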

32
Q

Non-emotion factors affecting HR and HRV

A

1) Age: age ↑ ⇒ HRV ↓; e.g., infants have a high level of sympathetic activity, which decreases between ages 5 & 10
2) Level of physical conditioning; in congestive heart failure HRV ⇒ zero (the heart beats like a metronome); also subjects with a pacemaker or taking medication
3) Physical activity, talking & posture (sitting vs standing vs lying down)
4) Breathing frequency
5) Circadian cycle (the 24-hour cycle in physiological processes)

33
Q

What is Electroencephalogram (EEG)? How can it be used for emotion recognition?

A

1) Measures the electrical activity of the brain via electrodes placed on the surface of the head
2) A full EEG distinguishes between positive & negative emotional valence, & between different arousal levels
3) Determines the orienting response by detecting alpha blocking: the alpha waves (8 to 13 Hz) become extinguished & the beta waves (14 to 26 Hz) become dominant when the person experiences a startling event (see the sketch below)
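A minimal Python sketch of an alpha-blocking check via band powers (Welch's method; the decision threshold on the beta/alpha ratio would be application-specific and is not given on the card):

import numpy as np
from scipy.signal import welch

def beta_alpha_ratio(eeg, fs):
    # Power spectral density, then band power in alpha (8-13 Hz) & beta (14-26 Hz)
    f, pxx = welch(eeg, fs=fs, nperseg=int(2 * fs))
    alpha = np.trapz(pxx[(f >= 8) & (f <= 13)], f[(f >= 8) & (f <= 13)])
    beta = np.trapz(pxx[(f >= 14) & (f <= 26)], f[(f >= 14) & (f <= 26)])
    return beta / alpha   # a rise suggests alpha blocking (orienting response)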

34
Q

In affective computing, state how the electromyogram (EMG) aids emotion recognition

A

1) Measures muscle activity by detecting, via 3 electrodes, the surface voltages produced when a muscle contracts
2) on facial muscles, to study facial expression
3) on the body, to study affective gestures
4) on both: emotional valence & emotional arousal
5) needs adhesives & gels
6) the signal is low-pass filtered ⇒ aggregate muscle activity, & is sampled at 10 to 20 Hz

35
Q

In affective computing, state how blood pressure (sphygmomanometry) aids emotion recognition

A

1) Correlates with increases in emotional stress & with the repression of emotional responses
2) Difficult to measure continuously
3) Requires constricting a blood vessel ⇒ discomfort

36
Q

In affective computing, state how respiration aids emotion recognition

A

1) Most accurately recorded by measuring the gas exchange of the lungs (but cumbersome)
2) Measured by a strap sensor that incorporates a strain gauge, a Hall-effect sensor, or a capacitance sensor to measure the expansion of the chest
3) Physical activity & emotional arousal ⇒ faster & deeper respiration
4) Peaceful rest & relaxation ⇒ slower & shallower respiration
5) Sudden, intense, or startling stimuli ⇒ momentary cessation of respiration
6) Negative emotions ⇒ irregularity in respiration patterns

37
Q

Skin conductance or galvanic skin response or electrodermal activity (EDA)

A

1) Indirectly measures the amount of sweat in a person's sweat glands: the skin is an insulator & its conductivity changes in response to ionic sweat filling the sweat glands
2) Sweat-gland activity is an indicator of sympathetic activation, & GSR is a robust, non-invasive way to measure this activation
3) GSR is a component of the lie detector; it measures the emotional stress associated with lying
4) Differentiates between states, e.g., anger & fear; conflict & no conflict
5) Measures stress in anticipatory anxiety & during task performance

38
Q

Where are the most emotionally reactive sweat glands?

A

1) palms of hands & soles of feet

2) place electrodes on lower segment of middle & index fingers of dominant hand

39
Q

Features of skin conductance/GSR/EDA

A

1) Mean conductivity level, variance, slope, max. & min. levels (sketched below)
2) Orienting response: amplitude, latency, rise time, half-recovery time, habituation
3) The skin conductance response is neither linear nor time-invariant; baseline drift & conductance changes due to increased or decreased contact between the electrodes & the skin ⇒ introduce confounding factors into interpreting GSR & into pooling features from different time periods (e.g., morning vs evening)
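A minimal Python sketch of the level features in item 1 (the orienting-response features in item 2 require event detection, which is omitted here):

import numpy as np

def gsr_level_features(sc, fs):
    # sc: skin-conductance samples (e.g., in microsiemens); fs: sample rate in Hz
    t = np.arange(len(sc)) / fs
    slope = np.polyfit(t, sc, 1)[0]      # least-squares linear trend
    return {"mean": np.mean(sc), "var": np.var(sc), "slope": slope,
            "max": np.max(sc), "min": np.min(sc)}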