Short Questions Flashcards
With reference to speech production, what are formants? Why do they change as different phonemes are produced?
- Formants are the resonant frequencies of the vocal tract, seen as peaks in the spectral envelope of the speech signal.
- The position of the mouth and tongue and the shape of the vocal tract determine the frequencies at which the formants occur, so they shift as the articulators move from one phoneme to the next.
For example, the corner vowels /i/, /u/, /æ/ and /a/ are produced with the tongue at its most extreme positions, so they mark the extremes of the formant frequencies.
In a Hidden Markov Model, what is the ‘hidden’ element in the model?
The ‘hidden’ element is the sequence of states the model passes through. The states themselves are not observed directly; only the observations they emit are visible.
How is the cepstrum extracted from speech and why is a mel scale preferred in its calculation?
The cepstrum is extracted by the following steps:
- windowing
- |DFT(x[n])| -> shows harmonics modulated by resonances
- log -> sum of harmonic ‘comb’ and resonant bumps
- IDFT
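The mel scale is preferred because it approximates how the human ear perceives pitch: roughly linear below 1 kHz and logarithmic above, giving finer resolution at the low frequencies that matter most for speech.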
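The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `real_cepstrum`, the Hamming window and the synthetic test frame are assumptions made for the example.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one frame: window -> |DFT| -> log -> IDFT."""
    windowed = frame * np.hamming(len(frame))      # windowing (Hamming is an assumption)
    magnitude = np.abs(np.fft.fft(windowed))       # |DFT(x[n])|
    log_magnitude = np.log(magnitude + 1e-10)      # small offset avoids log(0)
    return np.real(np.fft.ifft(log_magnitude))     # IDFT back to the quefrency domain

# Hypothetical test frame: 30 ms of a harmonic signal at fs = 8 kHz
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
frame = sum(np.sin(2 * np.pi * 100 * k * t) / k for k in range(1, 20))
cepstrum = real_cepstrum(frame)
```

The low-quefrency coefficients describe the smooth spectral envelope (the resonant bumps), while the harmonic comb shows up at higher quefrencies.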
Give one advantage of End-to-End models for speech recognition over the use of HMM systems
- They learn more directly from data and can capture greater variability in speech, whereas HMM systems struggle to capture the same level of variability.
Why is the use of short-time magnitude measurements sometimes preferred over short-time energy measurements in voiced/unvoiced speech analysis?
The squaring in short-time energy makes it sensitive to large amplitude variations: the x^2[n] term exaggerates high-level samples. Short-time magnitude uses |x[n]| instead, so it is less dominated by large signals.
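A small sketch of the difference, assuming rectangular non-overlapping frames and a made-up signal whose second half is ten times louder than its first half. The function names are illustrative only.

```python
import numpy as np

def short_time_energy(x, frame_len, hop):
    """Sum of x^2[n] over each frame."""
    return np.array([np.sum(x[i:i + frame_len] ** 2)
                     for i in range(0, len(x) - frame_len + 1, hop)])

def short_time_magnitude(x, frame_len, hop):
    """Sum of |x[n]| over each frame."""
    return np.array([np.sum(np.abs(x[i:i + frame_len]))
                     for i in range(0, len(x) - frame_len + 1, hop)])

# Hypothetical signal: the same waveform at amplitude 0.1, then at amplitude 1.0
base = np.sin(2 * np.pi * 0.05 * np.arange(100))
x = np.concatenate([0.1 * base, 1.0 * base])

energy = short_time_energy(x, frame_len=100, hop=100)
magnitude = short_time_magnitude(x, frame_len=100, hop=100)
energy_ratio = energy[1] / energy[0]            # about 100: the square amplifies the jump
magnitude_ratio = magnitude[1] / magnitude[0]   # about 10: follows the amplitude linearly
```

A tenfold change in amplitude becomes a hundredfold change in energy, which is why the magnitude measure has a smaller dynamic range.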
In using GMMs to model an individual emotion in an affective computing application, why is it possible to use a diagonal covariance matrix when using mel cepstrum as the feature type. Is this true of all features? Explain
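Mel cepstral coefficients are approximately decorrelated by the final transform (IDFT/DCT), so the off-diagonal covariance terms are close to zero and a diagonal covariance matrix is sufficient. This is not true of all features: correlated features, such as the log filterbank energies of neighbouring bands, would need a full covariance matrix (or a decorrelating transform first).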
Identify two physiological changes due to ageing that alter the human speech production system. Outline how their impact manifests itself in human speech.
- The vocal cords may lose muscle tone, flexibility and elasticity, making it harder to control pitch and voice quality.
- The lungs and overall respiratory system get weaker, reducing the breath support available for speech.
Together these have the effect of making the voice weaker.
Explain how zero crossing rate allows you to classify speech into voiced or unvoiced speech.
- Zero-crossing occurs, for a discrete time signal, when successive samples differ in sign.
- More crossings mean higher frequency. Unvoiced speech contains higher frequencies, so a high zero crossing rate indicates unvoiced speech and a low rate indicates voiced speech.
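A minimal sketch of the measure. The function name is an assumption, and the two sinusoids are only stand-ins for low-frequency (voiced-like) and high-frequency (unvoiced-like) content, not real speech.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of successive sample pairs that differ in sign."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

fs = 8000
t = np.arange(240) / fs                               # one 30 ms frame
voiced_like = np.sin(2 * np.pi * 100 * t + 0.1)       # low-frequency stand-in
unvoiced_like = np.sin(2 * np.pi * 3000 * t + 0.1)    # high-frequency stand-in

zcr_low = zero_crossing_rate(voiced_like)
zcr_high = zero_crossing_rate(unvoiced_like)
```

Comparing each frame's rate against a threshold then gives a simple voiced/unvoiced decision; the threshold itself has to be chosen for the data.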
Outline the approximate spacing of mel scale and explain how it is used for speech analysis.
- The mel scale is based on perception of pitch in humans.
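- The spacing is approximately linear below 1 kHz and logarithmic above 1 kHz.
- In speech analysis it sets the centre frequencies and bandwidths of the filterbank applied to the spectrum, e.g. when computing mel-frequency cepstral coefficients.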
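The spacing can be illustrated with one commonly used Hz-to-mel formula, mel = 2595 log10(1 + f/700). This particular formula is an assumption here (other variants exist), as are the function names and the 0-4 kHz range.

```python
import numpy as np

def hz_to_mel(f_hz):
    """One common Hz -> mel mapping (an assumed variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Filter centre frequencies equally spaced in mel between 0 and 4 kHz
centres_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(4000.0), 12))
spacing_hz = np.diff(centres_hz)   # gaps in Hz widen as frequency rises
```

Equal steps in mel give narrow, closely spaced bands at low frequencies and progressively wider bands at high frequencies, which is the layout a mel filterbank uses.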
What is the effect of the log operation in extracting cepstrum from speech?
The log operation converts the magnitude spectrum, in which the excitation and vocal tract components are multiplied, into a sum of those components, so they can be separated.
In the frequency domain:
ec = excitation component
vtc = vocal tract component
|s(w)| = |ec(w)|.|vtc(w)| -> log|s(w)| = log|ec(w)| + log|vtc(w)|
What is the principle behind the EM algorithm in training a GMM?
An iterative method to find maximum-likelihood (ML) estimates of parameters in statistical models, where the model depends on unobserved latent variables.
- E step: compute the expectation of the log-likelihood using the current estimate of the parameters.
- M step: maximise the expected log-likelihood found in the E step to update the parameters, then repeat until convergence.
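A compact sketch of the two steps for a one-dimensional GMM. The function name, the quantile-based initialisation, the fixed iteration count and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=50):
    """Fit a 1-D GMM by EM; returns (weights, means, variances)."""
    # Simple initialisation (an assumption): spread the means over data quantiles
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, np.var(x))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E step: responsibility of each component for each sample
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
        weighted = weights * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

# Synthetic data: two Gaussian clusters centred at -3 and +3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
weights, means, variances = em_gmm_1d(x)
```

The latent variable here is which component generated each sample; the responsibilities are its posterior under the current parameters.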
Explain how short time energy allows you to classify speech into voiced or unvoiced speech. Comment on the accuracy of such an approach.
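- Plotting short-time energy against time, voiced and unvoiced segments can be told apart by amplitude.
- Voiced segments have higher energy, so they appear as larger peaks in the STE plot, and a threshold can separate the two.
- Accuracy is limited: the threshold depends on recording level and background noise, and weak voiced sounds can overlap with strong unvoiced ones, so STE is usually combined with other measures such as zero crossing rate.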
Explain, by using a suitable supporting example, whether 5ms is a suitable window size for the analysis of speech.
This depends on the task at hand. 5 ms is a short window, so each frame contains little data and the estimates are uncertain, especially as pitch and amplitude vary. For example, a voice with a pitch of 100 Hz has a pitch period of 10 ms, so a 5 ms window covers only half a period.
The window is therefore too short for pitch tracking or for resolving individual harmonics. A shorter window does increase temporal resolution, so it suits tasks that need it, such as a wideband spectrogram.
What is the effect of the log operation in extracting cepstrum from speech?
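The log operation turns the product of the vocal tract and excitation spectra into a sum, so the two components can be separated.
log|s(w)| = log|vct(w)| + log|e(w)|, where vct is the vocal tract component and e is the excitation component.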
Explain how short time energy allows you to classify speech into voiced or unvoiced speech. Comment on the accuracy of such an approach.
Voiced speech contains more energy, meaning that the amplitude of the STE has higher peaks in these segments of speech.
In HMM, explain what is the ‘hidden’ element in the model? What does it typically correspond to in a speech recognition system?
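The ‘hidden’ element in an HMM is the sequence of internal states, which is not observed directly. You only see the observations the states generate. In a speech recognition system the hidden states typically correspond to phonemes (or sub-phoneme units), and the observations are the acoustic feature vectors, e.g. MFCCs.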
Explain, by using a suitable supporting example, why 10-40ms is a suitable window size for the analysis of speech.
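- Speech is approximately stationary over 10-40 ms, so the vocal tract shape can be treated as fixed within one frame.
- The window spans several pitch periods (e.g. a 100 Hz voice has a 10 ms period, so 30 ms covers three periods), giving enough frequency resolution to emphasise spectral detail.
- The harmonic structure of the vocal fold vibration is seen as horizontal stripes in the spectrogram.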
Compare voiced and unvoiced speech in terms of short-time energy and zero crossing rate. Are these good features to accurately segment voiced speech from unvoiced speech?
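STE -> voiced has higher energy than unvoiced
ZC -> unvoiced has a high rate (energy mostly above about 1.5 kHz); voiced has a low rate (energy mostly below about 1.5 kHz)
They are useful first indicators, but the two classes overlap, so neither is accurate enough alone; they are usually combined with each other or with other features.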
With reference to typical feature extraction in speech, why is a multivariate GMM rather than a univariate GMM, generally employed to model the extracted features?
- Extracted features are vectors (e.g. a set of cepstral coefficients per frame), whereas a univariate GMM models only a single scalar feature.
- A multivariate GMM models the joint distribution of the whole feature vector, so more complex data can be represented.
- Modelling each feature separately with univariate GMMs assumes the features are independent, which is generally not the case. A multivariate GMM can capture the correlation through its covariance matrices, leading to better accuracy.
Would you use a diagonal or full covariance matrix with a GMM to model cepstrum? Why?
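- A diagonal covariance matrix would be more beneficial, as it is computationally lighter and has far fewer parameters to estimate (D variances rather than D(D+1)/2 covariance terms for D features).
- A diagonal matrix assumes the features are uncorrelated with each other, which holds approximately for cepstral coefficients, so little accuracy is lost.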
Outline the process to extract mel-frequency cepstrum features from speech. What information in the speech signal are you seeking to preserve in this process and why?
Extracting mel-frequency cepstrum:
- Windowing (Hamming)
- |DFT(s[n])| = |s(w)|
- Mel-scale filterbank applied to |s(w)|
- log: log|s(w)| = log|vct(w)| + log|e(w)|
- IDFT (commonly implemented as a DCT)
According to the source-filter theory, the resulting speech signal can be considered as the convolution of respective excitation sequence and vocal tract filter characteristics.
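You are trying to preserve the vocal tract filter characteristics by selecting the first 13 cepstral coefficients. These low-order coefficients describe the smooth spectral envelope, which carries the phonetic information, while the excitation (pitch) detail sits in the higher coefficients that are discarded.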
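A rough single-frame sketch of this pipeline in NumPy. It is illustrative only: the triangular filter shape, the 26 filters, the 512-point FFT, the Hz-to-mel formula and all function names are assumptions, and the frame is assumed to be no longer than the FFT size.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)    # one common mel formula (assumed)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with edges equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13, n_fft=512):
    windowed = frame * np.hamming(len(frame))                  # windowing
    magnitude = np.abs(np.fft.rfft(windowed, n_fft))           # |DFT|
    mel_energies = mel_filterbank(n_filters, n_fft, fs) @ magnitude
    log_mel = np.log(mel_energies + 1e-10)                     # log
    # DCT-II (unnormalised) in place of the IDFT; keep the first n_coeffs
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return basis @ log_mel

# Hypothetical input: one 30 ms frame of noise at fs = 8 kHz
rng = np.random.default_rng(0)
coeffs = mfcc_frame(rng.normal(size=240), fs=8000)
```

Keeping only the first 13 outputs is the step that retains the envelope and drops the fine harmonic detail.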
Explain, by using a suitable supporting example, whether 100 ms is a suitable window size for the analysis of speech.
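A 100 ms window would suit applications that require high frequency resolution, but it is generally too long for speech. Speech is only approximately stationary over a few tens of milliseconds, and many phonemes (e.g. the burst of a stop consonant) are shorter than 100 ms, so a single window would smear several sounds together.
Speech analysis windows usually fall between wideband (about 5 ms) and narrowband (about 20-40 ms).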
Give an outline of how the autocorrelation of speech can be used to derive the fundamental frequency.
- Divide the speech into short-time windows (20ms - 40ms)
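- Apply a Hamming window.
- Compute the autocorrelation.
- Find the lag of the largest peak, excluding lag zero and searching only the range of plausible pitch periods.
- Convert to frequency: F0 = sampling frequency / lag (in samples).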
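The outline above can be sketched for a single frame as follows. The function name, the 50-400 Hz search range and the synthetic 200 Hz test frame are assumptions for the example.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate F0 of one frame from the peak of its autocorrelation."""
    windowed = frame * np.hamming(len(frame))
    # Full autocorrelation; keep the non-negative lags only
    acf = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
    # Search only lags that correspond to plausible pitch periods
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(acf) - 1)
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / best_lag    # convert lag (samples) to frequency (Hz)

# Hypothetical voiced frame: 40 ms of a 200 Hz harmonic signal at fs = 16 kHz
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))
f0 = estimate_f0(frame, fs)    # expected to land close to 200 Hz
```

Restricting the lag range avoids picking the trivial peak at lag zero and reduces octave errors from peaks at multiples of the pitch period.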
What is the difference between a narrowband and wideband spectrogram? Explain what different information is found in each, and how to construct them.
Wideband:
1. Short-time window (5ms)
2. Emphasises temporal changes
3. Shows formant frequencies clearly as broad bands, with individual glottal pulses visible as vertical striations
Narrowband:
1. Longer time window (20ms - 40ms)
2. Emphasises frequency detail (good frequency resolution, poorer time resolution)
3. Harmonic structure of the vocal fold vibration seen as horizontal stripes.
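Both spectrograms can be constructed with the same short-time Fourier transform, changing only the window length. The sketch below assumes a Hamming window, a 2 ms hop, a 1024-point FFT and a synthetic harmonic test signal; the function name is illustrative.

```python
import numpy as np

def spectrogram(x, fs, window_ms, hop_ms=2.0, n_fft=1024):
    """Log-magnitude STFT; returns an array of shape (n_frames, n_fft // 2 + 1)."""
    frame_len = int(fs * window_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return 20 * np.log10(spectrum + 1e-10)   # magnitude in dB

# Hypothetical test signal: 0.5 s of harmonics of 120 Hz at fs = 16 kHz
fs = 16000
t = np.arange(fs // 2) / fs
x = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 30))

wideband = spectrogram(x, fs, window_ms=5)      # short window: fine time detail
narrowband = spectrogram(x, fs, window_ms=30)   # long window: resolves harmonics
```

Plotting each array with time on one axis and frequency on the other gives the two views: the 5 ms window blurs the harmonics into the envelope, while the 30 ms window separates them into horizontal stripes.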