04 - Speech Pattern Classification Flashcards
What is meant by linguistic information?
Information that is explicitly in or almost uniquely inferable from the written message.
What is meant by paralinguistic information?
Information that is not inferable from the written message, but is added by the speaker to complement the linguistic information. Examples are attitude, intonation and emphasis, as in “I am SOO excited!”
What is meant by nonlinguistic information?
Information about other factors such as age, gender, idiosyncrasy (personal traits), and physical and emotional state. In general, conditions that are not related to the linguistic content and cannot normally be controlled by the speaker. Example: a hoarse voice caused by a cold.
What does Speech Pattern Classification refer to?
Extracting information such as language, accent, etc. from a speech signal, i.e. taking a speech input and converting it into a sequence of class labels.
Speech classification systems are normally divided into 3 models. Which are they?
Acoustic model, language model, pronunciation model
What are the 2 main blocks of the speech pattern classification process?
Feature extraction and classification.
What is the local region of analysis?
This is the framing of the data: the signal is split into short frames and features are computed per frame.
What is the global region of analysis?
These are the functionals: statistics computed over the per-frame features, for example the mean, median, max, etc.
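The local and global regions of analysis can be sketched together. This is a minimal illustration, not the course's implementation: the signal is synthetic, and the frame length, hop size, and the helper names `frame_signal` and `functionals` are assumptions made for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Local region: split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def functionals(values):
    """Global region: summarise a per-frame feature track with statistics."""
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "max": float(np.max(values)),
        "min": float(np.min(values)),
    }

# Synthetic 1-second signal at 16 kHz (hypothetical values, illustration only)
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t) * np.linspace(0.1, 1.0, sr)

frames = frame_signal(x, frame_len=400, hop=160)       # 25 ms frames, 10 ms hop
log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)  # one value per frame
stats = functionals(log_energy)                         # one summary per utterance
```

The per-frame track `log_energy` is the local description; `stats` collapses it into a fixed-size global description regardless of utterance length.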
What does ASR stand for?
Automatic Speech Recognition
What is the segmental region of analysis?
These are linguistically meaningful segments such as phonemes, voiced/unvoiced regions, words, etc.
What are spectral features in feature extraction?
The classical speech (ASR) features, i.e. spectral measures. Most commonly these are the MFCCs.
Prosodic features in feature extraction are?
Pitch, energy, formants, timing, articulation etc.
What are the delta-coefficients?
Delta features (first-order derivatives) provide information about the rate of change of the acoustic features over time. They are obtained by computing the differences between consecutive frames of the acoustic features. Delta features capture the dynamics of the speech signal and can help in modeling the transitions between different phonetic units.
What are the double-delta coefficients?
Double-delta features (second-order derivatives) provide information about the acceleration or curvature of the acoustic features. They are computed by taking the differences between consecutive delta features. Double-delta features capture the changes in the rate of change of the acoustic features and can provide additional temporal information beyond the delta features.
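The delta and double-delta computation described above can be sketched with plain frame-to-frame differences. This is a simplified illustration: many toolkits instead use a regression over a window of several frames, and the function name `delta` and the toy feature matrix are assumptions for the example.

```python
import numpy as np

def delta(features):
    """First-order difference along time.

    features has shape (n_frames, n_coeffs). The first frame is repeated
    so the output keeps the same shape and its first row is zero.
    """
    return np.diff(features, axis=0, prepend=features[:1])

# Toy feature matrix: 4 frames, 2 coefficients (hypothetical values)
feats = np.array([[0.0, 1.0],
                  [1.0, 3.0],
                  [3.0, 6.0],
                  [6.0, 10.0]])

d = delta(feats)        # rate of change between consecutive frames
dd = delta(d)           # change of the rate of change
stacked = np.hstack([feats, d, dd])  # static + delta + double-delta
```

Stacking the three blocks is how 13 static coefficients typically become a 39-dimensional feature vector.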
When do we use Cepstral Mean (Variance) Normalization (CMN/CMVN)?
We use it as a pre-processing step before the actual analysis, to reduce the variation introduced by different recording channels (microphones, transmission lines, etc.).
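A minimal per-utterance CMVN sketch is shown here. It assumes cepstral features of shape (frames, coefficients); the function name `cmvn`, the random stand-in features and the `offset` vector are hypothetical. The example relies on the fact that a fixed channel appears as a constant additive offset in the cepstral domain, so subtracting the mean removes it.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Normalise each coefficient to zero mean and unit variance over time."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 13))   # stand-in for 13 MFCCs over 200 frames
offset = rng.normal(size=13)         # stand-in for a constant channel effect

clean_norm = cmvn(feats)
channel_norm = cmvn(feats + offset)  # same utterance through another "channel"
```

After normalization the two versions coincide, which is the channel-compensation effect the card refers to.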
Prosodic features refer to 4 things. What are they?
Fundamental frequency (F0): mean, median, pitch contour etc.
Energy: shimmer, energy contours, voice level etc.
Duration: Speech rate, ratio of duration of voiced/unvoiced regions etc.
Formants: first to fourth formants, bandwidths etc.
There are also some time-domain features, which were prominent in subject 3 on speech signal representations. What are some examples?
Zero Crossing Rate (ZCR)
Autocorrelation
Attack (duration, slope)
Temporal energy centroid
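As an illustration of the first time-domain feature in the list above, the zero crossing rate of a frame can be sketched as follows. The function name and the two test tones are assumptions for the example; exact zeros are treated as positive, which is one of several possible conventions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1            # convention: treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

sr = 8000
t = np.arange(800) / sr              # one 100 ms frame
zcr_low = zero_crossing_rate(np.sin(2 * np.pi * 100 * t))    # low-pitched tone
zcr_high = zero_crossing_rate(np.sin(2 * np.pi * 1000 * t))  # high-pitched tone
zcr_alt = zero_crossing_rate(np.array([1.0, -1.0, 1.0, -1.0]))
```

Higher-frequency or noise-like content crosses zero more often, which is why ZCR is a cheap cue for voiced versus unvoiced regions.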
Why is it normal that we use Voice Activity Detection (VAD) in feature pre-processing?
The presence of silence in the training data can corrupt the model. And likewise, silence in the test data will degrade the decision.
How do we apply Voice Activity Detection (VAD)?
Energy thresholds (as we did in the second lab).
Waveform and spectrum analysis based on pitch and harmonic detection or maybe even ZCR.
Or finally, based on statistical models.
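The first approach, an energy threshold, can be sketched as below. This is an assumption-laden illustration rather than the lab's code: the function name `energy_vad`, the threshold of -30 dB relative to the loudest frame, and the synthetic "silence then tone" signal are all chosen for the example.

```python
import numpy as np

def energy_vad(x, frame_len, hop, threshold_db=-30.0):
    """Return a boolean mask: True for frames whose energy, relative to
    the loudest frame, is above threshold_db."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energies = np.array([
        np.sum(x[i * hop : i * hop + frame_len] ** 2) for i in range(n_frames)
    ])
    energy_db = 10 * np.log10(energies / (energies.max() + 1e-12) + 1e-12)
    return energy_db > threshold_db

# Synthetic signal: 0.5 s of faint noise, then 0.5 s of a 200 Hz tone
sr = 8000
rng = np.random.default_rng(0)
x = rng.normal(scale=0.001, size=sr)
t = np.arange(sr // 2) / sr
x[sr // 2:] += 0.5 * np.sin(2 * np.pi * 200 * t)

mask = energy_vad(x, frame_len=200, hop=80)
active_ratio = float(mask.mean())    # share of frames kept as speech
```

Only the frames where `mask` is True would be passed on to feature extraction, so silence does not reach training or testing.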
We can manipulate our features to improve training time, for example through dimensionality reduction. What are two such methods?
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
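A minimal PCA sketch using the singular value decomposition is given here; LDA is not shown because it additionally needs class labels. The function name `pca` and the synthetic 39-dimensional features (generated from 3 underlying factors plus a little noise) are assumptions for the example.

```python
import numpy as np

def pca(features, n_components):
    """Project features (n_frames, n_dims) onto their top principal axes."""
    centered = features - features.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                 # principal directions
    explained = (s ** 2) / np.sum(s ** 2)          # variance ratio per axis
    return centered @ components.T, components, explained[:n_components]

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))                 # 3 hidden factors
mixing = rng.normal(size=(3, 39))
feats = latent @ mixing + 0.01 * rng.normal(size=(500, 39))

projected, components, explained = pca(feats, n_components=3)
```

Here three components are enough to keep almost all of the variance, so the classifier can be trained on 3 dimensions instead of 39.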
A classical approach to speech pattern recognition is the Gaussian Mixture Model (GMM). The GMM is a particular case of which model?
The Hidden Markov Model (HMM). A GMM can be seen as an HMM with a single state.
Why is the Gaussian Mixture Model (GMM) so useful?
A single Gaussian cannot model complex (e.g. multimodal) feature distributions. We need to mix several Gaussians in order to model them. In theory, we can fit almost any distribution given enough components.
Which method do we use to estimate GMM parameters?
The Expectation-Maximization (EM) algorithm.
After using a method to estimate the GMM parameters, what do we then do?
Compute the log-likelihood of the sequence of features.
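The last two cards can be sketched end to end: EM estimates the parameters, then the log-likelihood of a feature sequence is used as a score. This is a simplified illustration under stated assumptions: features are one-dimensional, the means are initialised at data quantiles (a choice made here for determinism), and all names and data are hypothetical.

```python
import numpy as np

def log_likelihood(x, weights, means, variances):
    """Total log-likelihood of sequence x and per-frame log responsibilities."""
    log_gauss = -0.5 * (np.log(2 * np.pi * variances)
                        + (x[:, None] - means) ** 2 / variances)
    log_comp = np.log(weights) + log_gauss                  # shape (T, K)
    m = log_comp.max(axis=1, keepdims=True)                 # stable log-sum-exp
    per_frame = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return per_frame.sum(), log_comp - per_frame[:, None]

def fit_gmm(x, k, n_iter=50):
    """Fit a 1-D GMM with EM; returns the model and the log-likelihood history."""
    weights = np.full(k, 1.0 / k)
    means = np.quantile(x, (np.arange(k) + 0.5) / k)        # assumed init
    variances = np.full(k, x.var())
    history = []
    for _ in range(n_iter):
        ll, log_resp = log_likelihood(x, weights, means, variances)  # E-step
        history.append(ll)
        resp = np.exp(log_resp)
        nk = resp.sum(axis=0)                                        # M-step
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-6)             # variance floor
    history.append(log_likelihood(x, weights, means, variances)[0])
    return (weights, means, variances), history

rng = np.random.default_rng(0)
train_a = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])
train_b = rng.normal(0.5, 0.7, 600)
test_a = np.concatenate([rng.normal(-2.0, 0.5, 50), rng.normal(3.0, 1.0, 50)])

model_a, history_a = fit_gmm(train_a, k=2)
model_b, _ = fit_gmm(train_b, k=2)

# Score the test sequence under each class model; the higher score wins.
score_a = log_likelihood(test_a, *model_a)[0]
score_b = log_likelihood(test_a, *model_b)[0]
```

With one GMM per class, classification reduces to picking the model with the highest log-likelihood for the observed feature sequence.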