04 - Speech Pattern Classification Flashcards by Joachim Andreasen

What is meant by linguistic information?

Information that is explicitly in or almost uniquely inferable from the written message.

What is meant by paralinguistic information?

Information that is not inferable from the written message, but is deliberately added by the speaker to modify or complement the linguistic information, such as attitude or intonation. Example: “I am SOO excited!”

What is meant by nonlinguistic information?

Information about other factors such as age, gender, idiosyncrasy (personal traits), and physical or emotional state. In general, these are conditions that are not related to the linguistic content and cannot be controlled by the speaker. Example: a hoarse voice caused by a cold.

What does Speech Pattern Classification refer to?

Extracting information from speech (such as language, accent, etc.) by taking an input signal and converting it into a sequence of class labels.

In speech classification we normally divide into 3 models. These are?

Acoustic model, language model, pronunciation model

The 2 main blocks of the speech pattern classification process are?

Feature extraction and classification.

What is the local region of analysis?

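This is the framing of the data: the signal is split into short frames (typically a few tens of milliseconds) and features are computed for each frame.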

What is the global region of analysis?

These are the functionals: statistics such as the mean, median, or maximum, computed over the frame-level features of a whole utterance.
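
As a rough illustration of local versus global analysis, the sketch below splits a signal into frames, computes one value per frame (local), and then summarizes those values with functionals (global). The frame length, hop size, toy signal, and function names are illustrative assumptions, not taken from the course material.

```python
from statistics import mean, median

def frame_signal(signal, frame_len, hop):
    """Split a sequence of samples into overlapping frames (local regions)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def frame_energy(frame):
    """Local (frame-level) feature: mean squared amplitude."""
    return sum(x * x for x in frame) / len(frame)

def functionals(values):
    """Global (utterance-level) statistics over frame-level features."""
    return {"mean": mean(values), "median": median(values), "max": max(values)}

signal = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0]
frames = frame_signal(signal, frame_len=4, hop=2)
energies = [frame_energy(f) for f in frames]
summary = functionals(energies)
```

Here the per-frame energies are the local description, and `summary` is a fixed-size global description regardless of how long the signal is.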

What does ASR stand for?

Automatic Speech Recognition

What is the segmental region of analysis?

This is analysis over linguistically or acoustically defined segments, such as phonemes, voiced/unvoiced regions, or words.

Spectral features in feature extraction are?

Classical speech (ASR) features based on spectral measures. These are most commonly the MFCCs (Mel-frequency cepstral coefficients).

Prosodic features in feature extraction are?

Pitch, energy, formants, timing, articulation etc.

What are the delta-coefficients?

Delta features (first-order derivatives) provide information about the rate of change of the acoustic features over time. They are obtained by computing the differences between consecutive frames of the acoustic features. Delta features capture the dynamics of the speech signal and can help in modeling the transitions between different phonetic units.
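
A minimal sketch of the simple-difference version described above, for one feature coefficient tracked over time. Practical front ends often use a regression over several neighbouring frames instead. The function name `delta` and the edge handling (repeating the first frame) are assumptions made for illustration.

```python
def delta(features):
    """First-order differences between consecutive frames.

    `features` is a list of per-frame values for one coefficient.
    The first frame is repeated at the edge so the output keeps the
    same length as the input (the first delta is therefore 0).
    """
    padded = [features[0]] + list(features)
    return [padded[t + 1] - padded[t] for t in range(len(features))]

c0 = [1.0, 2.0, 4.0, 7.0, 7.0]   # one cepstral coefficient over 5 frames
d = delta(c0)                    # rate of change between frames
dd = delta(d)                    # applying it again gives second-order differences
```

In a typical feature vector, the static coefficients, `d`, and `dd` are stacked per frame.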

What are the double-delta coefficients?

Double-delta features (second-order derivatives) provide information about the acceleration or curvature of the acoustic features. They are computed by taking the differences between consecutive delta features. Double-delta features capture the changes in the rate of change of the acoustic features and can provide additional temporal information beyond the delta features.

When do we use the Cepstral Mean (Variance) Normalization (CMN/CMVN)?

We apply it as a pre-processing step on the features, before the actual analysis, to reduce the variation caused by different recording channels.
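
A minimal per-utterance sketch of the idea: subtracting the mean of each coefficient (CMN) removes a constant offset, and dividing by the standard deviation as well (CMVN) equalizes the scale. The function name `cmvn`, the `eps` guard, and the toy data are assumptions for illustration.

```python
from statistics import mean, pstdev

def cmvn(frames, eps=1e-10):
    """Normalize each coefficient to zero mean and unit variance
    over all frames of one utterance.

    `frames` is a list of feature vectors (one list per frame).
    """
    dims = list(zip(*frames))                 # one tuple per coefficient
    mus = [mean(d) for d in dims]
    sigmas = [pstdev(d) + eps for d in dims]  # eps avoids division by zero
    return [[(x - m) / s for x, m, s in zip(f, mus, sigmas)] for f in frames]

utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cmvn(utterance)

# The same utterance with a constant offset per coefficient,
# standing in for a different (hypothetical) channel.
shifted = [[a + 5.0, b - 3.0] for a, b in utterance]
normalized_shifted = cmvn(shifted)
```

Both versions normalize to the same values, which is why a constant channel offset in the cepstral domain no longer affects the classifier.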

Prosodic features refer to 4 things. These are?

Fundamental frequency (F0): mean, median, pitch contour etc.
Energy: shimmer, energy contours, voice level etc.
Duration: Speech rate, ratio of duration of voiced/unvoiced regions etc.
Formants: first to fourth formants, bandwidths etc.

There are also some time-domain features, which were also prominent in subject 3 on speech signal representations. These can be for example?

Zero Crossing Rate (ZCR)
Autocorrelation
Attack (duration, slope)
Temporal energy centroid
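
As an example of one of these time-domain features, the sketch below computes the zero crossing rate of a single frame. Treating a sample of exactly zero as non-negative is an assumption of this sketch, as are the toy frames.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

noise_like = [1.0, -1.0, 1.0, -1.0, 1.0]    # sign flips at every sample
voiced_like = [1.0, 2.0, 1.0, -1.0, -2.0]   # a single sign flip

zcr_noise = zero_crossing_rate(noise_like)
zcr_voiced = zero_crossing_rate(voiced_like)
```

Noisy or unvoiced frames tend to give a high rate, while voiced frames dominated by low frequencies tend to give a low one.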

Why is it normal that we use Voice Activity Detection (VAD) in feature pre-processing?

The presence of silence in the training data can corrupt the model. And likewise, silence in the test data will degrade the decision.

How do we apply Voice Activity Detection (VAD)?

Energy thresholds (as we did in the second lab).
Waveform and spectrum analysis based on pitch and harmonic detection or maybe even ZCR.
Or finally, based on statistical models.
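
The first option, an energy threshold, can be sketched as follows. This is not the lab's implementation: the frame length, the fixed threshold, and the use of non-overlapping frames are all assumptions chosen to keep the example small.

```python
def energy_vad(signal, frame_len, threshold):
    """Return one boolean per non-overlapping frame: True means speech."""
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

signal = [0.01, -0.02, 0.01, 0.0,    # near silence
          0.5, -0.6, 0.7, -0.5,      # speech-like
          0.0, 0.01, -0.01, 0.02]    # near silence
mask = energy_vad(signal, frame_len=4, threshold=0.01)
```

Frames marked `False` would be dropped before feature extraction. In practice the threshold is usually set relative to the estimated noise level rather than hard-coded.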

We can use some methods to manipulate our features to improve training time such as dimensionality reduction. Two of these measures are?

Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

A classical approach to speech pattern recognition is the Gaussian Mixture Model (GMM). The GMM is a particular case of which model?

The Hidden Markov Model (HMM). A GMM can be seen as an HMM with a single state.

Why is the Gaussian Mixture Model (GMM) so useful?

A single Gaussian cannot model complex (for example multimodal) feature distributions. By mixing several Gaussians we can. In theory, a GMM can fit anything given enough components.

Which method do we use to estimate GMM parameters?

The Expectation-Maximization (EM) algorithm.
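
A minimal sketch of EM for a one-dimensional GMM, to show the alternation between soft assignment (E-step) and re-estimation (M-step). The toy data, the initial guess, the variance floor, and all function names are assumptions. Real systems work on multi-dimensional features and usually initialize with k-means.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(math.log(sum(w * gauss(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a 1-D GMM."""
    # E-step: responsibility of each component for each sample
    resp = []
    for x in data:
        p = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate parameters from the soft assignments
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_m.append(mu)
        new_v.append(max(var, var_floor))   # keep variances from collapsing
    return new_w, new_m, new_v

data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])   # rough initial guess
history = [log_likelihood(data, *params)]
for _ in range(20):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
weights, means, variances = params
```

The values in `history` should not decrease from one iteration to the next, which is the usual sanity check for an EM implementation.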

After using a method to estimate the GMM parameters, what do we then do?

Compute the log-likelihood of the sequence of features.
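
A sketch of this scoring step for one-dimensional features: the log-likelihood of the whole sequence is the sum of per-frame log-likelihoods, and the class whose model scores highest wins. The two class models and the feature values are hypothetical placeholders, and frames are assumed to be independent.

```python
import math

def gmm_log_likelihood(sequence, weights, means, variances):
    """Sum of per-frame log-likelihoods under a 1-D GMM
    (frames are treated as independent)."""
    total = 0.0
    for x in sequence:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total

# Hypothetical, already-trained class models: (weights, means, variances)
models = {
    "class_a": ([0.5, 0.5], [-2.0, 0.0], [1.0, 1.0]),
    "class_b": ([0.5, 0.5], [3.0, 5.0], [1.0, 1.0]),
}
features = [2.8, 3.5, 4.9, 5.2]
scores = {name: gmm_log_likelihood(features, *p) for name, p in models.items()}
decision = max(scores, key=scores.get)
```

Dividing each score by the number of frames gives an average log-likelihood, which makes sequences of different lengths comparable.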

What can we use Speaker Recognition (SR) for?

Speaker identification, access to closed spaces, verification

In Speaker Recognition (SR) tasks we often need to evaluate measures. We can do this using test trials. We have target trials and imposter trials. Each trial requires 2 outputs which are?

Actual decision (true/false) and a likelihood score (confidence)

There are two types of decision errors in Speaker Recognition tasks. These are?

Missed detections (Miss): the percentage of target trials incorrectly rejected. Normally called the False Reject Rate (FRR).
False alarms (FA): the percentage of imposter trials incorrectly accepted. Often called the False Accept Rate (FAR).

What is the 'Equal Error Rate' (EER) a measure of?

It is the error rate at the decision threshold where the False Alarm rate (FA or FAR) equals the Missed Detection rate (Miss or FRR), i.e. where the two curves cross. It is often used as a single-number performance measure.
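
A small sketch of how the EER can be estimated from trial scores: sweep the threshold, compute FRR and FAR at each value, and take the point where they are closest. The scores below are made up, and the convention "accept when score >= threshold" is an assumption.

```python
def error_rates(target_scores, imposter_scores, threshold):
    """Miss rate (FRR) and false alarm rate (FAR) when accepting score >= threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    return frr, far

def equal_error_rate(target_scores, imposter_scores):
    """Try every observed score as a threshold and return the point
    where FRR and FAR are closest, reported as their average."""
    best = None
    for th in sorted(set(target_scores) | set(imposter_scores)):
        frr, far = error_rates(target_scores, imposter_scores, th)
        if best is None or abs(frr - far) < abs(best[0] - best[1]):
            best = (frr, far, th)
    frr, far, th = best
    return (frr + far) / 2, th

targets = [0.9, 0.8, 0.7, 0.4]      # scores for target trials
imposters = [0.6, 0.3, 0.2, 0.1]    # scores for imposter trials
eer, threshold = equal_error_rate(targets, imposters)
```

With only a few trials the two rates may never be exactly equal, which is why the sketch averages them at the closest point. Evaluation toolkits typically interpolate instead.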

What can we use a Detection Error Tradeoff (DET) plot for?

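It allows us to look at the tradeoff between FA/FAR (conventionally the x-axis) and Miss/FRR (conventionally the y-axis) across thresholds, to see the costs and benefits of choosing a particular threshold. It is similar to a ROC curve, but plots the two error rates against each other.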

In Speaker Recognition (SR) there are 4 stages between the speech signals input and the results (output). These are?

Feature Extraction (front end), Representation, Variability compensation, classification (back end)

Features for speaker recognition should be PRS which is?

Practical, Robust and Secure

Variability compensation in Speaker Recognition refers to?

Compensating for the changes in channel effects between training and successive detection attempts, caused for example by different microphones or environments.

In Speaker Recognition we often use a GMM-UBM. What are the advantages of using this?

GMM-UBM stands for Gaussian Mixture Model-Universal Background Model. Its advantages are that it needs less data per speaker, only updates the parameters of seen events, keeps the correspondence between the means of the speaker model and the UBM, and allows fast scoring.
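
The card does not name the adaptation method. Speaker models are commonly derived from the UBM by MAP adaptation of the means, and the sketch below assumes that approach to show why "only seen events are updated". The one-dimensional UBM, the enrollment data, and the relevance factor of 16 are illustrative assumptions.

```python
import math

def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D UBM toward enrollment data.

    Components that see little or no data keep (almost) their UBM mean;
    components that see a lot of data move toward the data mean.
    """
    n = [0.0] * len(means)        # soft counts per component
    first = [0.0] * len(means)    # responsibility-weighted sums of the data
    for x in data:
        p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
             for w, m, v in zip(weights, means, variances)]
        total = sum(p)
        for k, pk in enumerate(p):
            n[k] += pk / total
            first[k] += pk / total * x
    adapted = []
    for k, mu in enumerate(means):
        alpha = n[k] / (n[k] + relevance)            # data-dependent mixing factor
        data_mean = first[k] / n[k] if n[k] > 0 else mu
        adapted.append(alpha * data_mean + (1 - alpha) * mu)
    return adapted

ubm = ([0.5, 0.5], [0.0, 10.0], [1.0, 1.0])   # hypothetical 2-component UBM
enrollment = [1.0] * 16                        # data near the first component only
speaker_means = map_adapt_means(enrollment, *ubm)
```

The first mean moves halfway toward the enrollment data, while the second, which saw essentially no data, stays at its UBM value. Because each adapted component still corresponds to the same UBM component, the means remain comparable across speakers.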

Recent research has focused on making systems more robust to session variability. Why?

Because session variability is one of the largest challenges to practical use of SR systems. Think voice activation in a noisy environment for example.

What is the supervector concept?

The supervector concept takes the components of a GMM-UBM model, each of which represents a region of the acoustic feature space, and concatenates their mean vectors into a single SUPER vector. This enables very effective modelling and discrimination.
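
The concatenation itself is simple, as the sketch below shows. The three-component, two-dimensional model is a made-up example. Real systems use far more components and higher-dimensional features, so the resulting vector is very long.

```python
def supervector(component_means):
    """Concatenate the mean vectors of all GMM components into one vector."""
    return [value for mean_vector in component_means for value in mean_vector]

# Hypothetical adapted speaker model: 3 components, 2-dimensional features
means = [[0.1, 0.2], [1.5, -0.3], [2.0, 2.2]]
sv = supervector(means)   # length = number of components x feature dimension
```

Every utterance, whatever its duration, maps to a vector of the same length, which is what allows fixed-input classifiers to be applied.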

After the GMM-UBM the GMM-Support Vector Machine (GMM-SVM) method was introduced. Why was this a huge success?

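SVMs perform a nonlinear mapping of the input (here, the GMM supervectors) into a high-dimensional feature space, where target and imposter speakers can be separated efficiently. This makes training and scoring fast and improves modelling and discrimination.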

What is the identity-vector (i-vector)?

The i-vector, which stands for "identity vector," is a concept in speaker recognition that represents each utterance as a point in a space of much lower dimension than the supervector. It captures speaker-specific information by modeling the variations in the mean supervector of a speaker's utterances. The approach is beneficial because it provides a compact and robust representation of speaker characteristics, which enables efficient modeling of speaker variability and supports speaker recognition even in challenging conditions such as limited enrollment data and mismatched recording conditions.

In 2018, x-vectors were introduced as an extension of the i-vector. Why are these so successful?

X-vectors improved upon the i-vector approach by incorporating deep neural networks (DNNs) to extract speaker embeddings. X-vectors leverage the power of deep learning to capture more discriminative speaker representations, leading to enhanced speaker recognition performance, especially in scenarios with large speaker variability and challenging acoustic conditions.

What is the state-of-the-art in speaker verification?

The ECAPA-TDNN. A deep time-delay neural network built from residual convolutional blocks with channel attention. It has less than 2% EER (Equal Error Rate).

What is the state-of-the-art in speaker diarization?

It is the VBx system. A variational Bayes hidden Markov model with x-vectors. It has a Diarization Error Rate (DER) of about 5%.

What is the state-of-the-art in Speech Emotion Recognition?

A 2D CNN LSTM network. Its strongest reported results are on the EmoDB dataset (the Berlin Database of Emotional Speech).

(41 cards)