Misc. Exam Questions Flashcards

Question 1

Q

Outline the function of the basilar membrane in terms of human hearing.

Answer

A

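The basilar membrane acts as a frequency analyser. It vibrates in response to the frequency components of the incoming acoustic wave (for speech, these include the formants), and each frequency produces its peak vibration at a particular place along the membrane: high frequencies near the base of the cochlea, low frequencies near the apex.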

Question 2

Q

Outline when you would use a narrowband or wideband spectrogram to study a speech sample, with direct reference to the different acoustic-phonetic characteristics apparent in each.

Clearly define suitable window durations for their extraction, explaining why those window lengths are appropriate.

Answer

A

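Narrowband (window of about 20-40 ms): used to emphasise frequency detail. A long window spans several pitch periods, which gives fine frequency resolution, so the harmonic structure of the vocal fold vibration can be seen as horizontal stripes. This suits vowels and other voiced sounds, which have strong harmonic content.
Wideband (window of about 5 ms): used to emphasise changes over time. A short window gives good temporal resolution but coarse frequency resolution, so the individual harmonics blur together and the formants stand out as broad bands. Good for finding the formant frequencies.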
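The trade-off between the two window lengths can be checked with a little arithmetic. The sketch below is illustrative only: the 16 kHz sampling rate and the 100 Hz pitch are assumptions, not values from the question, and the resolution is taken as roughly the reciprocal of the window duration (the exact figure depends on the window shape).

```python
def window_properties(duration_ms, sample_rate_hz):
    """Return (window length in samples, approximate frequency resolution in Hz)."""
    n_samples = round(duration_ms / 1000 * sample_rate_hz)
    # Resolution is roughly sample_rate / window_length, i.e. 1 / duration.
    resolution_hz = sample_rate_hz / n_samples
    return n_samples, resolution_hz


def resolves_harmonics(resolution_hz, f0_hz):
    """Harmonics are spaced f0 apart, so they separate only if resolution < f0."""
    return resolution_hz < f0_hz


SAMPLE_RATE_HZ = 16000  # assumed sampling rate
F0_HZ = 100             # assumed pitch of the speaker

narrow = window_properties(30, SAMPLE_RATE_HZ)  # narrowband, 30 ms
wide = window_properties(5, SAMPLE_RATE_HZ)     # wideband, 5 ms

print("narrowband:", narrow, resolves_harmonics(narrow[1], F0_HZ))
print("wideband:  ", wide, resolves_harmonics(wide[1], F0_HZ))
```

Under these assumptions the 30 ms window resolves about 33 Hz, fine enough to separate harmonics 100 Hz apart, while the 5 ms window resolves only about 200 Hz, so the harmonics merge and the formant envelope is what remains visible.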

Question 3

Q

To build speaker models, they plan to use 2 minutes of data from each speaker to build each speaker-specific GMM using the EM algorithm, leaving the remainder for testing.

What are the shortcomings in this approach and what strategy would you instead recommend?

Answer

A

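Two minutes of speech per speaker is not enough to robustly estimate all the parameters (weights, means and covariances) of a speaker-specific GMM from scratch, so EM is likely to overfit. In addition, GMMs are commonly trained with diagonal covariances, which assumes that the feature dimensions are uncorrelated within each component; this is not necessarily the case in real life.
Use a GMM-UBM instead: train a universal background model on pooled data from many speakers, then adapt it to each speaker using their 2 minutes of data. An end-to-end model (a deep learning model) is an alternative, but only if there is enough data.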
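In a GMM-UBM system the speaker model is usually obtained by MAP-adapting a background model rather than by running EM from scratch. The sketch below shows means-only MAP adaptation on 1-D toy data. Everything in it is an assumption for illustration: the two-component UBM, the speaker samples, and the relevance factor of 16 (a commonly quoted value, not one given in the question).

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Shift each UBM mean towards the speaker data it is responsible for."""
    n_comp = len(means)
    counts = [0.0] * n_comp  # soft count of frames per component
    sums = [0.0] * n_comp    # responsibility-weighted sum of frames
    for x in data:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for k in range(n_comp):
            resp = likes[k] / total
            counts[k] += resp
            sums[k] += resp * x
    adapted = []
    for k in range(n_comp):
        # alpha -> 1 with plenty of data, alpha -> 0 with very little,
        # so sparsely observed components stay close to the UBM.
        alpha = counts[k] / (counts[k] + relevance)
        speaker_mean = sums[k] / counts[k] if counts[k] > 0 else means[k]
        adapted.append(alpha * speaker_mean + (1 - alpha) * means[k])
    return adapted


# Hypothetical UBM (two components) and a small set of speaker frames.
ubm_weights = [0.5, 0.5]
ubm_means = [-2.0, 2.0]
ubm_vars = [1.0, 1.0]
speaker_data = [2.5, 3.0, 3.5] * 10

adapted_means = map_adapt_means(speaker_data, ubm_weights, ubm_means, ubm_vars)
print(adapted_means)
```

The component that explains the speaker's data moves part of the way towards it, while the component that sees almost no data barely changes. That is why adaptation copes with short enrolment recordings better than training a full model per speaker.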

Question 4

Q

Explain how zero crossing rate allows you to classify speech into voiced and unvoiced speech. Comment on the accuracy of such an approach.

Answer

A

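Unvoiced speech is noise-like, with most of its energy above about 1.5 kHz, so the waveform crosses zero often and the zero crossing rate (ZCR) is high.
Voiced speech has most of its energy below that, so the waveform crosses zero less often and the ZCR is low.
Comparing the ZCR of each short frame against a threshold therefore gives a voiced/unvoiced decision.
Accuracy is limited: ZCR is sensitive to background noise and DC offset, and it struggles with sounds that mix both kinds of excitation (e.g. voiced fricatives), so it is usually combined with short-time energy.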
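A minimal sketch of a ZCR-based decision is given below. It uses pure tones as stand-ins for speech frames, which real speech is not, and the sampling rate, frame length and threshold are all assumptions chosen for illustration. The threshold is set to the ZCR a tone at the 1.5 kHz boundary would have, since a tone crosses zero twice per cycle.

```python
import math

SAMPLE_RATE_HZ = 16000  # assumed sampling rate
FRAME_MS = 20           # assumed frame length
# Hypothetical threshold: ZCR of a pure tone at the 1.5 kHz boundary.
ZCR_THRESHOLD = 2 * 1500 / SAMPLE_RATE_HZ


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame):
    return "unvoiced" if zero_crossing_rate(frame) > ZCR_THRESHOLD else "voiced"


def tone(freq_hz, phase=0.1):
    """One frame of a sine wave; the phase offset avoids samples exactly at zero."""
    n = SAMPLE_RATE_HZ * FRAME_MS // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ + phase)
            for i in range(n)]


low_frame = tone(150)    # stand-in for a voiced frame (low-frequency energy)
high_frame = tone(3000)  # stand-in for an unvoiced frame (high-frequency energy)

print(zero_crossing_rate(low_frame), classify_frame(low_frame))
print(zero_crossing_rate(high_frame), classify_frame(high_frame))
```

A single fixed threshold like this is the weak point of the method: a frame whose energy straddles the boundary, or one corrupted by noise, can land on either side.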

Question 5

Q

In an HMM, what is the ‘hidden’ element in the model? What does it typically correspond to in a speech recognition system?

Answer

A

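The ‘hidden’ element of an HMM is the sequence of internal states, which you don’t observe directly. You only see the outputs (observations) they generate. In a speech recognition system, the hidden states typically correspond to phones or sub-phone units, and the observations are acoustic feature vectors (e.g. MFCCs).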

You hear someone speaking behind a curtain.
You can hear the sound (observations) but you can’t see what words they’re saying (hidden states).
Your job is to guess the word based on the sounds.
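The guessing step in the analogy is commonly done with Viterbi decoding, which finds the most likely hidden state sequence given the observations. The toy example below is a sketch only: the two phone states, the two coarse observation symbols and every probability are made up for illustration, and a real recogniser would use continuous feature vectors and many more states.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most likely final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: two phones, observed only through coarse sound labels.
states = ["s", "iy"]
start_p = {"s": 0.6, "iy": 0.4}
trans_p = {"s": {"s": 0.7, "iy": 0.3}, "iy": {"s": 0.2, "iy": 0.8}}
emit_p = {"s": {"noisy": 0.9, "periodic": 0.1},
          "iy": {"noisy": 0.2, "periodic": 0.8}}

heard = ["noisy", "noisy", "periodic", "periodic"]
decoded = viterbi(heard, states, start_p, trans_p, emit_p)
print(decoded)  # ['s', 's', 'iy', 'iy']
```

The listener never sees the phones directly; the decoded sequence is inferred purely from what was heard and from the model's probabilities.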