Speech Perception Flashcards
- Understand why designing computer speech recognition systems is difficult.
- Computers can’t match people’s ability to recognize speech.
- Computers perform well when a person speaks slowly and clearly, when there is no background noise, and when the system is listening for a small set of predetermined words or phrases.
- Humans can perceive speech even when confronted with phrases they have never heard, various background noises, sloppy pronunciation, and speakers with different dialects and accents.
Acoustic signal
produced by air that is pushed up from the lungs past the vocal cords and into the vocal tract.
vowels
produced by vibration of the vocal cords
- Each vowel has a characteristic series of ‘formants’ (resonant frequencies)
- The first formant has the lowest frequency, the second formant has the next higher frequency, and so on.
formants
Formants: the frequencies at which these peaks of acoustic energy occur (the resonant frequencies of the vocal tract).
- Formant transitions: rapid shifts in frequency preceding or following formants; they are associated with consonants.
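To make the idea concrete, here is a minimal Python sketch of source-filter vowel synthesis: a pulse train at the fundamental frequency stands in for vocal-cord vibration, and one resonator per formant shapes the sound. The formant frequencies (700, 1200, 2600 Hz for an /a/-like vowel) and the 100 Hz bandwidths are rough illustrative assumptions, not values from these notes.

```python
# Minimal source-filter vowel synthesis sketch (illustrative values only).
import numpy as np
from scipy.signal import lfilter

fs = 16000                     # sample rate (Hz)
f0 = 120                       # fundamental: rate of vocal-cord vibration
formants = [700, 1200, 2600]   # assumed F1-F3 for an /a/-like vowel

t = np.arange(0, 0.5, 1 / fs)
# Source: impulse train at f0, a simplified stand-in for glottal pulses.
signal = np.zeros_like(t)
signal[::int(fs / f0)] = 1.0

# Filter: a two-pole resonator per formant (~100 Hz bandwidth each).
for f in formants:
    bw = 100.0
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f / fs
    a = [1.0, -2 * r * np.cos(theta), r ** 2]   # resonator denominator
    signal = lfilter([1.0], a, signal)

signal /= np.abs(signal).max()  # normalize amplitude
```

Changing which formant frequencies you pick changes which vowel is heard; the same pulse train filtered through different resonances yields different vowels.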
consonants
Consonants are produced by a constriction, or closing, of the vocal tract, which changes the airflow around the articulators.
Unlike vowels, consonant sounds are created by the movement of air and the shape of the articulators: the tongue, lips, teeth, jaw, and soft palate.
phonemes
the smallest unit of speech that, when changed, changes the meaning of a word (e.g., changing /b/ to /p/ turns "bat" into "pat")
In English there are 47 phonemes.
spectrogram
- A spectrogram indicates the pattern of frequencies and intensities over time that make up the acoustic signal.
- Frequency is indicated on the vertical axis
- time (ms) is indicated on the horizontal axis;
- intensity is indicated by darkness, with darker areas indicating greater intensity.
- Intensity is represented by the darkness of the bands; the lower frequencies (300-700 Hz) are the most intense here.
- The dark horizontal smudges are the formants.
- A bend in a band is a formant transition.
- The vertical lines in the spectrogram are pressure oscillations caused by vibrations of the vocal cords.
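As a companion to the conventions above, here is a minimal sketch of plotting a spectrogram in Python: time (ms) on the horizontal axis, frequency on the vertical axis, and darker shading for greater intensity. It assumes a mono waveform in `signal` with sample rate `fs` (e.g., from the vowel-synthesis sketch above).

```python
# Minimal spectrogram plot following the axis conventions in these notes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=384)

plt.pcolormesh(times * 1000, freqs,              # time in ms, frequency in Hz
               10 * np.log10(Sxx + 1e-12),       # intensity in dB
               cmap="Greys", shading="auto")     # darker = more intense
plt.xlabel("Time (ms)")
plt.ylabel("Frequency (Hz)")
plt.ylim(0, 4000)   # the formants of interest sit in the lower frequencies
plt.show()
```

In the resulting plot, the dark horizontal bands are the formants, and bends in those bands are formant transitions.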
lack of invariance or variability problem:
no simple relationship between a particular phoneme and the acoustic signal; the acoustic signal for a particular phoneme is variable
variability from different speakers
Speakers differ in pitch, accent, speaking rate, and pronunciation; this variable acoustic signal must be transformed into familiar words.
- Coarticulation: because the articulators are constantly moving as we talk, the shape of the vocal tract associated with a particular phoneme is influenced by the sounds that precede and follow it. This overlap between the articulation of neighbouring phonemes is called coarticulation.
variability from context
even though we perceive the same /d/ sound in /di/ and /du/, the formant transitions, which are the acoustic signals associated with these sounds, are very different.
- Thus, the context in which a specific phoneme occurs can influence the acoustic signal associated with that phoneme.
Categorical perception
a wide range of acoustic cues results in the perception of a limited number of sound categories
- Demonstrated using a property called voice onset time (VOT): the delay between when a sound begins and when the vocal cords start vibrating.
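A minimal sketch of the idea in code: stimuli that vary continuously in VOT are nonetheless heard as only two categories, with an abrupt boundary. The ~35 ms boundary and the /da/ vs /ta/ labels are illustrative assumptions, not values from these notes.

```python
# Categorical perception sketch: a continuous VOT range maps onto
# two discrete phoneme categories with an abrupt boundary.
BOUNDARY_MS = 35.0  # assumed category boundary, for illustration only

def perceived_category(vot_ms: float) -> str:
    """Map a continuous voice onset time (ms) to a perceived phoneme."""
    return "/da/" if vot_ms < BOUNDARY_MS else "/ta/"

# A smooth continuum of VOT values is heard as just two categories:
for vot in range(0, 81, 10):
    print(f"VOT = {vot:3d} ms -> heard as {perceived_category(vot)}")
```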
multimodal
speech perception is multimodal; our perception of speech can be influenced by information from a number of different senses.
McGurk effect
although auditory information is the major source of information for speech perception, visual information can also exert a strong influence on what we hear
audio-visual speech perception
- This influence of vision on speech perception is called audio-visual speech perception.
The McGurk effect is one example of audio-visual speech perception (e.g., people routinely use information provided by a speaker’s lip movements to help understand speech in a noisy environment).
Experiment
The McGurk effect
-Visual stimulus shows a speaker saying “ga-ga.”
- Auditory stimulus has a speaker saying “ba-ba.”
- Observer watching and listening hears “da-da”, which is the midpoint between “ga” and “ba.”
- An observer with eyes closed will hear “ba-ba.”
- The link between vision and speech has been shown to have a physiological basis.
- Calvert et al. showed that the same brain areas are activated for lip reading and speech perception.