Introduction to speech perception Flashcards
What are some challenges of speech perception?
Recording of sentence “he guessed the answer to the question in the exam”
- Unlike written language, no clear gaps between words
E.g. “answer” is one word, but it may contain two acoustic events. Conversely, “in” and “the” are two different words, but there’s no gap between them in the signal.
- “the” sounds different in different positions (co-articulation: the acoustic realisation of speech depends on what you’ve just said and what you’re about to say). This adds variability to acoustic speech and can make it hard for a computer to understand.
- Accent, gender and speaking rate
- Time constraints
- We hear up to 200 words per minute
- Sound is fleeting (sound is always changing, a temporal signal)
- “Now-or-never bottleneck” - speech is coming in quickly, sound doesn’t stay static- need to quickly process the word you’ve just heard before the next word comes in
Why study speech perception?
- Speech is the primary means by which we communicate
- More broadly, reading: learning to read requires you to learn the relationship between letters and speech sounds (phonics)
- Listeners who have some form of hearing loss: a cochlear implant directly stimulates the auditory nerve. This restores hearing to some extent, but adapting to an implant requires the brain to adapt to novel sensory information.
- Individuals with developmental language disorder: helpful for understanding what’s going on and developing strategies to help them
How do we produce speech?
- what does speech require
- ____ pushes air to _____
- what does this result in
- what are sounds shaped by
- including?
- what are these structures important for?
- Speech requires a basic energy source. This initial energy source is provided by the lungs
- The lungs push air up the trachea (windpipe)
- which vibrates the vocal cords in the larynx (voicebox)
- Sounds from the vocal cords are then shaped by the supralaryngeal (all the structures above the larynx) vocal tract, including:
- Pharynx
- Oral cavity (and lips, tongue, teeth)
- Nasal cavity
- These structures are important for shaping the sounds; you need them for intelligible speech
What method can be used to see speech production?
MRI
Describing speech: Consonants
How are consonants produced?
With a constriction in the vocal tract
Describing speech: Consonants
What are the 3 main features it’s classified by?
- Stop: the constriction is complete, so air flow stops completely. Stops can be voiced (vocal cords vibrating, e.g. b, d, g) or voiceless (e.g. p, t, k).
- Fricative: the constriction is not complete, so air is forced through a narrow gap
- Nasal- air flow is redirected to nasal cavity
Describing speech: Consonants
Stop:
+voice: b, d, g
-voice: p, t, k
d: the constriction happens where the tongue touches the ridge just behind the upper teeth
g: the back of the tongue touches the back of the roof of the mouth
Fricative:
+voice: v, z
-voice: f, s
Nasal:
m, n, ŋ (the “ng” sound in “sing”)
What are sound waves?
Periodic displacement of air molecules, creating increases and decreases in air pressure
Speech as sound waves:
- what is happening
- what is formed
- A vibrating source (like a plate moving back and forth) moves the surrounding air molecules; in speech, the source is the vibrating vocal cords. These pressure changes are picked up by the ear, which converts them into the sensation of sound.
- Plotting changes in sound pressure over time: at certain moments the air molecules come together and there’s an increase in pressure.
- A sound waveform is formed, which is perceived by the brain.
In relation to a sound waveform, what is amplitude and period?
Amplitude:
- related to loudness
- larger the peaks the louder
Period:
- inversely related to frequency; important cue to pitch
- peaks closer together = higher frequency and pitch
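A minimal sketch of amplitude and period for a pure tone, assuming an 8000 Hz sample rate and made-up tone settings (the function names are illustrative, not from the lecture):

```python
import math

def sine_wave(frequency_hz, amplitude, duration_s, sample_rate=8000):
    """Return samples of a pure tone: amplitude * sin(2*pi*f*t)."""
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
            for n in range(n_samples)]

def period_seconds(frequency_hz):
    """Period is the reciprocal of frequency."""
    return 1.0 / frequency_hz

# Two hypothetical tones: a quiet low one and a loud high one.
low_quiet = sine_wave(frequency_hz=100, amplitude=0.2, duration_s=0.05)
high_loud = sine_wave(frequency_hz=400, amplitude=0.8, duration_s=0.05)

print(period_seconds(100))  # 0.01 (seconds between peaks)
print(period_seconds(400))  # 0.0025 (peaks closer together, higher pitch)
print(max(low_quiet), max(high_loud))  # peak height tracks loudness
```

The louder tone has the larger peaks, and the higher tone has the shorter period.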
Speech as sound waves:
- what is speech associated with?
- how do you get speech?
- what is speech a mix of?
Speech is more complicated than a simple beep: it has more variation and is more complex.
There’s a relationship between the waveform of a simple tone and that of a more complicated sound. You get speech essentially by mixing simple sounds together and shaping the amplitude of the mix over time.
Speech is a mix of lots of simpler sounds, which together create this more complex sound.
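A rough sketch of the "mix of simpler sounds" idea: add a few sine waves together, then shape the amplitude over time. The component frequencies, amplitudes and the envelope shape are arbitrary assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)

def complex_tone(components, duration_s):
    """Sum several sine waves; components is a list of (frequency_hz, amplitude)."""
    n_samples = int(duration_s * SAMPLE_RATE)
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        samples.append(sum(a * math.sin(2 * math.pi * f * t) for f, a in components))
    return samples

def apply_envelope(samples):
    """Shape amplitude over time with a slow rise and fall (half a sine cycle)."""
    last = len(samples) - 1
    return [s * math.sin(math.pi * n / last) for n, s in enumerate(samples)]

# Three simple tones mixed into one more complex waveform.
mix = complex_tone([(100, 1.0), (200, 0.5), (300, 0.25)], duration_s=0.1)
shaped = apply_envelope(mix)
print(len(mix), max(mix), max(shaped))
```

Real speech has many more components that change from moment to moment; this only shows the principle of building a complex wave from simple ones.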
Spectrogram: Analysing the frequencies of speech
1- what is a spectrogram?
2- difference between dark grey and light grey?
3- why is it useful?
4- what is being split?
- A spectrogram is a graph showing how sound amplitude varies as a function of time (x-axis) and frequency (y-axis)
- Dark grey = large amplitude, light grey = small amplitude
- Useful because the ear splits sound by frequency, so a spectrogram better captures the information available to the brain.
- The sound is split into its different frequency components, just as the ears and brain split the information into frequency channels
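A small sketch of how a spectrogram can be computed: cut the sound into short frames and measure the amplitude at each frequency in each frame. The frame size, hop size and test signal are assumptions, and a direct DFT is used for clarity rather than speed.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Amplitude of each frequency bin (0 .. N/2) of one frame, via a direct DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(samples, frame_size=64, hop=32):
    """Rows are time frames, columns are frequency bins, values are amplitudes."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1))
              for i in range(frame_size)]  # Hann window to smooth frame edges
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        frames.append(dft_magnitudes(frame))
    return frames

sample_rate = 8000
# Test signal: a 1000 Hz tone followed by a 2000 Hz tone, 256 samples each.
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
tone += [math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(256)]

spec = spectrogram(tone)
bin_hz = sample_rate / 64  # 125 Hz per frequency bin
first_peak = max(range(len(spec[0])), key=lambda k: spec[0][k]) * bin_hz
last_peak = max(range(len(spec[-1])), key=lambda k: spec[-1][k]) * bin_hz
print(first_peak, last_peak)  # 1000.0 2000.0
```

The strongest frequency moves from 1000 Hz in the first frame to 2000 Hz in the last, which on a spectrogram would appear as the dark band jumping upwards over time.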
Adding source and filter to how we produce speech
The lungs push air up the trachea (windpipe)
Which vibrates the vocal cords in the larynx (voicebox) → ‘Source’
Sounds from the vocal cords are then shaped by the supralaryngeal vocal tract → ‘Filter’
- Pharynx
- Oral cavity (and lips, tongue, teeth)
- Nasal cavity
Source-filter theory
Source only
Source (vocal cords) important for voice pitch and intonation
It provides some information, such as voice pitch
Source-filter theory
Source + filter
This shows how important the filter is for making intelligible speech
Filter (supralaryngeal vocal tract) important for producing different speech sounds (phonemes)
Filtering appears as bands of energy at certain frequencies called ‘formants’ (in Latin, “formare” = “to shape”)
The lowest three formant frequencies are the most important for speech intelligibility (labelled F1, F2 and F3)
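A simplified sketch of source plus filter: a pulse train stands in for the vocal cords (source) and two resonators stand in for the vocal tract (filter), each creating one formant. The sample rate, the 100 Hz pitch, the formant values (700 Hz and 1200 Hz) and the 100 Hz bandwidths are all illustrative assumptions, and a real vocal tract is far more complex than this.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz (assumed)

def pulse_train(f0_hz, n_samples):
    """Source: one unit pulse per pitch period, a crude stand-in for vocal cord vibration."""
    period = round(SAMPLE_RATE / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, centre_hz, bandwidth_hz):
    """Filter: a two-pole resonator that boosts energy near centre_hz (one formant)."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * centre_hz * t)
    a = 1 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def magnitude_at(signal, freq_hz):
    """Amplitude of one frequency component, via a single DFT bin."""
    return abs(sum(s * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE)
                   for n, s in enumerate(signal)))

source = pulse_train(f0_hz=100, n_samples=4000)
speech_like = resonator(resonator(source, 700, 100), 1200, 100)  # F1, F2 (illustrative)

steady = speech_like[-800:]  # skip the start-up transient
for hz in (300, 700, 1000, 1200, 2000):
    print(hz, round(magnitude_at(steady, hz), 2))
```

The source alone has equal energy at every harmonic of 100 Hz; after filtering, energy is concentrated around 700 Hz and 1200 Hz, which is what the formant bands on a spectrogram show.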
Source-filter theory: Vowels
- What happens when changing from front to back vowels?
- What happens when changing from high to low vowels?
- Changing from front to back vowels, e.g. “heed” vs “hod”, the F2 frequency decreases
- Changing from high to low vowels, e.g. “heed” vs “had”, the F1 frequency increases
Source-filter theory: Vowels
Key Point
So your brain can know which vowel it is hearing by detecting these auditory “cues”
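A toy sketch of using F1 and F2 as cues: pick the vowel whose typical formants are closest to the ones heard. The formant values are approximate averages for an adult male voice, included as assumptions for illustration (“who’d” is added only to complete the high/low, front/back grid), and this is not a claim about how the brain actually does it.

```python
import math

# Approximate (F1, F2) values in Hz for an adult male voice.
# Illustrative assumptions, not measurements from these notes.
VOWEL_FORMANTS = {
    "heed": (270, 2290),   # high front
    "had": (660, 1720),    # low front
    "hod": (730, 1090),    # low back
    "who'd": (300, 870),   # high back
}

def classify_vowel(f1_hz, f2_hz):
    """Return the word whose typical (F1, F2) is nearest to the input."""
    return min(VOWEL_FORMANTS,
               key=lambda word: math.dist((f1_hz, f2_hz), VOWEL_FORMANTS[word]))

print(classify_vowel(300, 2200))  # heed
print(classify_vowel(700, 1150))  # hod
```

Low F1 with high F2 lands on a high front vowel; high F1 with low F2 lands on a low back vowel.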
Source-filter theory: Consonants
What are important cues for identifying consonants?
Second and third formants (F2 and F3) are important cues for identifying consonants
At the beginning of each consonant, the formants take on a different shape for p, t and k.
How do we perceive phonemes?
(Categorical perception and how to demonstrate it)
3 things
- Set up a continuum of sounds between two phonemes
- Run an identification experiment
- Run a discrimination experiment
How do we perceive phonemes:
1. Set up a continuum of sounds between two phonemes
A different sound at each end of the continuum
The midpoint of the continuum is ambiguous between ‘ba’ and ‘da’: it is an intermediate between the two.
You hear ‘ba’ in the beginning, then something intermediary between the two, then by the end it’s a clear ‘da’
How do we perceive phonemes:
2. Run an identification experiment
Identify if the sound you’re hearing is a ba or da sound.
Plot the percentage of responses.
When you’re hearing a clear unambiguous ba, most of the time people will respond with ba.
At the other end of the continuum, when the sound is a clear da, listeners hardly ever respond with ba (they respond with da instead)
The point on this graph where listeners are equally likely to respond ‘ba’ and ‘da’ is referred to as the phoneme boundary.
One of the main signatures of categorical perception is that around the phoneme boundary, you have an abrupt transition in this graph. Perception suddenly changes.
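A sketch of what an identification curve might look like and how the phoneme boundary can be read off it. The 10-step continuum, the logistic shape and its parameters are hypothetical; real data would come from listeners’ responses.

```python
import math

def percent_ba(step, boundary=5.5, slope=2.0):
    """Hypothetical listener: % 'ba' responses at a continuum step (logistic curve)."""
    return 100.0 / (1.0 + math.exp(slope * (step - boundary)))

steps = range(1, 11)  # 1 = clear 'ba', 10 = clear 'da'
curve = [percent_ba(s) for s in steps]

def phoneme_boundary(steps, curve):
    """Interpolate the step where responses cross 50% 'ba'."""
    points = list(zip(steps, curve))
    for (s0, p0), (s1, p1) in zip(points, points[1:]):
        if p0 >= 50.0 > p1:
            return s0 + (p0 - 50.0) / (p0 - p1) * (s1 - s0)
    return None

for s, p in zip(steps, curve):
    print(s, round(p, 1))
print("boundary:", phoneme_boundary(steps, curve))  # close to 5.5
```

Responses stay near 100% ‘ba’ for the first few steps, drop sharply between steps 5 and 6, then stay near 0%: the abrupt transition sits at the phoneme boundary.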
How do we perceive phonemes?
3. Run a discrimination experiment
Play pairs of adjacent sounds on the continuum and ask listeners whether the two sounds are the same or different
Plot the % of “different” responses
There is a discrimination peak near the phoneme boundary
What is categorical perception?
The tendency to perceive gradual sensory changes in a discrete fashion
What are three hallmarks of categorical perception?
- Abrupt change in identification at phoneme boundary
- Discrimination peak at phoneme boundary
- Discrimination predicted from identification (two sounds only sound “different” if they are classified as different phonemes)
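A sketch of the third hallmark under a simplifying assumption: listeners say “different” only when the two sounds in a pair receive different labels. The identification curve is the same kind of hypothetical logistic function, with made-up parameters.

```python
import math

def prob_ba(step, boundary=5.5, slope=2.0):
    """Hypothetical probability of labelling a continuum step as 'ba'."""
    return 1.0 / (1.0 + math.exp(slope * (step - boundary)))

def predicted_different(step_a, step_b):
    """Chance the two sounds get different labels (one 'ba', the other 'da')."""
    pa, pb = prob_ba(step_a), prob_ba(step_b)
    return pa * (1 - pb) + pb * (1 - pa)

pairs = [(s, s + 1) for s in range(1, 10)]  # adjacent steps on a 10-step continuum
predictions = [predicted_different(a, b) for a, b in pairs]

for pair, p in zip(pairs, predictions):
    print(pair, round(100 * p, 1))

peak_pair = pairs[predictions.index(max(predictions))]
print("peak at pair:", peak_pair)  # (5, 6), the pair straddling the boundary
```

Pairs within a category are predicted to sound the same, while the pair that straddles the boundary is predicted to sound different most often, giving the discrimination peak.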
Yanny or Laurel? Categorical perception in action
Yanny- 47%
Laurel- 53%
The sound is ambiguous, but your brain can’t help latching onto a specific interpretation rather than an intermediate mix of the two.
When understanding speech, your brain tries to latch onto a specific interpretation
Context influences speech perception:
Green needle/ brainstorm
Exactly the same sound, but different expectations each time change how you perceive it
Example of how speech perception depends on prior knowledge of context
Context influences speech perception:
Visual context “McGurk effect”
You hear one thing and see another, and what you perceive is changed by what you see. This is a prior context effect: lip movements tend to precede the speech that you hear, and this influences what you perceive.
Context influences speech perception:
Lexical context “Ganong effect”
- Listener has to do an identification task on sounds from a continuum between g and k
- Plot the % of g responses
- Present the sound in front of “iss”
- Then present the sound in front of “ift”
The graph shows:
At the midpoint, when the ambiguous sound is placed in an “ift” context, you show a bias towards g, because g in combination with “ift” makes a real word (gift).
When it is placed in front of “iss”, you are biased towards k, because “kiss” is a real word and “giss” isn’t.
It is exactly the same sound, ambiguous between g and k, yet your perception is biased towards the interpretation that makes a real word.
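One way to picture the Ganong effect is as a shift of the phoneme boundary depending on lexical context. The sketch below assumes a 10-step g-to-k continuum, a logistic identification curve and a boundary shift of one step in each direction; all of these numbers are hypothetical.

```python
import math

# Hypothetical boundary shifts, in continuum steps, for each context.
LEXICAL_SHIFT = {"ift": +1.0, "iss": -1.0, None: 0.0}

def percent_g(step, context=None, boundary=5.5, slope=1.5):
    """% 'g' responses at a step (1 = clear g, 10 = clear k) in a given context."""
    shifted = boundary + LEXICAL_SHIFT[context]
    return 100.0 / (1.0 + math.exp(slope * (step - shifted)))

midpoint = 5.5  # the ambiguous sound
print("no context:", round(percent_g(midpoint), 1))
print("before 'ift':", round(percent_g(midpoint, "ift"), 1))  # biased towards g (gift)
print("before 'iss':", round(percent_g(midpoint, "iss"), 1))  # biased towards k (kiss)
```

The same ambiguous sound gets mostly g responses before “ift” and mostly k responses before “iss”, while clear sounds at either end of the continuum are barely affected.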