speech perception Flashcards

1
Q
  1. Understand why designing computer speech recognition systems is difficult.
A
  • Computers can’t match people’s ability to recognize speech.
  • Computers perform well when a person speaks slowly and clearly, when there is no background noise, and when they are listening for a few predetermined words or phrases.
  • Humans can perceive speech even when confronted with phrases they have never heard, despite background noise, sloppy pronunciation, and speakers with different dialects and accents.
2
Q

Acoustic signal

A

produced by air that is pushed up from the lungs past the vocal cords and into the vocal tract.

3
Q

vowels

A

produced by vibration of the vocal cords

  • Each vowel has a characteristic series of ‘formants’ (resonant frequencies)
  • The first formant has the lowest frequency, the second has the next highest, etc.
4
Q

formants

A

Formants: the frequencies at which peaks in the pressure of the acoustic signal occur (the resonant frequencies of the vocal tract).

  • Formant transitions: rapid shifts in frequency preceding or following formants (associated with consonants).
5
Q

consonants

A

Consonants are produced by a constriction, or closing, of the vocal tract and by airflow around the articulators.

Every sound other than a vowel (i.e., a consonant) is created by the movement of air and the shape of the articulators: the tongue, lips, teeth, jaw, and soft palate.

6
Q

phonemes

A

the smallest unit of speech that, when changed, changes the meaning of a word

In English there are 47 phonemes.

7
Q

spectrogram

A
  • A spectrogram indicates the pattern of frequencies and intensities over time that make up the acoustic signal.
  • Frequency is indicated on the vertical axis.
  • Time (ms) is indicated on the horizontal axis.
  • Intensity is indicated by darkness, with darker areas indicating greater intensity; here the lower frequencies (300–700 Hz) are the most intense.

Dark smudges are formants:

  • a bend in a band is a formant transition
  • the vertical lines in the spectrogram are pressure oscillations caused by vibrations of the vocal cords
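The axes described above can be illustrated with a short sketch (assuming Python with NumPy and SciPy; the signal and its frequencies are made up for illustration, loosely mimicking the intense 300–700 Hz band):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                 # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / fs)
# synthetic signal: strong 400 Hz component plus a weaker 1500 Hz one
sig = np.sin(2 * np.pi * 400 * t) + 0.3 * np.sin(2 * np.pi * 1500 * t)

# f: frequency axis (vertical), times: time axis (horizontal),
# Sxx: intensity at each (frequency, time) point
f, times, Sxx = spectrogram(sig, fs=fs)

# the "darkest band" on a plotted spectrogram = the most intense frequency
peak_freq = f[Sxx.mean(axis=1).argmax()]
print(peak_freq)  # near 400 Hz
```

Plotting `Sxx` with darkness proportional to intensity reproduces the layout the card describes.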
8
Q

lack of invariance or variability problem:

A

There is no simple relationship between a particular phoneme and the acoustic signal; the acoustic signal for a particular phoneme is variable.

9
Q

variability from different speakers

A

Speakers differ in pitch, accent, speaking speed, and pronunciation → this variable acoustic signal must be transformed into familiar words.

  • Coarticulation: because the articulators are constantly moving as we talk, the shape of the vocal tract associated with a particular phoneme is influenced by the sounds that both precede and follow that phoneme. This overlap between the articulation of neighbouring phonemes is called coarticulation.
10
Q

variability from context

A

Even though we perceive the same /d/ sound in /di/ and /du/, the formant transitions (the acoustic signals associated with these sounds) are very different.

Thus, the context in which a specific phoneme occurs can influence the acoustic signal associated with that phoneme.

11
Q

Categorical perception

A

a wide range of acoustic cues results in the perception of a limited number of sound categories

  • studied using a property called voice onset time (VOT): the delay between the beginning of a sound and the onset of vocal-cord vibration
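A toy sketch of the idea (the 35 ms /da/ vs /ta/ boundary here is an assumed round value for illustration, not a figure from this deck):

```python
# Listeners map a continuous VOT value onto a small set of discrete
# categories, with perception switching sharply at a boundary.
BOUNDARY_MS = 35  # assumed phonetic boundary

def perceive(vot_ms: float) -> str:
    # a wide range of acoustic cues -> a limited number of categories
    return "/da/" if vot_ms < BOUNDARY_MS else "/ta/"

# values on the same side of the boundary sound identical;
# crossing the boundary flips the perceived category
print([perceive(v) for v in (0, 20, 40, 60)])
# ['/da/', '/da/', '/ta/', '/ta/']
```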
12
Q

multimodal

A

speech perception is multimodal; our perception of speech can be influenced by information from a number of different senses.

13
Q

McGurk effect

A

although auditory information is the major source of information for speech perception, visual information can also exert a strong influence on what we hear

14
Q

audio-visual speech perception

A
  • The influence of vision on speech perception is called audio-visual speech perception.

The McGurk effect is one example of audio-visual speech perception. (E.g., people routinely use information provided by a speaker’s lip movements to help understand speech in a noisy environment.)

15
Q

Experiment

The McGurk effect

A

  • Visual stimulus shows a speaker saying “ga-ga.”
  • Auditory stimulus has a speaker saying “ba-ba.”
  • An observer watching and listening hears “da-da,” which is midway between “ga” and “ba.”
  • An observer with eyes closed hears “ba-ba.”
  • The link between vision and speech has been shown to have a physiological basis: Calvert et al. showed that the same brain areas are activated by lip reading and by speech perception.
16
Q

“top-down” processing affects speech perception

A
  • Philip Rubin and coworkers (1976), for example, presented a series of short words and nonwords and asked listeners to respond by pressing a key as rapidly as possible whenever they heard a sound that began with /b/.
  • Participants took 631 ms to respond to the nonwords and 580 ms to respond to the real words.
  • Thus, when a phoneme was at the beginning of a real word, it was identified about 8 percent faster.

Speech perception is therefore determined both by the nature of the acoustic signal (bottom-up processing) and by context that produces expectations in the listener (top-down processing).
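The “about 8 percent” figure follows directly from the two reaction times reported above:

```python
# speed-up of real words over nonwords, from the reported times
nonword_ms, word_ms = 631, 580
speedup_pct = (nonword_ms - word_ms) / nonword_ms * 100
print(round(speedup_pct))  # 8 (percent)
```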

17
Q

phonemic restoration effect:

A

The ability to perceptually “fill in” part of a word (e.g., a phoneme) that has been obscured by noise. The effect was experienced even by students and staff in the psychology department who knew that the /s/ was missing.

  • It can be influenced by the meaning of words following the missing phoneme.

18
Q

The segmentation problem -

A

there are no physical breaks in the continuous acoustic signal.

19
Q

speech segmentation

A

The perception of individual words in a conversation is called speech segmentation.

20
Q

How we perceive breaks in words

A

  • Knowledge: top-down processing, including the knowledge a listener has about a language, affects perception of the incoming speech stimulus; the perceptual organization of the sounds changes with knowledge of their meaning.
  • Transitional probabilities
  • Statistical learning

21
Q
transitional probabilities
A

the chances that one sound will follow another sound.

22
Q

statistical learning

A

The process of learning about transitional probabilities and about other characteristics of language is called statistical learning. Research has shown that infants as young as 8 months of age are capable of statistical learning.
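The idea can be sketched with a Saffran-style syllable stream (the three “words” below are hypothetical stimuli, not the actual experimental materials): within a word, each syllable strongly predicts the next; across word boundaries, the next syllable is uncertain.

```python
import random
from collections import Counter

# three made-up "words" of three syllables each
words = ["bidaku", "padoti", "golabu"]

random.seed(0)
stream = []
for _ in range(200):
    w = random.choice(words)
    stream += [w[i:i + 2] for i in range(0, len(w), 2)]

pairs = Counter(zip(stream, stream[1:]))   # counts of (syllable, next)
firsts = Counter(stream[:-1])              # counts of each leading syllable

def transitional_prob(a, b):
    # P(b follows a) = count(a -> b) / count(a)
    return pairs[(a, b)] / firsts[a]

print(transitional_prob("bi", "da"))  # within a word: 1.0
print(transitional_prob("ku", "pa"))  # across a boundary: about 1/3
```

A statistical learner that tracks these probabilities can place word boundaries wherever the transitional probability dips.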

23
Q

The pop-out effect

A

shows that higher-level information, such as listeners’ knowledge, can improve speech perception.

  • After experiencing the pop-out effect, subjects became better at understanding other degraded sentences that they were hearing for the first time.

24
Q

Broca’s aphasia

A
  • Broca’s area is located in the frontal lobe; Broca’s aphasia is thus caused by frontal lobe damage.
  • Patients with this problem (slow, laboured, ungrammatical speech caused by damage to Broca’s area) are diagnosed as having Broca’s aphasia.
  • They have difficulty forming complete sentences and understanding some types of sentences.
25
Q

Wernicke’s aphasia

A

Wernicke’s aphasia is caused by damage to an area in the temporal lobe that came to be called Wernicke’s area.

Patients produce speech that is fluent and grammatically correct but tends to be incoherent, and they are unable to understand speech and writing.

  • Whereas patients with Broca’s aphasia have trouble understanding sentences in which meaning depends on word order, as in “The boy was pushed by the girl,” Wernicke’s patients have more widespread difficulties in understanding and would be unable to understand “The apple was eaten by the girl” as well.

26
Q

word deafness

A

Patients cannot recognise words, even though the ability to hear pure tones remains intact.

Patients with damage to the parietal lobe have difficulty discriminating between syllables.

27
Q

voice area

A
  • Some patients with brain damage can discriminate words but are unable to discriminate syllables (and vice versa).
  • Brain scans have shown that there is a voice area in the human superior temporal sulcus that is activated more by human voices than by other sounds.

Catherine Perrodin and coworkers (2011) recorded from neurons in the monkey’s temporal lobe that they called voice cells because they responded more strongly to recordings of monkey calls than to calls of other animals or to “non-voice” sounds.

  • The “voice area” and “voice cells” are located in the temporal lobe, which is part of the what processing stream for hearing.

28
Q

dual stream model of speech perception :

A

A ventral stream for recognizing speech and a dorsal stream that links the acoustic signal to the movements for producing speech.

The ventral pathway starts in the anterior (front) part of the auditory cortex; the dorsal pathway starts in the posterior (rear) part. The ventral pathway is responsible for recognizing speech, and it has been proposed that the dorsal pathway is involved in linking the acoustic signal to the movements used to produce speech.

29
Q

phonemes and electrode coding

A
  • Manner of articulation describes how the articulators interact while making a speech sound; place of articulation describes where the articulation occurs.
  • Responses from some electrodes were linked to specific phonetic features (one electrode picked up responses to sounds articulated at the back of the mouth, such as /g/, and another responded to sounds articulated near the front, such as /b/).
  • Thus, neural responses can be linked both to phonemes, which specify particular sounds, and to phonetic features, which are related to the way those sounds are produced.
  • The neural code for phonemes and phonetic features therefore corresponds to population coding.
30
Q

Motor theory of speech perception

A

The motor theory of speech perception proposes a link between speech perception and action:

  • hearing a particular speech sound activates motor mechanisms controlling the movement of the articulators, such as the tongue and lips
  • activation of these motor mechanisms, in turn, activates additional mechanisms that enable us to perceive the sound.
  • discovery of mirror neurons.
31
Q

categorical perception of phonemes in infants

peter eimas and coworkers

A
  • Habituation procedure: infants as young as 1 month old perform similarly to adults in categorical perception experiments.
  • The infant sucks a nipple to hear a series of brief speech sounds; when the same sound is repeated, the infant’s sucking eventually habituates to low levels.
  • When the VOT is shifted across the average adult phonetic boundary, the infants perceive a change in the sound; when the VOT is shifted on the same side of the phonetic boundary, they perceive little or no change.
  • That infants as young as 1 month old are capable of categorical perception is particularly impressive because these infants have had virtually no experience producing speech sounds and only limited experience hearing them.
32
Q

social gating hypothesis
experience dependent learning
learning new languages (mandarin)
kuhl

A
  • Occurs before age 1.
  • “Shaping by experience”: the infant’s brain becomes specialised to discriminate between sounds that occur in the language the infant is hearing.
  • Kuhl and coworkers (2003) had 9-month-old American infants attend 12 25-minute training sessions over a 4-week period in which a Mandarin-speaking teacher read them stories in Mandarin and talked to them (the teacher made frequent eye contact with the infants and said their names).
  • After training, the American infants did well on a test of Mandarin sounds.
  • Kuhl then exposed another group of infants to the same material, but instead of a live teacher they saw a DVD presentation of her reading the stories on a video monitor.
  • The performance of these infants was the same as that of infants who had received no training in Mandarin.
  • Live interaction provides interpersonal social cues that attract the infants’ attention and motivate learning.
  • Social gating hypothesis: the social brain “gates” the mechanisms responsible for language learning. This explains why learning doesn’t occur when infants just view DVD images, and suggests why children with autism, who tend to avoid normal social contact, are often deficient in language.
33
Q

changes in the shape of your vocal tract and vibrations of the vocal cords change…

A

the resonance of the vocal tract, which produces peaks in pressure at a number of frequencies called formants (formant frequencies).

34
Q

perceptual constancy

A

People perceive speech easily in spite of the variability problems because of perceptual constancy.