Exam 3 Flashcards
what is the external part of the ear called?
the pinna
what does the outer ear consist of?
pinna, ear canal, and the eardrum
what does the middle ear consist of?
from the eardrum to the oval window: contains three small bones - the malleus, incus, and stapes
passage through the middle ear does what to the sound?
amplifies it
what does the inner ear consist of?
the semicircular canals and the cochlea
what happens in the cochlea?
mechanical sound waves are converted to electrical nerve impulses
for an unwound cochlea, there is a thicker and thinner end which is which? (Apical or Basal end)
thicker is the Apical end, thinner is the Basal end
for an unwound cochlea which frequencies do the Apical and Basal ends move more for?
the Apical end moves more for lower frequencies because thicker = lower resonant frequency
the Basal end moves more for high frequencies because thin = higher resonant frequency
the basilar membrane has a thick and thin end which are which (Apical and Basal)
Apical is thick and Basal is thin
the basilar membrane is tonotopically organized - what does that mean?
different locations on the membrane correspond to different frequencies
Denes and Pinson 1993 : 90
- shows how far the basilar membrane is pushed out of place by different frequencies
What was the conclusion found?
lower frequencies (25 Hz) produce their maximum displacement farther from the stapes (about 30 mm) than higher frequencies do (1600 Hz -> about 17 mm)
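For illustration only: a standard approximation of this place-frequency map is the Greenwood function (a general psychoacoustic formula, not something from Denes and Pinson); the minimal Python sketch below uses the commonly cited human parameter values, which are an assumption here.

# Hedged sketch: Greenwood place-frequency function for the human basilar membrane.
# x_from_apex is the fraction of the distance from the apex (0 = apical end, 1 = basal end).
# The constants (A = 165.4, a = 2.1, k = 0.88) are the usual human values, assumed for illustration.

def greenwood_cf(x_from_apex: float) -> float:
    """Approximate characteristic frequency (Hz) at a point on the basilar membrane."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x_from_apex) - k)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{x:.2f} of the way from the apex -> ~{greenwood_cf(x):.0f} Hz")
# low frequencies map near the apical end, high frequencies near the basal end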
explain how hair cells work - what is their role?
they are attached to the basilar membrane; a hair cell fires if movement of the basilar membrane displaces it sufficiently
what is the response curve of a hair cell?
shows the lowest intensity at which a pure tone at a given frequency triggers a firing of the cell - the low point shows the freq. the hair cell responds to most readily - the closer to the apical end (thick) the lower the resonant frequency
Moore 1997 : 33
- shows the response curves for different hair cells
What does the lowest point show?
the lowest point is the characteristic freq. where it will fire at the lowest amplitude
what is the most important factor of a hair cell?
the location of it - they are all the same otherwise
what causes hearing loss at certain frequencies?
the hair cells are pushed too far and sheared off
the outer hair cells are different from inner how?
when the outer hair cells fire they change length to push back on the basilar membrane and amplify the signal
Denes and Pinson 1993 : 95
- shows the human hearing range
What does this show about our speech sounds as humans?
What is the peak sensitivity?
speech sounds evolved to be where our hearing is particularly good - the peak sensitivity is between 1000 and 10,000 Hz
tonotopically organized signals from the ear are passed to the brain through what?
the auditory nerve, through various bodies in the brainstem and to the cerebrum (uppermost and outermost part of the brain)
signals from the right ear are passed to where? what is this called?
the left hemisphere of the brain - decussation
where is the auditory cortex located and what does it border?
in the temporal lobe of each hemisphere of the cerebral cortex, on the superior temporal gyrus (STG) - it borders the lateral (Sylvian) fissure
what is the primary auditory cortex?
entryway into the cerebral cortex for signals from the ears
how is the primary auditory cortex organized?
tonotopically - different locations correspond to different frequency bands
the frequency-based locations in the primary auditory cortex correspond to what?
frequency-sensitive locations on the basilar membrane
damage to the primary auditory cortex could cause what?
aphasia
Bear et al. 2007
- both hemispheres of the brain have an auditory cortex
But what?
but one is dominant for speech processing - the left for about 93% of people (96% of right-handers, 70% of left-handers)
what is dichotic listening?
different stimuli are presented to the two ears at the same time; speech materials are processed mainly in the hemisphere opposite the ear that receives them, so there is often a right-ear processing advantage for speech but NOT for non-speech sounds like music or humming
what is Wernicke’s Area and where is it?
the middle region of the STG; if it is injured, it causes problems with speech perception and comprehension (Wernicke's Aphasia)
where is Wernicke’s Area in relation to the auditory cortex?
posterior, the auditory cortex is the “bottom” part of the STG
when and who discovered Wernicke’s area?
1874, German neurologist Karl Wernicke - it was early evidence for brain area specialization
True or false:
electrical stimulation of Wernicke’s area interferes with identification of speech sounds, discrimination between speech sounds, and comprehension of speech
true
what are combination-sensitive neurons?
neurons in the STG that respond to particular patterns of frequency and amplitude - they fire only when a particular combination of primary cells is activated
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false: this study was testing to see which parts of the brain activated when the patient was producing speech and when they were not.
False - the study was to see which parts of the brain were active when speech was playing, but inactive during silence
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false: the patients passively listened to 500 samples of SAE sentences
True
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false: researchers found that when passively listening to speech, the STG was activated constantly, but was not in silence
False - different groups of neurons in the STG activated for different classes of sounds
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false:
e1 responded to the sibilant fricatives /s, ʃ , z/
False - e1 responded to the plosives /b, d, g, p, t, k/
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false:
e2 responded to the sibilant fricatives /s, ʃ , z/
true
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false:
e3 responded to the low-back vocoids (vowels and glides) /ɑ, aʊ/
true
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false:
e4 responded to the plosives /b, d, g, p, t, k/
False - e4 responded to the high-front vocoids /i, j/
Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery, electrodes were placed on the surface of the left STG (electrocorticography).
True or false:
e5 responded to nasals /m, n, ŋ/
true
Mesgarani, Cheung, Johnson and Chang (2014)
What is PSI?
phoneme selectivity index, represents the number of other phonemes statistically distinguishable from that phoneme in the response of a specific electrode
Mesgarani, Cheung, Johnson and Chang (2014)
what does a PSI = 0 mean?
that electrode does NOT distinguish between that phoneme and any others
Mesgarani, Cheung, Johnson and Chang (2014)
true or false:
PSI = 32 means the electrode can detect 32 phonemes
false - it means the electrode is maximally selective, the phoneme is distinguishable from all other phonemes in the response of that electrode
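For illustration only: a rough Python sketch of how a PSI-like count could be computed from single-electrode responses; the rank-sum test, significance level, and toy data are assumptions for illustration, not necessarily the statistics Mesgarani et al. used.

# Hedged sketch: for one electrode, count how many OTHER phonemes have response
# distributions statistically distinguishable from the target phoneme's responses.
# PSI = 0 -> no phonemes distinguished; PSI = (number of other phonemes) -> maximally selective.
from scipy.stats import ranksums

def psi(responses: dict[str, list[float]], target: str, alpha: float = 0.05) -> int:
    """responses maps each phoneme to this electrode's response values across trials."""
    count = 0
    for phoneme, values in responses.items():
        if phoneme == target:
            continue
        _, p_value = ranksums(responses[target], values)
        if p_value < alpha:
            count += 1
    return count

# toy data (invented): an electrode that responds strongly to sibilants
toy = {"s": [0.9, 1.1, 1.0, 0.95, 1.05], "ʃ": [0.85, 1.0, 0.9, 0.95, 1.02], "b": [0.1, 0.2, 0.15, 0.12, 0.18]}
print(psi(toy, "s"))   # counts how many other phonemes this electrode separates from /s/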
Mesgarani, Cheung, Johnson and Chang (2014)
true or false:
neurons sensitive to a particular acoustic combination are located near neurons sensitive to similar combinations and are therefore tonotopically organized
false - they are organized by phonetic category
Mesgarani, Cheung, Johnson and Chang (2014)
true or false:
from the basilar membrane to the primary auditory cortex, sound is represented tonotopically in the form of time-varying frequency spectrum, corresponding to a spectrogram
true
what is a category?
a set of entities or events that all elicit an equivalent response
categories are essential to learning and cognition - why?
we can only generalize particular experiences to general knowledge through the use of categories
true or false:
speech categories are the same across people and situations
false - they vary greatly from speaker to speaker and context to context; each person has a broad range of phonetic events they pull from to decode a word or sound
true or false:
an acoustic continuum is a series of items that differ gradiently for a series of acoustic properties
false - only one acoustic property not multiple
true or false:
an F1 continuum would be a series of items that have the same F1 but are different in other aspects
false - the items would differ ONLY in F1
true or false:
in an F1 continuum, each item in the series differs from the preceding item by the same F1 step
true
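For illustration only: a minimal Python sketch of building an equal-step continuum in a single acoustic property; the endpoint values and number of items are invented, not course data.

# Hedged sketch: an acoustic continuum is a series of items differing gradiently
# in ONE property, with each step the same size as the last.

def make_continuum(start: float, end: float, n_items: int) -> list[float]:
    """Return n_items values evenly spaced from start to end (inclusive)."""
    step = (end - start) / (n_items - 1)
    return [start + i * step for i in range(n_items)]

print(make_continuum(300.0, 600.0, 7))   # a 7-item F1 continuum in 50 Hz steps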
Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language
true or false:
the point where the identification curves for the two sounds cross is called the perceptual/identification boundary
true
Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language
true or false:
the study found that at low VOT, English speakers identified the stop as voiceless 100% of the time
false - at low VOT the subjects identified the sounds as VOICED 100% of the time
Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language
true or false:
at high VOT the English subjects identified the stop as voiceless 100% of the time
true
Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language
true or false:
the perceptual / identification boundary is where subjects were able to tell the stops apart 100% of the time
false - the boundary is where they identified the stimulus as voiced 50% of the time and voiceless 50% of the time
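For illustration only: a minimal Python sketch of estimating that 50% crossover from identification data by linear interpolation; the VOT steps and response proportions are invented, not Lisker and Abramson's results.

# Hedged sketch: the identification boundary is the VOT value at which the
# proportion of "voiced" responses crosses 50%.

def boundary(vots: list[float], p_voiced: list[float]) -> float:
    pairs = list(zip(vots, p_voiced))
    for (v0, p0), (v1, p1) in zip(pairs, pairs[1:]):
        if p0 >= 0.5 >= p1:   # crossing from mostly "voiced" to mostly "voiceless"
            return v0 + (p0 - 0.5) / (p0 - p1) * (v1 - v0)
    raise ValueError("identification function never crosses 50%")

vot_steps = [0, 10, 20, 30, 40, 50]                 # ms (invented)
voiced_props = [1.0, 0.95, 0.80, 0.30, 0.05, 0.0]   # proportion identified as voiced (invented)
print(boundary(vot_steps, voiced_props))             # boundary falls between 20 and 30 ms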
true or false:
Lisker and Abramson (1964) found that further forward places of articulation are associated with greater VOT values
false - places of articulation that are further back are associated with greater VOT values
Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language
true or false:
a conclusion drawn from this study is that the identification boundary is at a lower VOT for alveolars and velars than for bilabials
false - alveolar and velar sounds are made further back in the mouth and have greater VOT, so their identification boundaries lie at higher VOT values than bilabials'
what is categorical perception?
listeners ignore the differences of sounds on the same side of the perceptual boundary and only discriminate sounds that lie on opposite sides
Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language
true or false:
this study found that speakers differentiate sounds within each side of the perceptual boundary
false - they ignore the differences of those on the same side and only discriminate sounds that lie on opposite sides of the boundary
Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition from way above F2 steady-state to way below (hand drawn looked like eyebrows) - subjects were asked to identify as b, d, or g
true or false:
when F2 pointed down, subjects identified the consonant as d
false - F2 pointing down was identified as b
Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition from way above F2 steady-state to way below (hand drawn looked like eyebrows) - subjects were asked to identify as b, d, or g
true or false:
when F2 was flat, subjects identified the consonant as g
false - F2 was flat it was identified as d
Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition from way above F2 steady-state to way below (hand drawn looked like eyebrows) - subjects were asked to identify as b, d, or g
true or false:
when F2 pointed up, subjects identified the consonant as b
false - F2 pointed up was identified as g
Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition from way above F2 steady-state to way below (hand drawn looked like eyebrows) - subjects were asked to identify as b, d, or g
there is only one ambiguous stop - between which two stimuli does it lie?
between #3 (almost always b) and #5 (almost always d) - it marks the boundary between b and d
Liberman et al 1957
discrimination experiment:
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition from way above F2 steady-state to way below (hand drawn looked like eyebrows) - subjects listened to a series of 3 syllables (b, d, or g) together (e.g. ABX) where A and B are different and X is either identical to A or B
true or false:
if two of the syllables were within the same category (same side) subjects found it hard to discriminate between them
true
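For illustration only (a toy simulation, not Liberman et al.'s procedure or data): if a listener keeps only the category label of each syllable, two stimuli that usually get the same label become indistinguishable, which is the classic way of predicting discrimination from identification.

# Hedged sketch: ABX trials for a listener who retains only category labels.
# p_a / p_b are each stimulus's (invented) probability of being labeled "b" rather than "d".
import random

def label(p_b_label: float) -> str:
    return "b" if random.random() < p_b_label else "d"

def abx_correct(p_a: float, p_b: float) -> bool:
    a_lab, b_lab = label(p_a), label(p_b)
    x_is_a = random.random() < 0.5                  # X repeats either A or B
    x_lab = label(p_a if x_is_a else p_b)
    if a_lab == b_lab:                              # same label: reduced to guessing
        return random.random() < 0.5
    return (x_lab == a_lab) == x_is_a               # match X's label to A or to B

def accuracy(p_a: float, p_b: float, trials: int = 20000) -> float:
    return sum(abx_correct(p_a, p_b) for _ in range(trials)) / trials

print(accuracy(0.95, 0.90))   # within-category pair: near chance (~0.5)
print(accuracy(0.95, 0.05))   # cross-boundary pair: well above chance (~0.9)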
why might humans be more sensitive to acoustic cues that distinguish categories and insensitive to those within the categories?
because the acoustic differences within categories do NOT help with our goal of identifying what sound is being produced
Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ], the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and asked to determine if they were l or r (law or raw).
true or false:
stimuli with a low F3 in the consonant was identified as “l” nearly 100% of the time
false - low F3 were identified as “r” nearly 100% of the time
Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ], the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and asked to determine if they were l or r (law or raw).
true or false:
stimuli with a high F3 in the consonant were identified as “l” nearly 100% of the time
true
Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ], the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and asked to determine if they were l or r (law or raw).
there was one stimulus that could not be clearly assigned by the subjects - what does this mean and which one was it?
7, it was the identification boundary between l and r
Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ], the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and asked to determine if they were l or r (law or raw).
what were the three main findings of this study?
- SAE speakers did well distinguishing the sounds on opposite sides of the boundary
- SAE speakers were guessing/leaving it to chance when discriminating within the categories
- Japanese speakers, having no contrast between the sounds in Japanese, could not distinguish the sounds
are vowels similar in discrimination to consonants? why or why not?
no - there is a perceptual boundary, but it is not a peak in discriminability like it is for consonants; discrimination is more gradient, and people can discriminate within vowel categories as well as between them
what is one hypothesis as to why consonants have a perceptual boundary and vowels don’t?
categorical perception may be limited to rapid, dynamic acoustic properties, like the VOT and F2 formant transitions between consonants and vowels, but vowels have steady-state formant patterns that stay the same for what in speech is a long time
what is speaker normalization?
the listener's ability to adjust to and make sense of differences among speakers, even voices unlike any they have heard before
what are the 3 main ways speakers' voices differ and which one is the MAIN way?
- MOST IMPORTANTLY they differ in formant frequencies
- they differ in f0 (higher or lower pitch) depending on the length of their vocal folds
- voice quality as measured in open quotient or spectral tilt
true or false:
only F1 is higher in women than men
false - F1 and F2 are higher in women than men
men are generally larger than women, and women are larger than children - what does this mean in terms of their voices?
men have longer vocal tracts than women, who have longer vocal tracts than children; therefore men have the lowest resonant frequencies, then women, then children - however, this does not mean that all large people have the deepest voices
true or false:
the difference between men and women lies mainly in the length of the pharynx
true
true or false:
Peter Ladefoged has formant values for his vowels that are close to those of SAE but are not the same vowel and are therefore easily confused
false - though the formant values are close for one vowel said by him and a different one said by a SAE speaker, they do NOT get confused for each other - distinction is NOT ONLY in formant values (speaker normalization)
what is one of the biggest problems when developing automatic speech recognition software?
computers cannot, as easily or as well as humans, perform speaker normalization when encountering a new voice unlike what they’ve heard before
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
with the “normal” carrier sentence test word A (F1: 375 Hz) was identified as “bat”
false - with the “normal” carrier sentence test word A (375 Hz) was identified as “bit”
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
with the “normal” carrier sentence test word B (F1: 450) was identified as “bet”
true
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
with the “normal” carrier sentence test word C (F1: 575 Hz) was identified as “but”
false - with the “normal” carrier sentence test word C (F1: 575 Hz) was identified as “bat”
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
with the “normal” carrier sentence test word D (F1: 600 Hz, F2: 1300 Hz) was identified as “bat”
false - with the “normal” carrier sentence test word D (F1: 600 Hz, F2: 1300 Hz) was identified as “but”
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
when F1 was lowered in the carrier sentence, test word A (375 Hz) started to be identified as “bet”
true - in the low F1 context, the value of 375 Hz counted as high in comparison, so the vowel was judged to be a lower vowel
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
when F1 was raised in the carrier sentence, test word B (450 Hz) started to be identified as “bat”
false - “bet” began to be identified as “bit” because with the context of high F1 values in the carrier, 450 Hz counted as low in comparison so the vowel was judged to be high
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
when F1 was raised in the carrier sentence, test word C (575 Hz) started to be identified as “bit”
false - “bat” started to be identified as “bet” because compared to the high F1 in the carrier, 575 Hz was not that high so the vowel was judged to be mid rather than low
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
true or false:
when F2 was lowered in the carrier sentence, test word D (F1: 600 Hz, F2: 1300 Hz) started to be identified as “but”
false - “but” started to be identified as “bat” because compared to the low F2 values, 1300 Hz was not all that low, and was judged to be front
Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2, in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the next word which subjects had to identify.
what was the conclusion found by this study?
listeners notice where the formants are in vowels from a new speaker and adapt their model of the vowel space to fit the new voice - their expectations change as they learn where the new speaker's vowels are, which can happen in a matter of seconds - this is intelligent problem solving, NOT passive reception
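For illustration only: one standard computational stand-in for this kind of per-speaker adaptation is formant normalization; the Python sketch below uses Lobanov-style z-scoring, a common phonetics technique assumed here for illustration, not the procedure of the 1957 study.

# Hedged sketch: re-express each formant value relative to that speaker's own mean
# and spread, so "high F1 for this speaker" becomes comparable across speakers.
from statistics import mean, stdev

def lobanov(formant_values: list[float]) -> list[float]:
    """Normalize one speaker's values for a single formant (needs at least 2 values)."""
    m, s = mean(formant_values), stdev(formant_values)
    return [(f - m) / s for f in formant_values]

# two (invented) speakers with very different raw F1 ranges end up with the same z-scores
print(lobanov([300.0, 450.0, 600.0]))
print(lobanov([400.0, 600.0, 800.0]))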
Mullenix et al. 1989
- one group of listeners identified lists of words in noise produced by a single speaker, while another group heard the same words produced by multiple speakers
true or false:
the group hearing a single speaker in noise identified the words more slowly and less accurately than those hearing multiple speakers
false - the group hearing a single speaker in noise was faster and more accurate because, over that short period, they were able to learn more about the single voice and improve their processing of that speaker's speech
Nygaard and Pisoni 1998
- subjects listened to samples from 10 different speakers over 10 days; they learned the voices well enough to match a new sample to its speaker, and were then presented with words they had to identify in noise
true or false:
subjects made fewer errors identifying words in noise if it was produced by one of the voices they were already familiar with
true
what is priming?
previous exposure to one stimulus (the prime) improves processing performance (accuracy and speed) on the task with a later stimulus (the target)
Nygaard and Pisoni 1998
- subjects listened to samples from 10 different speakers over 10 days; they learned the voices well enough to match a new sample to its speaker, and were then presented with words they had to identify in noise
how does priming help explain the results this study?
the priming was greater when the prime and the target were produced by the same voice which implies that the voice was part of the memory representation for the prime
Goldinger 1996, 1998
- exposed subjects to words produced by different speakers in a study session, they were tested in various tasks involving those words in a test session (e.g. have you heard this word before in the study session?)
what were the results of this experiment?
the subjects were quicker and more accurate in making this judgement if they heard the word produced by the same speaker who produced it in the study session
Goldinger 1996, 1998
- exposed subjects to words produced by different speakers in a study session, they were tested in various tasks involving those words in a test session - they were asked to identify sounds in the word, discriminate sounds in the word, and repeat the word as quickly as possible (shadowing)
what were the results of these tasks?
all of the tasks were done faster and more accurately if the test word was produced by the same voice that had produced it in the study session
Goldinger 1996, 1998
- exposed subjects to words produced by different speakers in a study session, they were tested in various tasks involving those words in a test session - they were asked to identify sounds in the word, discriminate sounds in the word, and repeat the word as quickly as possible (shadowing)
true or false:
this experiment shows that activating both word and voice at the same time is less effective than just activating the word
false - it is more effective to activate both the word and the voice because they both activate memory of that word and that voice
what is an exemplar?
a stored memory trace of a particular instance - the memory representation of a category like the word "cat" consists of every instance of that word one has ever encountered, organized by recency, speaker, context, etc.
true or false:
if you heard a word recently from a speaker, you can more quickly process that word in a new instance from the same speaker
true - speaker normalization is partially responsible for this
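For illustration only: a minimal Python sketch of exemplar-style classification; the similarity function, weighting, and formant values are invented for illustration, not a model from the readings.

# Hedged sketch: every heard token is stored with its speaker and acoustic values;
# a new token is classified by its summed similarity to all stored exemplars of each word.
import math
from collections import defaultdict

memory: list[tuple[str, str, tuple[float, float]]] = []   # (word, speaker, (F1, F2))

def store(word: str, speaker: str, formants: tuple[float, float]) -> None:
    memory.append((word, speaker, formants))

def classify(formants: tuple[float, float]) -> str:
    scores: dict[str, float] = defaultdict(float)
    for word, _speaker, (f1, f2) in memory:
        dist = math.hypot(formants[0] - f1, formants[1] - f2)
        scores[word] += math.exp(-dist / 100.0)    # closer exemplars count for more
    return max(scores, key=scores.get)

store("bit", "anna", (400.0, 2000.0))
store("bit", "bob", (380.0, 1900.0))
store("bet", "anna", (550.0, 1800.0))
print(classify((420.0, 1950.0)))   # classified as "bit", pulled toward the nearest exemplars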
what is the perceptual challenge of coarticulation?
there is a different version of every phoneme for every preceding sound and for every following sound - the differences can be as large as those between categories
what is the main problem with speech recognition programs?
it is hard for them to account for coarticulation so a speech sound or spoken word from one context won’t match a sample of the same sound or word from another context
true or false:
to counteract coarticulation problems, listeners remember vowels with the sound that precedes it, not by itself
false - they remember the sound that precedes it and the sound that follows it
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j
what would be the expectation for the F2 of the vowel between j__j?
F2 would be high, like it is in j
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j
what would be the expectation for the F2 of the vowel between w__w?
F2 of the vowel would be low, like w
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ
true or false:
when the vowel had a high F2 it was identified as ʊ no matter the context
false - it was identified as I no matter the context
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ
true or false:
when the vowel had a low F2 it was identified as ʊ no matter the context
true
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ
true or false:
when the vowel has intermediate F2 values, it was identified as ʊ more often in the low F2 w__w environment than in isolation
false - it was identified as I
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ
true or false:
when the vowel has intermediate F2 values, it was identified as I less often in the high F2 j__j environment than in isolation
true
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ
how did the identification boundary shift for I and ʊ?
shifted toward lower values of F2 in the low F2 context and toward higher F2 values in the higher F2 context
Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from a clear I with high F2 to a clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ
what is the interpretation (2) of this study’s results?
when listeners hear a vowel in a context that raises F2, such as j__j, they know that part of the height of F2 for that vowel is due to context - to compensate for this effect they raise the F2 boundary between I and ʊ which reduces the range of F2 values that are identified as I
when listeners hear a vowel in a context that lowers F2, such as w__w, they know that part of the lowness of F2 for the vowel is due to context - to compensate for this, they lower the F2 boundary between I and ʊ which increases the range of F2 values that are identified as I
Mann and Repp 1980
- investigating the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify "sh" or "s"
true or false:
listeners had more “sh” responses with higher center freq. (to the left of the chart) than with lower ones (to the right)
false - more “sh” responses with lower central freq. (to the left) than with higher ones (to the right)
what is the difference between s and ʃ ?
both are sibilants with intense noise extending down from the highest freq. but the center freq. of the noise is lower in ʃ than in s (the noise extends down lower in ʃ )
Mann and Repp 1980
- investigating the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify "sh" or "s"
true or false:
listeners had more “sh” responses at the higher center frequencies in the context of ɑ than in the u context
true
Mann and Repp 1980
- investigating the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify "sh" or "s"
what is the interpretation of the results of this study?
listeners take into account the following vowel when identifying a fricative - they know the effects of coarticulation on each sound - because u is rounded it lengthens the vocal tract and lowers the frequency so the center frequency of a sibilant is lower if it is followed by u rather than “a”
Mann and Repp 1980
- investigating the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify "sh" or "s"
true or false:
in order to identify a fricative as ʃ, the center frequency has to be lower before u
true - because listeners attribute some lowness of center freq. there to the vowel context
Mann and Repp 1980
- investigating the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify "sh" or "s"
true or false:
knowing that u has a lowering effect on center frequency, listeners adjust the expected center frequency for s upward
false - they adjust it downward
West 1999
- investigated the coarticulatory effects of the approximants l and r on neighboring vowels - for the words "a berry" and "a belly" the coarticulation effects extended so far that all vowels had differences in F1 - F3 depending on the medial consonant - the liquid was replaced with noise and listeners were asked to identify which word they heard
what did the study find?
the subjects were able to identify the word correctly even when noise covered most of the word, as long as they could still hear the vowel preceding the liquid
West 1999
- investigated the coarticulatory effects of the approximants l and r on neighboring vowels - for the words "a berry" and "a belly" the coarticulation effects extended so far that all vowels had differences in F1 - F3 depending on the medial consonant - the liquid was replaced with noise and listeners were asked to identify which word they heard
what does the knowledge from the results of this study show us?
our identification is noise-resistant because coarticulation spreads out the acoustic evidence a listener can use in identification
what is bottom-up processing?
when one figures out bigger units on the basis of the smaller units they contain - for example, determining what word is being said from the sounds it contains
what is top-down processing?
figuring out the smaller constituent units on the basis of the bigger units that contain them - for example, the identification of speech sounds is informed by our knowledge of words in our language
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
what was the measure of this study?
the percentage of times that the subject identified a stimulus as voiced (d or g) with 2 factors: VOT of the initial stop and word status (word-nonword or nonword-word)
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
true or false:
the hypothesis of the study was that stops with lower VOT values will more often be identified as voiceless than stops with higher VOT values
false - stops with lower VOT values will more often be identified as voiced
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
true or false:
the hypothesis of this study is that all else being equal, listeners will tend to give the identification answer that yields a word
true
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
true or false:
listeners had higher proportion of voiced identification responses when VOT was lower - for BOTH word and nonword conditions
true
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
the proportion of voiced responses was higher in the word-nonword condition when VOT was lower - why might that be?
in that condition a voiced consonant actually made a real word, so people used their knowledge of words to identify the consonant
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
VOT changes affected voicing identification - is this an example of top-down processing or bottom-up processing?
bottom-up processing - the physical properties of an individual sound help identify the word
Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. "dash" and "tash") - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. "dask" and "task") - listeners had to identify the initial sound as voiced or voiceless
word status affected the voicing identification - is this an example of top-down processing or bottom-up processing?
top-down processing - using knowledge of vocabulary to help decide what sound they heard
what are phonotactic restrictions?
generalizations about what sequences of sounds can occur in some position in an utterance for example at the beginning of a syllable
Massaro and Cohen 1983
- speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard
what was the main goal of this study?
it was expected that the lower the F3, the more likely subjects would be to choose r rather than l - but would identification differ depending on the preceding consonant?
Massaro and Cohen 1983
- speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard
true or false:
the highest proportion of r responses was for stimuli beginning with t, which would be incompatible with a following l in English
true
Massaro and Cohen 1983
- speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard
true or false:
the lowest proportion of r responses was for stimuli beginning with s, which is incompatible with a following r
true
Massaro and Cohen 1983
- speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard
p and v responses had mid level proportions - why is that?
p is compatible with either l or r and v is compatible with neither - the effect being tested here does not apply to these
Massaro and Cohen 1983
- speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard
what are the overall results of this study?
when F3 was highest, subjects identified the liquid as l regardless of what preceded it; when F3 was lowest, they identified it as r regardless of what preceded it; but across all F3 values they avoided identifying sound sequences that cannot occur in English
Massaro and Cohen 1983
- speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard
how is phonotactic knowledge being used here?
subjects used their phonotactic knowledge of where English sounds can occur as a comparison to what they heard to see the likelihood of that sound occurring in that context
what is syntax?
how words fit together in sentences
how do we use syntax in speech processing?
we use our knowledge of it to restrict the possible words that could fill a given slot we are trying to identify
Miller, Heise, and Lichten 1951
- presented words (real and nonsense) to subjects at varying signal-to-noise ratios and in different contexts, and subjects had to identify what was being said
true or false:
the higher the signal to noise ratio, the more accurate the identification was
true - quieter noise, louder speech
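For reference only (general acoustics, not a computation from the 1951 paper): signal-to-noise ratio in decibels compares the power of the speech to the power of the noise.

# Hedged sketch: SNR in dB from signal and noise power; example numbers are arbitrary.
import math

def snr_db(signal_power: float, noise_power: float) -> float:
    return 10 * math.log10(signal_power / noise_power)

print(snr_db(4.0, 1.0))   # ~ +6 dB: speech power 4x the noise -> easier identification
print(snr_db(1.0, 4.0))   # ~ -6 dB: noise power 4x the speech -> harder identification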
Miller, Heise, and Lichten 1951
- presented words (real and nonsense) to subjects at varying signal-to-noise ratios and in different contexts, and subjects had to identify what was being said
in what context did subjects do better in?
when real words in sentences were used and even better with digits in a sequence
Miller, Heise, and Lichten 1951
- presented words (real and nonsense) to subjects at varying signal-to-noise ratios and in different contexts, and subjects had to identify what was being said
why was it significantly harder for subjects to identify nonsense words than actual words in a sentence?
for a nonsense word the listener must correctly identify each sound in that word out of an endless possible range - for real words the listener only needs to consider what English word could fit in that context - with digits the pool is even smaller
Warren 1970
- in the sentence “the state governors met with their respective legislatures convening in the capital city”, the first s in “legislature” was replaced with a cough or a tone of the same duration - listeners were given the text and asked to circle the sound that had been replaced when they heard it
true or false:
none of the subjects were able to correctly identify the location
true
Warren 1970
- in the sentence “the state governors met with their respective legislatures convening in the capital city”, the first s in “legislature” was replaced with a cough or a tone of the same duration - listeners were given the text and asked to circle the sound that had been replaced when they heard it
why were none of them able to find it?
the context provided so much information that listeners did not have to hear the first s in the word in order to identify it as “legislatures” - they automatically filled in the missing information based on their knowledge of English words
Warren 1970
- in the sentence “the state governors met with their respective legislatures convening in the capital city”, the first s in “legislature” was replaced with a cough or a tone of the same duration - listeners were given the text and asked to circle the sound that had been replaced when they heard it
both top-down and bottom-up processing could be used here but which one surpasses the other and causes the phenomenon seen?
top-down surpasses bottom-up here - rather than hearing the sounds and determining the word from them, listeners identified the word from the possibilities allowed by their knowledge of English
true or false:
top-down and bottom-up processing are both useful for different contexts and one usually surpasses the other
false - using both at the same time is the most efficient way to process speech rapidly - attack from all angles - and they both can serve as checks for one another in case there is extra noise or unexpected events that make one less effective
Sumby and Pollack 1954
- whether listeners can identify words better in noise if they can see the speaker’s face than if they can just hear their voice - varying noise levels over words in both conditions
true or false:
at higher noise levels the accuracy was lower and the longer the word list the lower the accuracy
true
Sumby and Pollack 1954
- whether listeners can identify words better in noise if they can see the speaker’s face than if they can just hear their voice - varying noise levels over words in both conditions
true or false:
accuracy was greater in the auditory + visual condition, and even more so at higher noise levels
true
true or false:
visual information can change what sounds we actually hear
true
McGurk and MacDonald 1976
- recordings were made of the sounds “baba”, “gaga”, “papa”, and “kaka” and the audios and videos were mismatched - the two conditions were video-audio and audio only shown to adults, preschoolers, and primary school kids - subjects were asked to repeat what they heard
true or false:
the video-audio condition had many errors, with subjects resorting to a compromise between what they saw and heard (e.g. v: “gaga”, a: “baba”, they responded with “dada”)
true
McGurk and MacDonald 1976
- recordings were made of the sounds “baba”, “gaga”, “papa”, and “kaka” and the audios and videos were mismatched - the two conditions were video-audio and audio only shown to adults, preschoolers, and primary school kids - subjects were asked to repeat what they heard
what was the unexpected result gotten from this study?
the children were much less susceptible to the McGurk effect and were usually able to correctly identify the auditory stimulus regardless of the video - however, not perfectly (only 27.5% and 46.5%)
McGurk and MacDonald 1976
- recordings were made of the sounds “baba”, “gaga”, “papa”, and “kaka” and the audios and videos were mismatched - the two conditions were video-audio and audio only shown to adults, preschoolers, and primary school kids - subjects were asked to repeat what they hear
at what age did the study find people have fully learned to use visual information?
after age 8, relatively late
the McGurk effect is very robust - what are some examples?
it held true even when:
- the subjects were informed of how the stimuli were constructed
- the audio and the video were out of sync by as much as 180 ms
- the audio and video were from speakers of different genders
- the audio and video sources were up to 90 degrees apart in location relative to the listener
- the video was reduced to a set of light points corresponding to face locations
response to integrated audiovisual information is particularly strong where?
superior temporal sulcus - below the primary speech processing regions
true or false:
reflexive phonation occurs from 0-2 months and is coughing, sneezing, and crying
true
what is an infants vocal tract like?
like a chimp's - the tongue takes up much of the space in the mouth (more than in adulthood) and the larynx is high enough that there is no appreciable pharynx
at what point in development does a human’s vocal tract develop to the adult form?
the first year
true or false:
cooing occurs from 1-4 months and is quasivocalic sounds
true
true or false:
expansion occurs from 3-8 months and is clear vowels, yells, screams, whispers, and raspberries
true
true or false:
canonical babbling is rhythmically organized, meaningless sequences of speech sounds
false - it is strings of alternating consonants and vowels like “bababa” or “mamama”
when does canonical babbling occur in child development?
5-10 months
true or false:
early babbling sounds different depending on the language environment
false - it sounds the same no matter the language
true or false:
towards the beginning of the babbling process, adults can tell if the babbling is of their language or not
false - they can tell towards the end of the process - their production is gradually tuned to match the language environment
which portion of babbling resembles the intonation and rhythm of the ambient language?
late babbling
at what point in development is the typical onset of meaningful speech?
10 months
true or false:
children’s production abilities are always considerably ahead of their perceptual abilities
false - their perception is always ahead of production
true or false:
fetuses in utero have higher heart rates when listening to a recording of their mother's voice than to a recording of any other person
true
true or false:
the speech children produce is representative of what they know about their language
false - it is never fully representative
true or false:
children can distinguish sounds they can not produce themselves
true
Werker and Tees 1984
EXPERIMENT 1
- recordings of English speakers' "da" and "ba", Thompson Salish speakers' "k'i" and "q'i", and Hindi speakers' "ta" and "ʈa" - 6-month-olds, English-speaking adults, and Thompson Salish-speaking adults were asked to indicate whether they heard "k'i" or "q'i" (button press or head turn) - the criterion is 8/10 correct responses
true or false:
infants from an English speaking environment were much better at distinguishing the sounds than English speaking adults and were almost as good as the Thompson Salish adults
true - infants > adults
Werker and Tees 1984
EXPERIMENT 1
- recordings of English speakers' "da" and "ba", Thompson Salish speakers' "k'i" and "q'i", and Hindi speakers' "ta" and "ʈa" - infants of 6-8 months, 8-10 months, and 10-12 months were tested with the head-turn procedure - the criterion is 8/10 correct responses
what was the results of this study?
the youngest group vastly outperformed the others - most in the oldest group couldn’t even reach the criterion
Werker and Tees 1984
EXPERIMENT 1
- recordings of English speakers' "da" and "ba", Thompson Salish speakers' "k'i" and "q'i", and Hindi speakers' "ta" and "ʈa" - infants of 6-8 months, 8-10 months, and 10-12 months were tested with the head-turn procedure - the criterion is 8/10 correct responses
what type of experiment is this?
cross-sectional study - different ages
Werker and Tees 1984
EXPERIMENT 3
- recordings of English speakers' "da" and "ba", Thompson Salish speakers' "k'i" and "q'i", and Hindi speakers' "ta" and "ʈa" - the same group of children was tested over time with the head-turn procedure - the criterion is 8/10 correct responses
at what point in their development were the children best at discriminating the sounds?
at what point did they lose all ability to do so?
6-8 months
a year
as children focus on a particular language to master in their environment they lose something else - what is it?
the ability to distinguish sounds they aren't exposed to regularly - they go from versatile generalists -> specialists
Kuhl, Tsao and Liu 2003
- 32 infants aged average 9.3 months, all English environment no Mandarin - 16 were exposed to Mandarin, other 16 exposed to English - tested if they could differentiate between 2 Mandarin sounds
true or false:
children from a Mandarin-speaking environment scored higher at differentiating the sounds than the English-environment group exposed to Mandarin
false - they scored about the same
Kuhl, Tsao and Liu 2003
- 32 infants aged average 9.3 months, all English environment no Mandarin - 16 were exposed to Mandarin interaction, other 16 exposed to English interaction - tested if they could differentiate between 2 Mandarin sounds
true or false:
the English kids who were exposed to Mandarin scored an average of 65.7% while the kids exposed to English scored 56.7%
true
Kuhl, Tsao and Liu 2003
EXPERIMENT 2
- 32 infants aged average 9.3 months, all English environment no Mandarin - 16 were exposed to Mandarin video, other 16 exposed to English video - tested if they could differentiate between 2 Mandarin sounds
true or false:
the kids exposed to Mandarin did not gain any advantage in the test because they were not being interacted with - they were just listening to a video
true
true or false:
phonetic learning is socially driven
true - interaction required
Kuhl, Tsao and Liu 2003
- 32 infants aged average 9.3 months, all English environment no Mandarin - 16 were exposed to Mandarin, other 16 exposed to English - tested if they could differentiate between 2 Mandarin sounds
what was the conclusion of this study?
even limited exposure to another language can improve discrimination of sounds in that language
true or false:
liquids are mastered late
true
true or false:
trills are mastered early
false - late
true or false:
fricatives are mastered early
false - late
true or false:
stops (nasal and oral) are mastered early
true
true or false:
vowels are mastered early
true
true or false:
alternating consonant-vowel (CV) sequences are mastered early
true
true or false:
when children can’t pronounce words while learning to speak they just ignore/skip over the word
false - they systematically replace sounds they can’t produce with ones they can (patterns of replacement)
what are some of the common replacement patterns children use? (5)
- replacing non-stops with stops (John -> don)
- cluster simplification (spoon -> boon)
- replacing consonants so they are harmonious in place of articulation (sock-> gock)
- changing voicing patterns (initial stop always voiced, final always voiceless)
- final consonants are deleted
at what point in development is there usually an explosion of vocabulary for children?
18 months
true or false:
children take a while to speak because they are held back by their inability to distinguish adult sounds
false
true or false:
children learn to speak late because the relevant muscles of their vocal tracts are not yet strong enough (will be after about a year)
false - muscle strength is adequate by about a year; the real challenge is the coordination and control of speech gestures
why are stops easy for children to master?
they require little motor control, the tongue or lip just has to move to touch an opposing surface
why are vowels easy for children to master?
they require little motor control, and each vowel has a relatively wide range of acceptable vocal tract shapes
why are consonant-vowel alternations easy for children to master?
they are just a repeated sequence of opening and closing gestures
why are fricatives and approximants harder for children to master?
the tongue or lip has to be very precisely positioned to form a passageway narrow enough for turbulence but not too narrow
why are the sounds l and r particularly hard for children to master?
they require precision AND require different parts of the tongue to make separate closures - at first children only use the tongue as one mass
true or false:
the sounds of one’s language aren’t actually simpler to produce or distinguish, the native speaker is just more used to them
true
true or false:
it wasn’t until the 1980’s that psychologists and linguists started doing systematic acoustic studies of early speech
false - the 1970s
true or false:
new speech skills are mastered by kids instantaneously
false - it was believed to be the case because the gradual changes in children’s productions are too small for adults to hear
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
what are the expected VOT results for adults to produce?
in English, word-initial voiced stops have a small positive VOT, while voiceless stops are aspirated with a large positive VOT
what is VOT?
the time interval from the release of a stop to the onset of voicing
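(Illustration, not from the course materials: a minimal Python sketch of the VOT arithmetic, assuming hand-labeled times for the stop release and the onset of voicing; all time values are made up.)

def vot_ms(release_time_s, voicing_onset_s):
    # Voice onset time in milliseconds: onset of voicing minus stop release.
    # Positive = voicing lags the release; negative = voicing starts before the release (prevoicing).
    return (voicing_onset_s - release_time_s) * 1000.0

print(vot_ms(0.100, 0.115))   # +15 ms: short-lag, an English-style voiced stop like "b"
print(vot_ms(0.100, 0.170))   # +70 ms: long-lag (aspirated) voiceless stop like "p"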
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
true or false:
in early sessions kids generally had negative VOT for both voiced and voiceless stops
false - they had a small positive VOT for both voiced and voiceless stops - these were voiceless unaspirated stops, often misheard by transcribers as voiced
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
true or false:
over the course of the study, the difference in VOT between voiced and voiceless grew, mainly through an increase in the VOT of the voiced class
false - an increase in VOT of the voiceless class
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
why were the changes in kids VOT previously seen as instantaneous?
the difference was too small for adults to hear
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
at what point could the adults tell the difference between the children’s productions?
when the children’s VOT reached the adults’ average VOT for that sound
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
true or false:
this was one of the first studies to show the gradualness of acquiring phonetic mastery
true
Macken and Barton 1980
- 4 children followed for 8 months starting at age 1.5, meeting every 2 weeks to record them playing and answering questions - word-initial stops (b, d, g, p, t, k) were extracted and their VOT was measured
at what age did they find that the difference in voiced and voiceless stops is acquired?
around 2 years
by what age have children generally mastered almost all the sounds of their language?
5 years old
true or false:
children acquire language with instruction
false - no instruction
true or false:
the earlier one is exposed to a language, the more likely it is they will attain a native level knowledge of it
true
what makes up a foreign accent?
a pattern of pronunciation of a language by someone who applies the habits of their L1 to the speaking of L2
what is an early bilingual?
someone who learned both languages early enough to be a native speaker of both
what is a late bilingual?
someone who acquired more than one language late enough not to be a native speaker of it/them
by what age can bilinguals be exposed to an L2 and speak it without a detectable foreign accent?
age 6
true or false:
with first exposure to a language after 12, the speaker will generally have a foreign accent in the L2
false - the age is 13, not 12: first exposure at 13 or later generally results in a foreign accent
true or false:
learners can lessen their L1 habits in the L2 even while still being regularly exposed to L1
false - they are less likely to shake those habits if they continue to have regular exposure to L1
what is being referenced when saying the “age of first exposure” to a language?
when the person moved to where the L2 is spoken, NOT when they started classes
true or false:
after age 6 it is not possible to become fluent in another language
false - it is likely you will have an accent but fluency is possible with practice
what is code-switching?
when a bilingual switches from one language to another under the control of the speaker
true or false:
when a fluent bilingual code-switches, the two languages become intertwined, with rules and patterns shifting into each other
false - the languages are autonomous and separable
what is cross-linguistic priming?
exposure to an item in one language facilitates processing of a related item in another language
Kim et al. 1997
- two groups of speakers, early and late bilinguals, were asked to silently describe typical events in their lives in both L1 and L2 while fMRI monitored their brain activity
true or false:
there was greater overlap in the areas of activity in early bilinguals than late
true
Kim et al. 1997
- two groups of speakers, early and late bilinguals, were asked to silently describe typical events in their lives in both L1 and L2 while fMRI monitored their brain activity
what are the results found in reference to the prime areas of speech processing?
those areas are occupied early in life and are not available for learning languages later; late L2 acquirers have to use brain areas away from the L1 centers (on the scan, early bilinguals had two colors overlapping greatly, while late bilinguals had completely separate colors next to each other)
what is the transfer effect?
when the deeply entrenched set of automatic habits for L1 is applied to L2
true or false:
unfamiliar sounds in L2 are replaced with familiar sounds of L1
true - systematically replaced
most dialects in Spanish don’t have a distinction between what?
tense and lax vowels
when a Spanish speaker is speaking English, one replacement strategy might be to switch which vowels for which?
the lax vowels of English with the tense vowels of Spanish
Spanish speakers of English are likely to replace English diphthongs with what?
their closest equivalent monophthongs [e] and [o]
the most common English vowel [ə] is often replaced by Spanish speakers with what?
[a] - only Spanish central vowel
in Spanish voiceless plosives p, t, and k are what?
unaspirated in all positions
when a Spanish speaker speaks English they will generally replace the voiceless plosives at the syllable-initial position with what?
unaspirated plosives
Spanish speakers tend to replace the English r with what?
[ɾ] the alveolar tap
Spanish speakers tend to replace English voiced fricatives with what?
voiceless Spanish ones
true or false:
Spanish speakers often do consonant deletion or vowel insertion in English words with three consonants at the onset
true - Spanish can only have at most 2 consonants there
true or false:
Spanish speakers often do consonant deletion when speaking English words that have 3 consonants in the final position
true - Spanish only allows for at most 2 consonants there
true or false:
everyone starts replacing L2 sounds with L1 equivalents
true
why do different speakers of an L2 have different accents?
they vary in where they are in the learning process and at what age they began
Flege 1991
- compared VOT in word-initial /t/ among Spanish speakers, Spanish learners of English, and English monolinguals
true or false:
when speaking Spanish, Spanish monolinguals, early bilinguals, and late bilinguals all had VOT values in the same range
true
Flege 1991
- compared VOT in word-initial /t/ among Spanish speakers, Spanish learners of English, and English monolinguals
true or false:
late bilinguals (those who learned English after age 6) produced /t/ with a VOT in between that for Spanish and English
true
Flege 1991
- compared VOT in word-initial /t/ among Spanish speakers, Spanish learners of English, and English monolinguals
true or false:
when speaking English, early Spanish-English bilinguals had the same VOT values as English monolinguals
true
if a German speaker were speaking English, what replacement would they make at the end of a word due to their L1 being German?
German has no voiced final obstruents so they would replace the voiced final obstruents of English with their voiceless counterparts (ex: Bob -> Bop)
German has no dental fricatives, so they replace English ones with what?
stops or affricates
what is the main factor of a foreign accent?
the sounds in L2 that have no counterpart in L1 will tend to be replaced by the closest sound in L1
besides replacing sounds with similar ones, L2 speakers often will replace what?
replace any sound that occurs in both L1 and L2 if it is occurring in a position in which it couldn’t occur in L1
why is a foreign accent so persistent?
over our lifetimes we learn processes of production and perception that become automatic, which lets us speak and keep up quickly - mastering these skills becomes a liability when learning another language because they are so automatic and ingrained
true or false:
consciously realizing that a similar sound in L1 and L2 is actually different can change the unconscious automatic process of producing and perceiving it
false - it does NOT change the unconscious process
why do early bilinguals not have a foreign accent?
they are exposed to both languages early enough to build separate categories for each and have no trouble keeping them separate
true or false:
the hardest sounds for L2 learners in the long run are the new sounds unlike anything they’ve heard before
false - the hardest are the “false friends” that are close to those in L1 but not the same
why are the most similar sounds hardest to master?
because they are similar enough to sounds we already know that we subconsciously treat them as the same and just replace them with the L1 versions
Flege and Hillenbrand 1984
- speakers with varying amounts of French experience produced French sentences including the words “tous” and “tu” - [y] in “tu” has no English counterpart but [u] in “tous” is close - native French listeners had to identify which word they heard
true or false:
for the group with the least French experience, their “tu” was much more easily identified by the French listeners than their “tous”, meaning they pronounced it better
true
Flege and Hillenbrand 1984
- speakers with varying amounts of French experience produced French sentences including the words “tous” and “tu” - [y] in “tu” has no English counterpart but [u] in “tous” is close - native French listeners had to identify which word they heard
true or false:
there was no significant difference in the ability of the French listeners to identify the “tu” of the most experienced and the least experienced learners
true - they were equally good at producing the unfamiliar sound
Flege and Hillenbrand 1984
- speakers with varying amounts of French experience produced French sentences including the words “tous” and “tu” - [y] in “tu” has no English counterpart but [u] in “tous” is close - native French listeners had to identify which word they heard
true or false:
the results were that the non-native speakers got closer in F2 values to the native French for the familiar [u] than for the new [y]
false - they were closer for the newer [y]
what is speech technology?
any interface between humans and computers involving speech
what is speech recognition?
automatic identification of spoken words
what is speaker recognition?
automatic identification of the person who spoke
what is speech synthesis?
the production of speech by machines
why do companies want to use more speech recognition?
the more they can automate customer service and sales, the fewer human employees they need to pay
true or false:
humans are more comfortable typing than speaking so they generally prefer a typing interface to a speech one
false - they prefer speaking to typing
why does the government want to invest in speech recognition?
they want an automatic method for filtering recorded speech to locate particular references or voices
how does speech recognition work?
digitized recordings of speech samples (numerical versions of spectrograms) are stored in memory, each labeled with what is said; when a new word is spoken, it is digitized too and compared point by point with every sample in memory; the program selects the soundfile in memory with the smallest summed difference from the new soundfile
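(A minimal sketch of that point-by-point template matching, in Python; the stored spectrogram arrays, word labels, and array shapes are invented for illustration and are not the actual program described above.)

import numpy as np

# Hypothetical stored samples: word label -> quantized spectrogram,
# a 2-D array of amplitudes (time frames x frequency bands).
templates = {
    "yes": np.random.rand(40, 20),
    "no":  np.random.rand(35, 20),
}

def summed_difference(a, b):
    # Point-by-point comparison; here we crudely truncate to the shorter
    # file instead of time-warping (see the alignment cards below).
    n = min(len(a), len(b))
    return np.abs(a[:n] - b[:n]).sum()

def recognize(new_spectrogram):
    # Pick the stored sample with the smallest summed difference.
    return min(templates, key=lambda w: summed_difference(templates[w], new_spectrogram))

print(recognize(np.random.rand(38, 20)))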
in a quantized spectrogram, what do the lighter vs. darker colors represent?
the lighter are lower amplitude, the darker are higher amplitude
what is the challenge of alignment in speech recognition?
it’s hard to know which points in the new soundfile should be compared with which points in the stored sample, because even two productions of the same word by the same speaker won’t sound the same
how is the challenge of alignment between two files solved?
expanding or contracting the timescale to find the best match (time warping)
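(A minimal dynamic-programming sketch of time warping, assuming each soundfile has been reduced to a 1-D sequence of frame values; a real recognizer compares whole spectral frames, but the stretch-and-compress logic is the same.)

import numpy as np

def dtw_distance(x, y):
    # Cheapest way to align x and y when either timescale may be stretched or compressed.
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # frame-level difference
            cost[i, j] = d + min(cost[i - 1, j],      # this y frame absorbs another x frame
                                 cost[i, j - 1],      # this x frame absorbs another y frame
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]

# The same "word" said slowly and quickly still matches better than a different word.
slow  = [0, 0, 1, 1, 2, 2, 3, 3]
fast  = [0, 1, 2, 3]
other = [3, 3, 2, 1]
print(dtw_distance(slow, fast), dtw_distance(slow, other))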
what is the challenge of segmentation for speech recognition?
people don’t pause between words so it’s not clear what interval in the soundfile needs to match to the file in memory
how is the challenge of segmentation solved in speech recognition?
the system pushes for one-word answers, or it has to try different segmentations and check which gives the best fit
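(A toy sketch of the try-every-segmentation idea for an utterance assumed to contain exactly two words, using made-up 1-D "templates"; a real system scores spectrogram segments, but the search is the same.)

import numpy as np

# Hypothetical one-word templates (1-D amplitude envelopes for simplicity).
templates = {"one": np.array([1., 2., 3., 2., 1.]),
             "two": np.array([3., 3., 1., 1.])}

def score(segment):
    # Best-matching word and its summed difference for one candidate segment.
    def diff(t):
        n = min(len(t), len(segment))
        return np.abs(t[:n] - segment[:n]).sum()
    word = min(templates, key=lambda w: diff(templates[w]))
    return word, diff(templates[word])

def best_two_word_segmentation(utterance):
    # Try every split point and keep the one whose two halves fit the templates best.
    best = None
    for k in range(1, len(utterance)):
        (w1, d1), (w2, d2) = score(utterance[:k]), score(utterance[k:])
        if best is None or d1 + d2 < best[0]:
            best = (d1 + d2, (w1, w2), k)
    return best

# "one" followed by "two" with no pause in between:
print(best_two_word_segmentation(np.array([1., 2., 3., 2., 1., 3., 3., 1., 1.])))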
what is the problem of vocabulary size for speech recognition?
the larger the vocabulary the system has to keep in memory, the more words it has to search through, so recognition takes longer and incorrect responses become more likely
what is the challenge of variability in speech recognition?
any given word is pronounced differently by different speakers - early dictation programs were speaker-dependent; to be speaker-independent, a system must store samples from a huge variety of speakers
in order to be effective a speech recognition program needs to be what?
adaptive
do successful speech recognition programs use top-down or bottom-up processing?
top-down processing
true or false:
the challenges for speech tech are the same challenges faced by human listeners
true - segmentation, variability due to speaker, variability due to context, and speech errors
true or false:
a program can distinguish between two voices that it’s never heard before
false - it can’t
how does speech synthesis work?
from digital recordings, the sound is converted into a series of numbers representing the amplitude at each instant in each frequency band, and the machine regenerates a waveform from those numbers
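(A minimal additive-synthesis sketch of turning such numbers back into sound, with assumed band centre frequencies and made-up amplitude values; it ignores phase continuity between frames, one of several simplifications that make toy synthesis sound robotic.)

import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = 160                                       # 10 ms frames
band_freqs = np.array([200., 400., 800., 1600.])      # assumed band centres in Hz

# amplitudes[t, b] = amplitude of band b during frame t (invented values)
amplitudes = np.array([[1.0, 0.5, 0.2, 0.1],
                       [0.8, 0.6, 0.3, 0.1],
                       [0.2, 0.9, 0.5, 0.2]])

frames = []
for frame in amplitudes:
    t = np.arange(FRAME_LEN) / SAMPLE_RATE
    # One sinusoid per frequency band, scaled by that band's amplitude for this frame.
    frames.append(sum(a * np.sin(2 * np.pi * f * t) for a, f in zip(frame, band_freqs)))
waveform = np.concatenate(frames)
print(waveform.shape)   # (480,) samples, about 30 ms of audio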
why do speech syntheses not sound like humans?
they typically get the intonation wrong, and do not accurately mimic the effects of coarticulation