Exam 3 Flashcards

1
Q

what is the external part of the ear called?

A

the pinna

2
Q

what does the outer ear consist of?

A

pinna, ear canal, and the eardrum

3
Q

what does the middle ear consist of?

A

from the eardrum to the oval window; contains three small bones: the malleus, incus, and stapes

4
Q

passage through the middle ear does what to the sound?

A

amplifies it

5
Q

what does the inner ear consist of?

A

the semicircular canals and the cochlea

6
Q

what happens in the cochlea?

A

mechanical sound waves are converted to electrical nerve impulses

7
Q

for an unwound cochlea, there is a thicker end and a thinner end - which is which (Apical or Basal)?

A

thicker is the Apical end, thinner is the Basal end

8
Q

for an unwound cochlea, which frequencies do the Apical and Basal ends move more for?

A

the Apical end moves more for lower frequencies because thicker = lower resonant frequency
the Basal end moves more for higher frequencies because thinner = higher resonant frequency
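
As rough physical intuition (a textbook mass-spring idealization, not from the deck), a resonator's natural frequency rises with stiffness and falls with mass:

    f_0 = \frac{1}{2\pi}\sqrt{\frac{k}{m}}

so the thicker, heavier apical end resonates at lower frequencies than the thin, stiff basal end.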

9
Q

the basilar membrane has a thick end and a thin end - which is which (Apical and Basal)?

A

Apical is thick and Basal is thin

10
Q

the basilar membrane is tonotopically organized - what does that mean?

A

different locations on the membrane correspond to different frequencies

11
Q

Denes and Pinson 1993 : 90
- shows how far the basilar membrane is pushed out of place by different frequencies

What conclusion was found?

A

lower frequencies peak farther from the stapes (25 Hz -> 30 mm) than higher frequencies do (1600 Hz -> 17 mm)

12
Q

explain how hair cells work - what is their role?

A

they are attached to the basilar membrane; a hair cell fires if movement of the basilar membrane pushes the cell sufficiently far out of position

13
Q

what is the response curve of a hair cell?

A

shows the lowest intensity at which a pure tone at a given frequency triggers firing of the cell - the low point shows the frequency the hair cell responds to most readily - the closer to the apical (thick) end, the lower the resonant frequency

14
Q

Moore 1997 : 33
- shows the response curves for different hair cells

What does the lowest point show?

A

the lowest point is the characteristic frequency - the frequency at which the cell will fire at the lowest amplitude
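
A minimal Python sketch of reading a characteristic frequency off a response curve (the threshold values are invented for illustration): it is simply the frequency with the lowest firing threshold.

    # hypothetical tuning curve: pure-tone frequency (Hz) -> threshold intensity (dB)
    tuning_curve_db = {250: 62, 500: 40, 1000: 18, 2000: 35, 4000: 58}

    # the characteristic frequency is the low point of the curve
    characteristic_freq = min(tuning_curve_db, key=tuning_curve_db.get)
    print(characteristic_freq)  # 1000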

15
Q

what is the most important factor of a hair cell?

A

the location of it - they are all the same otherwise

16
Q

what causes hearing loss at certain frequencies?

A

the hair cells are pushed too far and sheared off

17
Q

how are the outer hair cells different from the inner hair cells?

A

when the outer hair cells fire they change length to push back on the basilar membrane and amplify the signal

18
Q

Denes and Pinson 1993 : 95
- shows humans' hearing range

What does this show about our speech sounds as humans?
What is the peak sensitivity?

A

speech sounds evolved to be where our hearing is particularly good - the peak sensitivity is between 1000 and 10,000 Hz

19
Q

tonotopically organized signals from the ear are passed to the brain through what?

A

the auditory nerve, through various bodies in the brainstem and to the cerebrum (uppermost and outermost part of the brain)

20
Q

signals from the right ear are passed to where? what is this called?

A

the left hemisphere of the brain - decussation

21
Q

where is the auditory cortex located and what does it border?

A

in the temporal lobe of each hemisphere of the cerebral cortex, on the superior temporal gyrus (STG) - borders the lateral (Sylvian) fissure

22
Q

what is the primary auditory cortex?

A

entryway into the cerebral cortex for signals from the ears

23
Q

how is the primary auditory cortex organized?

A

tonotopically - different locations correspond to different frequency bands

24
Q

the frequency-based locations in the primary auditory cortex correspond to what?

A

frequency-sensitive locations on the basilar membrane

25
Q

damage to the primary auditory cortex could cause what?

A

aphasia

26
Q

Bear et al. 2007
- both hemispheres of the brain have an auditory cortex

But what?

A

but one is dominant for speech processing - the left for 93% of people (96% of right-handed, 70% of left-handed)

27
Q

what is dichotic listening?

A

a task in which different stimuli are presented to the two ears simultaneously - speech materials are processed in the hemisphere opposite the ear that receives them, so there is often a right-ear processing advantage for speech but NOT for non-speech sounds like music or humming

28
Q

what is Wernicke’s Area and where is it?

A

the middle region of the STG; if injured, it causes problems with speech perception and comprehension (Wernicke’s Aphasia)

29
Q

where is Wernicke’s Area in relation to the auditory cortex?

A

posterior to it - the auditory cortex is the “bottom” part of the STG

30
Q

when and who discovered Wernicke’s area?

A

1874, by the German neurologist Carl Wernicke - it was early evidence for brain area specialization

31
Q

True or false:

electrical stimulation of Wernicke’s area interferes with identification of speech sounds, discrimination between speech sounds, and comprehension of speech

A

true

32
Q

what are combination-sensitive neurons?

A

found in the STG, they respond to particular patterns of frequency and amplitude - they fire only if a particular combination of primary cells is activated

33
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false: this study was testing to see which parts of the brain activated when the patient was producing speech and when they were not.

A

False - the study was to see which parts of the brain were active when speech was playing, but inactive during silence

34
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false: the patients passively listened to 500 samples of SAE sentences

A

True

35
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false: researchers found that when passively listening to speech, the STG was activated constantly, but was not in silence

A

False - different groups of neurons in the STG activated for different classes of sounds

36
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false:
e1 responded to the sibilant fricatives /s, ʃ , z/

A

False - e1 responded to the plosives /b, d, g, p, t, k/

37
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false:
e2 responded to the sibilant fricatives /s, ʃ , z/

A

true

38
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false:
e3 responded to the low-back vocoids (vowels and glides) /ɑ, aʊ/

A

true

39
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false:
e4 responded to the plosives /b, d, g, p, t, k/

A

False - e4 responded to the high-front vocoids /i, j/

40
Q

Mesgarani, Cheung, Johnson and Chang (2014) - study of 6 adults whose skulls were opened for epilepsy surgery; electrodes were placed on the surface of the left STG (electrocorticography).

True or false:
e5 responded to nasals /m, n, ŋ/

A

true

41
Q

Mesgarani, Cheung, Johnson and Chang (2014)
What is PSI?

A

phoneme selectivity index - it represents the number of other phonemes statistically distinguishable from that phoneme in the response of a specific electrode

42
Q

Mesgarani, Cheung, Johnson and Chang (2014)
what does a PSI = 0 mean?

A

that electrode does NOT distinguish between that phoneme and any others

43
Q

Mesgarani, Cheung, Johnson and Chang (2014)
true or false:
PSI = 32 means the electrode can detect 32 phonemes

A

false - it means the electrode is maximally selective, the phoneme is distinguishable from all other phonemes in the response of that electrode
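
A hedged Python sketch of a PSI-style count (the response data and the use of a t-test are assumptions for illustration, not the paper's exact statistic): for one electrode, count how many other phonemes are statistically distinguishable from a target phoneme.

    from scipy.stats import ttest_ind

    def psi_like(responses, target, alpha=0.05):
        """responses: dict mapping each phoneme to a list of response amplitudes."""
        count = 0
        for phoneme, values in responses.items():
            if phoneme == target:
                continue
            _, p = ttest_ind(responses[target], values)
            if p < alpha:
                count += 1  # target is distinguishable from this phoneme
        return count  # 0 = no selectivity; with 33 phonemes, 32 = maximally selective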

44
Q

Mesgarani, Cheung, Johnson and Chang (2014)
true or false:
neurons sensitive to a particular acoustic combination are located near neurons sensitive to similar combinations and are therefore tonotopically organized

A

false - they are organized by phonetic category

45
Q

Mesgarani, Cheung, Johnson and Chang (2014)
true or false:
from the basilar membrane to the primary auditory cortex, sound is represented tonotopically in the form of a time-varying frequency spectrum, corresponding to a spectrogram

A

true

46
Q

what is a category?

A

a set of entities or events that all elicit an equivalent response

47
Q

categories are essential to learning and cognition - why?

A

we can only generalize particular experiences to general knowledge through the use of categories

48
Q

true or false:
speech categories are the same across people and situations

A

false - they vary greatly from speaker to speaker and context to context; each person has a broad range of phonetic events they pull from to decode a word or sound

49
Q

true or false:
an acoustic continuum is a series of items that differ gradiently for a series of acoustic properties

A

false - the items differ in only ONE acoustic property, not multiple

50
Q

true or false:
an F1 continuum would be a series of items that have the same F1 but are different in other aspects

A

false - the items would differ ONLY in F1

51
Q

true or false:
in an F1 continuum, each item in the series differs from the preceding member by the same F1 step

A

true
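
A minimal Python sketch of building such a continuum (the endpoint and step values are hypothetical): every item differs from its neighbor by the same F1 increment and in nothing else.

    # eleven synthetic items from 300 Hz to 800 Hz in equal 50 Hz steps
    start_hz, end_hz, n_items = 300, 800, 11
    step_hz = (end_hz - start_hz) / (n_items - 1)
    f1_continuum = [start_hz + i * step_hz for i in range(n_items)]
    print(f1_continuum)  # [300.0, 350.0, ..., 800.0]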

52
Q

Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language

true or false:
the space where the lines for either sound meets is called the perceptual/identification boundary

A

true

53
Q

Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language

true or false:
the study found that at low VOT, English speakers identified the stop as voiceless 100% of the time

A

false - at low VOT the subjects identified the sounds as VOICED 100% of the time

54
Q

Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language

true or false:
at high VOT the English subjects identified the stop as voiceless 100% of the time

A

true

55
Q

Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language

true or false:
the perceptual / identification boundary is where subjects were able to tell the stops apart 100% of the time

A

false - the boundary is where they identified the stimulus as voiced 50% of the time and voiceless 50% of the time
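
A Python sketch of locating that 50% crossover by linear interpolation (the response proportions below are invented, not Lisker & Abramson's data):

    # identification data: VOT step (ms) -> proportion of "voiced" responses
    vot_ms      = [-20, -10, 0, 10, 20, 30, 40, 50]
    prop_voiced = [1.0, 1.0, 0.98, 0.9, 0.55, 0.15, 0.02, 0.0]

    def boundary(x, y, criterion=0.5):
        # find the interval where the curve crosses the criterion, then interpolate
        for (x0, y0), (x1, y1) in zip(zip(x, y), zip(x[1:], y[1:])):
            if (y0 - criterion) * (y1 - criterion) <= 0:
                return x0 + (criterion - y0) * (x1 - x0) / (y1 - y0)

    print(boundary(vot_ms, prop_voiced))  # 21.25 (ms)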

56
Q

true or false:
Lisker and Abramson (1964) found that further forward places of articulation are associated with greater VOT values

A

false - places of articulation that are further back are associated with greater VOT values

57
Q

Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language

true or false:
a conclusion drawn from this study is that the identification boundary is at a lower VOT for alveolars and velars than for bilabials

A

false - alveolar and velar sounds are further back in the mouth = greater (higher) VOT, so their identification boundaries are at higher VOT values

58
Q

what is categorical perception?

A

listeners ignore the differences of sounds on the same side of the perceptual boundary and only discriminate sounds that lie on opposite sides

58
Q

Lisker & Abramson 1970
- varied VOT in word-initial stops, using speech synthesis from -150 to 150 ms in 10 ms steps for each place of articulation (bilabial, apical, velar) - subjects who spoke Thai, Spanish, and English were asked to identify the initial consonant of the stimulus among a choice of sounds in their language

true or false:
this study found that listeners differentiate sounds within each side of the perceptual boundary

A

false - they ignore the differences of those on the same side and only discriminate sounds that lie on opposite sides of the boundary

59
Q

Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition, from way above the F2 steady-state to way below it (hand-drawn, they looked like eyebrows) - subjects were asked to identify the consonant as b, d, or g

true or false:
when F2 pointed down, subjects identified the consonant as d

A

false - F2 pointing down was identified as b

60
Q

Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition, from way above the F2 steady-state to way below it (hand-drawn, they looked like eyebrows) - subjects were asked to identify the consonant as b, d, or g

true or false:
when F2 was flat, subjects identified the consonant as g

A

false - F2 was flat it was identified as d

61
Q

Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition, from way above the F2 steady-state to way below it (hand-drawn, they looked like eyebrows) - subjects were asked to identify the consonant as b, d, or g

true or false:
when F2 pointed up, subjects identified the consonant as b

A

false - F2 pointed up was identified as g

62
Q

Liberman et al 1957
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition, from way above the F2 steady-state to way below it (hand-drawn, they looked like eyebrows) - subjects were asked to identify the consonant as b, d, or g

there is only one ambiguous stop - between which two stimuli does it lie?

A

between stimuli #3 (almost always b) and #5 (almost always d) - the boundary between b and d

63
Q

Liberman et al 1957
discrimination experiment:
- synthesized a series of stop-vowel syllables that were alike in steady-state values of F1 and F2 - they only differed in the onset value of the initial F2 transition, from way above the F2 steady-state to way below it (hand-drawn, they looked like eyebrows) - subjects listened to a series of 3 syllables (b, d, or g) together (e.g. ABX) where A and B are different and X is identical to either A or B

true or false:
if two of the syllables were within the same category (same side) subjects found it hard to discriminate between them

A

true
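
A toy Python sketch of the categorical-perception prediction behind this ABX result (the step numbers and boundary location are invented): discrimination beats chance only when the two stimuli receive different labels.

    # identification: continuum steps below the boundary are labeled "b"
    def label(step, boundary=4):
        return "b" if step < boundary else "d"

    # ABX discrimination is predicted to succeed only across the boundary
    def predicted_discriminable(a, b):
        return label(a) != label(b)

    print(predicted_discriminable(2, 3))  # False - same side, near chance
    print(predicted_discriminable(3, 5))  # True - spans the boundary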

64
Q

why might humans be more sensitive to acoustic cues that distinguish categories and insensitive to those within the categories?

A

because the acoustic differences within categories do NOT help with our goal of identifying what sound is being produced

65
Q

Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ]; the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and were asked to determine whether they heard l or r (law or raw).

true or false:
stimuli with a low F3 in the consonant were identified as “l” nearly 100% of the time

A

false - stimuli with a low F3 were identified as “r” nearly 100% of the time

66
Q

Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ]; the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and were asked to determine whether they heard l or r (law or raw).

true or false:
stimuli with a high F3 in the consonant were identified as “l” nearly 100% of the time

A

true

67
Q

Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ]; the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and were asked to determine whether they heard l or r (law or raw).

there was one stimulus that could not be clearly assigned by the subjects - what does this mean and which one was it?

A

stimulus 7 - it lay at the identification boundary between l and r

68
Q

Miyawaki et al. 1975
- synthesized syllables with a sonorant consonant followed by [ɑ]; the only difference was the frequency of F3 in the consonant (r, l) - subjects (SAE and Japanese) heard each in random order and were asked to determine whether they heard l or r (law or raw).

what were the three main findings of this study?

A
  1. SAE speakers did well distinguishing the sounds on opposite sides of the boundary
  2. SAE speakers were guessing/leaving it to chance when discriminating within the categories
  3. Japanese speakers, having no contrast between the sounds in Japanese, could not distinguish the sounds
69
Q

are vowels similar in discrimination to consonants? why or why not?

A

no - there is a perceptual boundary, but it is not a peak in discriminability as with consonants; discrimination is gradable, and people can discriminate within vowel categories as well as between them

70
Q

what is one hypothesis as to why consonants have a perceptual boundary and vowels don’t?

A

categorical perception may be limited to rapid, dynamic acoustic properties, like the VOT and F2 formant transitions between consonants and vowels, but vowels have steady-state formant patterns that stay the same for what in speech is a long time

71
Q

what is speaker normalization?

A

the listener’s ability to handle/understand the differences among speakers, even when a voice is unlike any they have heard before

72
Q

what are the 3 main ways speaker’s voices differ and which one is the MAIN way?

A
  1. MOST IMPORTANTLY, they differ in formant frequencies
  2. they differ in f0 (higher or lower pitch) depending on the length of their vocal cords
  3. they differ in voice quality, as measured by open quotient or spectral tilt
73
Q

true or false:
only F1 is higher in women than men

A

false - F1 and F2 are higher in women than men

74
Q

men are generally larger than women, and women are larger than children - what does this mean in terms of their voices?

A

men have longer vocal tracts than women, who have longer vocal tracts than children; therefore men have the lowest resonant frequencies, then women, then children - however, that does not mean that all large people have the deepest voices
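
A rough Python sketch of why longer vocal tracts mean lower resonant frequencies, using the textbook quarter-wave idealization of a uniform tube closed at the glottis and open at the lips (the lengths are illustrative, not measurements):

    # resonances of a uniform quarter-wave tube: F_n = (2n - 1) * c / (4 * L)
    def resonances(length_m, c=350.0, n=3):
        return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

    print(resonances(0.175))  # ~17.5 cm tract: [500.0, 1500.0, 2500.0] Hz
    print(resonances(0.145))  # shorter tract: roughly [603, 1810, 3017] Hz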

75
Q

true or false:
the difference between men and women lies mainly in the length of the pharynx

A

true

76
Q

true or false:
Peter Ladefoged has formant values for his vowels that are close to those of SAE but are not the same vowel and are therefore easily confused

A

false - though the formant values are close for one vowel said by him and a different one said by an SAE speaker, they do NOT get confused for each other - the distinction is NOT ONLY in formant values (speaker normalization)

77
Q

what is one of the biggest problems when developing automatic speech recognition software?

A

computers cannot, as easily or as well as humans, perform speaker normalization when encountering a new voice unlike what they’ve heard before

78
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
with the “normal” carrier sentence test word A (F1: 375 Hz) was identified as “bat”

A

false - with the “normal” carrier sentence test word A (375 Hz) was identified as “bit”

79
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
with the “normal” carrier sentence test word B (F1: 450) was identified as “bet”

A

true

80
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
with the “normal” carrier sentence test word C (F1: 575 Hz) was identified as “but”

A

false - with the “normal” carrier sentence test word C (F1: 575 Hz) was identified as “bat”

81
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
with the “normal” carrier sentence test word D (F1: 600 Hz, F2: 1300 Hz) was identified as “bat”

A

false - with the “normal” carrier sentence test word D (F1: 600 Hz, F2: 1300 Hz) was identified as “but”

82
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
when F1 was lowered in the carrier sentence, test word A (375 Hz) started to be identified as “bet”

A

true - in the low F1 context, the value of 375 Hz counted as high in comparison so the vowel was judged to be low

83
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
when F1 was raised in the carrier sentence, test word B (450 Hz) started to be identified as “bat”

A

false - “bet” began to be identified as “bit” because with the context of high F1 values in the carrier, 450 Hz counted as low in comparison so the vowel was judged to be high

84
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
when F1 was raised in the carrier sentence, test word C (575 Hz) started to be identified as “bit”

A

false - “bat” started to be identified as “bet” because compared to the high F1 in the carrier, 575 Hz was not that high so the vowel was judged to be mid rather than low

85
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

true or false:
when F2 was lowered in the carrier sentence, test word D (F1: 600 Hz, F2: 1300 Hz) started to be identified as “but”

A

false - “but” started to be identified as “bat” because compared to the low F2 values, 1300 Hz was not all that low, and was judged to be front

86
Q

Ladefoged and Broadbent 1957
- synthesized 4 syllables differing only in F1 and F2; in isolation the syllables were identified as bit, bet, bat, and but - they also synthesized (via F1 and F2) the syllables of a carrier sentence “Please say what this word is” before the test word, which subjects had to identify.

what was the conclusion found by this study?

A

listeners notice where the formants are in vowels from a new speaker and adapt their model of the vowel space to fit the new voice - their expectations change as they learn where the new speaker’s vowels are, which can happen in a matter of seconds - this is intelligent problem solving, NOT passive perception

87
Q

Mullenix et al. 1989
- one group of listeners identified lists of words in noise produced by a single speaker, while another group heard the same words produced by multiple speakers

true or false:
the group hearing a single speaker in noise identified the words more slowly and less accurately than those hearing multiple speakers

A

false - the group hearing a single speaker in noise was faster and more accurate, because over that short period they were able to learn more about the single voice and improve their processing of that speaker’s speech

88
Q

Nygaard and Pisoni 1998
- subjects listened to samples from 10 different speakers over 10 days and learned the voices well enough to match a new sample to them; they were then presented with words that they had to identify in noise.

true or false:
subjects made fewer errors identifying words in noise if it was produced by one of the voices they were already familiar with

A

true

89
Q

what is priming?

A

previous exposure to one stimulus (the prime) improves processing performance (accuracy and speed) on the task with a later stimulus (the target)

90
Q

Nygaard and Pisoni 1998
- subjects listened to samples from 10 different speakers over 10 days and learned the voices well enough to match a new sample to them; they were then presented with words that they had to identify in noise.

how does priming help explain the results of this study?

A

the priming was greater when the prime and the target were produced by the same voice, which implies that the voice was part of the memory representation for the prime

91
Q

Goldinger 1996, 1998
- exposed subjects to words produced by different speakers in a study session, they were tested in various tasks involving those words in a test session (e.g. have you heard this word before in the study session?)

what were the results of this experiment?

A

the subjects were quicker and more accurate in making this judgement if they heard the word produced by the same speaker who produced it in the study session

92
Q

Goldinger 1996, 1998
- exposed subjects to words produced by different speakers in a study session, they were tested in various tasks involving those words in a test session - they were asked to identify sounds in the word, discriminate sounds in the word, and repeat the word as quickly as possible (shadowing)

what were the results of these tasks?

A

all tasks were performed faster and more accurately if the test word was produced by the same voice that had produced it in the study session

93
Q

Goldinger 1996, 1998
- exposed subjects to words produced by different speakers in a study session, they were tested in various tasks involving those words in a test session - they were asked to identify sounds in the word, discriminate sounds in the word, and repeat the word as quickly as possible (shadowing)

true or false:
this experiment shows that activating both word and voice at the same time is less effective than just activating the word

A

false - it is more effective to activate both the word and the voice because they both activate memory of that word and that voice

94
Q

what is an exemplar?

A

a single remembered instance of a category - in exemplar theory, the memory representation of a category like the word “cat” consists of every instance of that word one has ever encountered, organized by recency, speaker, context, etc.
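
A toy Python sketch of an exemplar model (the features, stored tokens, and similarity weighting are all invented for illustration): a new token is categorized by its summed similarity to every stored exemplar, which is why recent, same-voice exemplars speed processing.

    import math

    # stored exemplars: ((F1, F2) in Hz, word label) - illustrative values
    exemplars = [
        ((500.0, 1500.0), "cat"),
        ((520.0, 1480.0), "cat"),
        ((650.0, 1100.0), "cot"),
    ]

    def similarity(x, y, sensitivity=0.01):
        return math.exp(-sensitivity * math.dist(x, y))

    def categorize(token):
        scores = {}
        for features, word in exemplars:
            scores[word] = scores.get(word, 0.0) + similarity(token, features)
        return max(scores, key=scores.get)

    print(categorize((510.0, 1490.0)))  # "cat"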

95
Q

true or false:
if you heard a word recently from a speaker, you can more quickly process that word in a new instance from the same speaker

A

true - speaker normalization is partially responsible for this

96
Q

what is the perceptual challenge of coarticulation?

A

there is a different version of every phoneme for every preceding sound and for every following sound - the differences can be as large as those between categories

97
Q

what is the main problem with speech recognition programs?

A

it is hard for them to account for coarticulation, so a speech sound or spoken word from one context won’t match a sample of the same sound or word from another context

98
Q

true or false:
to counteract coarticulation problems, listeners remember vowels with the sound that precedes it, not by itself

A

false - they remember the vowel together with the sound that precedes it and the sound that follows it

99
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j

what would be the expectation for the F2 of the vowel between j__j?

A

F2 would be high, like it is in j

100
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j

what would be the expectation for the F2 of the vowel between w__w?

A

F2 of the vowel would be low, like w

101
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ

true or false:
when the vowel had a high F2 it was identified as ʊ no matter the context

A

false - it was identified as I no matter the context

102
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ

true or false:
when the vowel had a low F2 it was identified as ʊ no matter the context

A

true

103
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ

true or false:
when the vowel has intermediate F2 values, it was identified as ʊ more often in the low F2 w__w environment than in isolation

A

false - it was identified as I

104
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ

true or false:
when the vowel has intermediate F2 values, it was identified as I less often in the high F2 j__j environment than in isolation

A

true

105
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ

how did the identification boundary shift for I and ʊ?

A

shifted toward lower values of F2 in the low F2 context and toward higher F2 values in the higher F2 context

106
Q

Lindblom and Studdert-Kennedy 1967
- series of high vowels synthesized, varying just in the freq. of F2, from clear I with high F2 to clear ʊ with low F2 - the vowels were spliced into three environments: isolation, w__w, j__j - 10 speakers of SAE listened to the words and identified them as containing I or ʊ

what is the interpretation (2) of this study’s results?

A

when listeners hear a vowel in a context that raises F2, such as j__j, they know that part of the height of F2 for that vowel is due to context - to compensate for this effect they raise the F2 boundary between I and ʊ which reduces the range of F2 values that are identified as I

when listeners hear a vowel in a context that lowers F2, such as w__w, they know that part of the lowness of F2 for the vowel is due to context - to compensate for this, they lower the F2 boundary between I and ʊ which increases the range of F2 values that are identified as I

107
Q

Mann and Repp 1980
- investigated the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify “sh” or “s”

true or false:
listeners had more “sh” responses with higher center freq. (to the left of the chart) than with lower ones (to the right)

A

false - more “sh” responses with lower center freq. (to the left) than with higher ones (to the right)

108
Q

what is the difference between s and ʃ ?

A

both are sibilants with intense noise extending down from the highest freq., but the center freq. of the noise is lower in ʃ than in s (the noise extends down lower in ʃ)

109
Q

Mann and Repp 1980
- investigated the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify “sh” or “s”

true or false:
listeners had more “sh” responses at the higher center frequencies in the context of ɑ than in the u context

A

true

110
Q

Mann and Repp 1980
- investigated the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify “sh” or “s”

what is the interpretation of the results of this study?

A

listeners take into account the following vowel when identifying a fricative - they know the effects of coarticulation on each sound - because u is rounded, it lengthens the vocal tract and lowers the frequency, so the center frequency of a sibilant is lower if it is followed by u rather than ɑ

111
Q

Mann and Repp 1980
- investigated the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify “sh” or “s”

true or false:
in order to identify a fricative as ʃ, the center frequency has to be lower before u

A

true - because listeners attribute some of the lowness of the center freq. there to the vowel context

112
Q

Mann and Repp 1980
- investigated the effect of a following vowel on the perceptual distinction between s and ʃ - synthesized a continuum of 9 fricatives varying in center frequency from ʃ (1957 Hz) to s (3917 Hz) - each occurred with two different following vowels, ɑ and u - listeners were asked to identify “sh” or “s”

true or false:
knowing that u has a lowering effect on center frequency, listeners adjust the expected center frequency for s upward

A

false - they adjust it downward

113
Q

West 1999
- investigated the coarticulatory effects of the approximants l and r on neighboring vowels - for the phrases “a berry” and “a belly” the coarticulation effects extended so far that all vowels had differences in F1-F3 depending on the medial consonant - the liquid was replaced with noise and listeners were asked to identify which word they heard

what did the study find?

A

the subjects were able to identify the word correctly even when noise covered most of the word, as long as they could still hear the vowel preceding the liquid

114
Q

West 1999
- investigated the coarticulatory effects of the approximants l and r on neighboring vowels - for the phrases “a berry” and “a belly” the coarticulation effects extended so far that all vowels had differences in F1-F3 depending on the medial consonant - the liquid was replaced with noise and listeners were asked to identify which word they heard

what does the knowledge from the results of this study show us?

A

our identification is noise-resistant, since coarticulation spreads out the acoustic evidence a listener can use in identification

115
Q

what is bottom-up processing?

A

figuring out the bigger units on the basis of the smaller units they contain - for example, determining what word is being produced from the sounds it contains

116
Q

what is top-down processing?

A

figuring out the smaller constituent units on the basis of the bigger units that contain them - for example, the identification of speech sounds is informed by our knowledge of words in our language

117
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

what was the measure of this study?

A

the percentage of times that subjects identified a stimulus as voiced (d or g), with 2 factors: the VOT of the initial stop and word status (word-nonword or nonword-word)

118
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

true or false:
the hypothesis of the study was that stops with lower VOT values will more often be identified as voiceless than stops with higher VOT values

A

false - stops with lower VOT values will more often be identified as voiced

119
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

true or false:
the hypothesis of this study is that all else being equal, listeners will tend to give the identification answer that yields a word

A

true

120
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

true or false:
listeners had higher proportion of voiced identification responses when VOT was lower - for BOTH word and nonword conditions

A

true

121
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

the proportion of voiced responses was higher in the word-nonword condition when VOT was lower - why might that be?

A

in that condition a voiced consonant actually made a word, so people used their knowledge of words to identify the consonant

122
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

VOT changes affected voicing identification - is this an example of top-down processing or bottom-up processing?

A

bottom-up processing - the physical properties of an individual sound help identify the word

123
Q

Ganong 1980
- synthesized words varying just in the VOT of an initial alveolar or velar stop - in one class of series (word-nonword), a voiced stop in that position would form a word and a voiceless stop would form a non-word (e.g. “dash” and “tash”) - in the other condition (nonword-word) an initial voiced stop would form a non-word and a voiceless stop would form a word (e.g. “dask” and “task”) - listeners had to identify the initial sound as voiced or voiceless

word status affected the voicing identification - is this an example of top-down processing or bottom-up processing?

A

top-down processing - using knowledge of vocabulary to help decide what sound they heard

124
Q

what are phonotactic restrictions?

A

generalizations about what sequences of sounds can occur in a given position in an utterance - for example, at the beginning of a syllable
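
A minimal Python sketch of applying a phonotactic restriction (the onset list is a tiny, incomplete stand-in for the real generalizations): candidate sequences can be ruled out before the acoustics are even consulted.

    # a few legal English syllable onsets - deliberately incomplete
    LEGAL_ONSETS = {"t", "p", "s", "tr", "pr", "pl", "sl", "st", "str"}

    def possible_onset(cluster):
        return cluster in LEGAL_ONSETS

    print(possible_onset("tr"))  # True
    print(possible_onset("tl"))  # False - no English syllable begins with /tl/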

125
Q

Massaro and Cohen 1983
- used speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard

what was the main goal of this study?

A

it was expected that the lower the F3, the more likely subjects would choose sounds with r rather than l - but would the identification differ depending on the preceding consonant?

126
Q

Massaro and Cohen 1983
- used speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard

true or false:
the highest proportion of r responses was for stimuli beginning with t, which would be incompatible with a following l in English

A

true

127
Q

Massaro and Cohen 1983
- used speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard

true or false:
the lowest proportion of r responses was for stimuli beginning with s, which is incompatible with a following r

A

true

128
Q

Massaro and Cohen 1983
- used speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard

stimuli beginning with p and v drew mid-level proportions of r responses - why is that?

A

p is compatible with either l or r and v is compatible with neither - the effect being tested here does not apply to these

129
Q

Massaro and Cohen 1983
- used speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard

what are the overall results of this study?

A

when F3 was highest, subjects identified the liquid as l regardless of what preceded it; when F3 was lowest, they identified it as r regardless of what preceded it - but across all F3 values they avoided identifying sound sequences that cannot occur in English

130
Q

Massaro and Cohen 1983
- used speech synthesis to create a series of syllables with a liquid preceding the vowel [i] - the syllables only differed in F3 (low F3 = r, high F3 = l) - ree/lee were preceded by a synthesized consonant: p, t, s, or v - subjects were asked to select which of the possible combinations they heard

how is phonotactic knowledge being used here?

A

subjects used their phonotactic knowledge of where English sounds can occur, comparing it with what they heard to judge the likelihood of that sound occurring in that context

131
Q

what is syntax?

A

how words fit together in sentences

132
Q

how do we use syntax in speech processing?

A

we use our knowledge of it to restrict the possible words that could fill a given slot we are trying to identify

133
Q

Miller, Heise, and Lichten 1951
- presented words (real and nonsense) to subjects at varying signal-to-noise ratios and in different contexts; subjects had to identify what was being said

true or false:
the higher the signal to noise ratio, the more accurate the identification was

A

true - quieter noise, louder speech
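
For reference, a Python sketch of the signal-to-noise ratio in decibels (the standard definition; the power values are made up):

    import math

    def snr_db(signal_power, noise_power):
        return 10 * math.log10(signal_power / noise_power)

    print(snr_db(2.0, 1.0))  # ~ +3 dB: speech has twice the noise's power
    print(snr_db(0.5, 1.0))  # ~ -3 dB: noise has twice the speech's power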

134
Q

Miller, Heise, and Lichten 1951
- presented words (real and nonsense) to subjects at varying signal-to-noise ratios and in different contexts; subjects had to identify what was being said

in what context did subjects do better?

A

when real words in sentences were used and even better with digits in a sequence

135
Q

Miller, Heise, and Lichten 1951
- presented words (real and nonsense) to subjects at varying signal-to-noise ratios and in different contexts; subjects had to identify what was being said

why was it significantly harder for subjects to identify nonsense words than actual words in a sentence?

A

for a nonsense word the listener must correctly identify each sound in that word out of an endless possible range - for real words the listener only needs to consider what English word could fit in that context - with digits the pool is even smaller

136
Q

Warren 1970
- in the sentence “the state governors met with their respective legislatures convening in the capital city”, the first s in “legislature” was replaced with a cough or a tone of the same duration - listeners were given the text and asked to circle the sound that had been replaced when they heard it

true or false:
none of the subjects were able to correctly identify the location

A

true

137
Q

Warren 1970
- in the sentence “the state governors met with their respective legislatures convening in the capital city”, the first s in “legislature” was replaced with a cough or a tone of the same duration - listeners were given the text and asked to circle the sound that had been replaced when they heard it

why were none of them able to find it?

A

the context provided so much information that listeners did not have to hear the first s in the word in order to identify it as “legislatures” - they automatically filled in the missing information based on their knowledge of English words

138
Q

Warren 1970
- in the sentence “the state governors met with their respective legislatures convening in the capital city”, the first s in “legislature” was replaced with a cough or a tone of the same duration - listeners were given the text and asked to circle the sound that had been replaced when they heard it

both top-down and bottom-up processing could be used here but which one surpasses the other and causes the phenomenon seen?

A

top-down surpasses - rather than hearing the sounds and determining the word, listeners identified the word from the possibilities based on their knowledge of English

139
Q

true or false:
top-down and bottom-up processing are both useful for different contexts and one usually surpasses the other

A

false - using both at the same time is the most efficient way to process speech rapidly - attack from all angles - and they both can serve as checks for one another in case there is extra noise or unexpected events that make one less effective

140
Q

Sumby and Pollack 1954
- tested whether listeners can identify words better in noise if they can see the speaker’s face than if they can only hear the voice - noise levels were varied over words in both conditions

true or false:
at higher noise levels the accuracy was lower and the longer the word list the lower the accuracy

A

true

141
Q

Sumby and Pollack 1954
- tested whether listeners can identify words better in noise if they can see the speaker’s face than if they can only hear the voice - noise levels were varied over words in both conditions

true or false:
accuracy was greater in the auditory + visual condition, and even more so at higher noise levels

A

true

142
Q

true or false:
visual information can change what sounds we actually hear

A

true

143
Q

McGurk and MacDonald 1976
- recordings were made of the sounds “baba”, “gaga”, “papa”, and “kaka”, and the audio and video tracks were mismatched - the two conditions (video + audio, and audio only) were shown to adults, preschoolers, and primary school kids - subjects were asked to repeat what they heard

true or false:
the video-audio condition had many errors, with subjects resorting to a compromise between what they saw and heard (e.g. v: “gaga”, a: “baba”, they responded with “dada”)

A

true

144
Q

McGurk and MacDonald 1976
- recordings were made of the sounds “baba”, “gaga”, “papa”, and “kaka” and the audios and videos were mismatched - the two conditions were video-audio and audio only shown to adults, preschoolers, and primary school kids - subjects were asked to repeat what they heard

what unexpected result came out of this study?

A

the children were much less susceptible to the McGurk effect and could usually identify the auditory stimulus correctly regardless of the video - though not perfectly (27.5% and 46.5%)

145
Q

McGurk and MacDonald 1976
- recordings were made of the sounds “baba”, “gaga”, “papa”, and “kaka” and the audios and videos were mismatched - the two conditions were video-audio and audio only shown to adults, preschoolers, and primary school kids - subjects were asked to repeat what they heard

at what age did the study find people have fully learned to use visual information?

A

after age 8, relatively late

146
Q

the McGurk effect is very robust - what are some examples?

A

it held true even when:
  • the subjects were informed of how the stimuli were constructed
  • the audio and video were out of sync by as much as 180 ms
  • the audio and video were of different genders
  • the audio and video came from locations up to 90 degrees apart relative to the listener
  • the video was reduced to a set of light points corresponding to face locations

147
Q

response to integrated audiovisual information is particularly strong where?

A

superior temporal sulcus - below the primary speech processing regions

148
Q

true or false:
reflexive phonation occurs from 0-2 months and is coughing, sneezing, and crying

A

true

149
Q

what is an infant’s vocal tract like?

A

like a chimp’s - the tongue takes up much of the space in the mouth (more than in adulthood) and the larynx is high enough that there is no appreciable pharynx

150
Q

at what point in development does a human’s vocal tract develop to the adult form?

A

the first year

151
Q

true or false:
cooing occurs from 1-4 months and is quasivocalic sounds

A

true

152
Q

true or false:
expansion occurs from 3-8 months and is clear vowels, yells, screams, whispers, and raspberries

A

true

153
Q

true or false:
canonical babbling is rhythmically organized, meaningless sequences of speech sounds

A

false - it is strings of alternating consonants and vowels like “bababa” or “mamama”

154
Q

when does canonical babbling occur in child development?

A

5-10 months

155
Q

true or false:
early babbling sounds different depending on the language environment

A

false - it sounds the same no matter the language

156
Q

true or false:
towards the beginning of the babbling process, adults can tell if the babbling is of their language or not

A

false - they can tell towards the end of the process - their production is gradually tuned to match the language environment

157
Q

which portion of babbling resembles the intonation and rhythm of the ambient language?

A

late babbling

158
Q

at what point in development is the typical onset of meaningful speech?

A

10 months

159
Q

true or false:
children’s production abilities are always considerably ahead of their perceptual abilities

A

false - their perception is always ahead of production

160
Q

true or false:
fetuses in utero have higher heart rates when listening to a recording of their mother’s voice than to any other person’s voice

A

true

161
Q

true or false:
the speech children produce is representative of what they know about their language

A

false - it is never fully representative

162
Q

true or false:
children can distinguish sounds they can not produce themselves

A

true

163
Q

Werker and Tees 1984
EXPERIMENT 1
- recordings of English speakers’ “da” and “ba”, Thompson Salish speakers’ “k’i” and “q’i”, and Hindi speakers’ “ta” and “ʈa” - 6-month-olds, English-speaking adults, and Thompson Salish-speaking adults asked to identify whether they heard “k’i” or “q’i” (button press or head turn) - criterion response is 8/10 correct

true or false:
infants from an English speaking environment were much better at distinguishing the sounds than English speaking adults and were almost as good as the Thompson Salish adults

A

true - infants > adults

164
Q

Werker and Tees 1984
EXPERIMENT 1
- recordings of English speakers’ “da” and “ba”, Thompson Salish speakers’ “k’i” and “q’i”, and Hindi speakers’ “ta” and “ʈa” - infants at 6-8, 8-10, and 10-12 months head-turn tested - criterion response is 8/10 correct

what were the results of this study?

A

the youngest group vastly outperformed the others - most in the oldest group couldn’t even reach the criterion

165
Q

Werker and Tees 1984
EXPERIMENT 1
- recordings of English speakers’ “da” and “ba”, Thompson Salish speakers’ “k’i” and “q’i”, and Hindi speakers’ “ta” and “ʈa” - infants at 6-8, 8-10, and 10-12 months head-turn tested - criterion response is 8/10 correct

what type of experiment is this?

A

cross-sectional study - different ages

166
Q

Werker and Tees 1984
EXPERIMENT 3
- recordings of English speakers’ “da” and “ba”, Thompson Salish speakers’ “k’i” and “q’i”, and Hindi speakers’ “ta” and “ʈa” - the same group of children head-turn tested repeatedly over time - criterion response is 8/10 correct

at what point in their development were the children best at discriminating the sounds?

at what point did they lose all ability to do so?

A

6-8 months

a year

167
Q

as children focus on a particular language to master in their environment they lose something else - what is it?

A

the ability to distinguish sounds they aren’t exposed to regularly - they go from versatile generalists -> specialists

168
Q

Kuhl, Tsao and Liu 2003
- 32 infants with an average age of 9.3 months, all from English-language environments with no Mandarin exposure - 16 were exposed to Mandarin, the other 16 to English - tested on whether they could differentiate between 2 Mandarin sounds

true or false:
children from a Mandarin-speaking environment scored higher on the differentiation task than the English-environment group exposed to Mandarin

A

false - they scored about the same

169
Q

Kuhl, Tsao and Liu 2003
- 32 infants with an average age of 9.3 months, all from English-language environments with no Mandarin exposure - 16 were exposed to Mandarin interaction, the other 16 to English interaction - tested on whether they could differentiate between 2 Mandarin sounds

true or false:
the English kids who were exposed to Mandarin scored an average of 65.7% while the kids exposed to English scored 56.7%

A

true

170
Q

Kuhl, Tsao and Liu 2003
EXPERIMENT 2
- 32 infants with an average age of 9.3 months, all from English-language environments with no Mandarin exposure - 16 were exposed to a Mandarin video, the other 16 to an English video - tested on whether they could differentiate between 2 Mandarin sounds

true or false:
the kids exposed to Mandarin did not gain any advantage on the test because they were not being interacted with - they were just listening to a video

A

true

171
Q

true or false:
phonetic learning is socially driven

A

true - interaction required

172
Q

Kuhl, Tsao and Liu 2003
- 32 infants with an average age of 9.3 months, all from English-language environments with no Mandarin exposure - 16 were exposed to Mandarin, the other 16 to English - tested on whether they could differentiate between 2 Mandarin sounds

what was the conclusion of this study?

A

even limited exposure to another language can improve discrimination of sounds in that language

173
Q

true or false:
liquids are mastered late

A

true

174
Q

true or false:
trills are mastered early

A

false - late

175
Q

true or false:
fricatives are mastered early

A

false - late

176
Q

true or false:
stops (nasal and oral) are mastered early

A

true

177
Q

true or false:
vowels are mastered early

A

true

178
Q

true or false:
alternating consonant-vowel (CV) sequences are mastered early

A

true

179
Q

true or false:
when children can’t pronounce words while learning to speak they just ignore/skip over the word

A

false - they systematically replace sounds they can’t produce with ones they can (patterns of replacement)

180
Q

what are some of the common replacement patterns children use? (5)

A
  • replacing non-stops with stops (John -> don)
  • cluster simplification (spoon -> boon)
  • replacing consonants so they are harmonious in place of articulation (sock -> gock)
  • changing voicing patterns (initial stop always voiced, final always voiceless)
  • final consonants are deleted

181
Q

at what point in development is there usually an explosion of vocabulary for children?

A

18 months

182
Q

true or false:
children take a while to speak because they are held back by their inability to distinguish adult sounds

A

false

183
Q

true or false:
children learn to speak late because the relevant muscles of their vocal tracts are not yet strong enough (will be after about a year)

A

true - their challenge is in coordination and control of speech gestures

184
Q

why are stops easy for children to master?

A

they require little motor control, the tongue or lip just has to move to touch an opposing surface

185
Q

why are vowels easy for children to master?

A

they require little motor control, and each vowel has a relatively wide range of acceptable vocal tract shapes

186
Q

why are consonant-vowel alternations easy for children to master?

A

they are just a repeated sequence of opening and closing gestures

187
Q

why are fricatives and approximants harder for children to master?

A

the tongue or lip has to be very precisely positioned to form a passageway narrow enough for turbulence but not too narrow

188
Q

why are the sounds l and r particularly hard for children to master?

A

they require precision AND require different parts of the tongue to make separate closures - at first children only use the tongue as one mass

189
Q

true or false:
the sounds of one’s language aren’t actually simpler to produce or distinguish, the native speaker is just more used to them

A

true

190
Q

true or false:
it wasn’t until the 1980’s that psychologists and linguists started doing systematic acoustic studies of early speech

A

false - the 1970’s

191
Q

true or false:
new speech skills are mastered by kids instantaneously

A

false - it was long believed to be the case only because children’s gradual learning steps are too small for adults to hear

192
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

what are the expected VOT results for adults to produce?

A

word-initial voiced stops have a small positive VOT, voiceless stops are aspirated with a large positive VOT

193
Q

what is VOT?

A

the time interval from the release of a stop to the onset of voicing
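
written as a quick formula (notation mine, inferred from the definition above, not from the cards):

    \mathrm{VOT} = t_{\text{voicing onset}} - t_{\text{stop release}}

so voicing that starts before the release gives a negative VOT, voicing at or just after the release gives a small positive VOT (voiceless unaspirated), and a long lag gives the large positive VOT of aspirated stops - the scale the Macken and Barton cards rely on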

194
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

true or false:
in early sessions kids generally had negative VOT for both voiced and voiceless stops

A

false - they had a small positive VOT for both voiced and voiceless - effectively voiceless unaspirated stops, often misheard by transcribers as voiced

195
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

true or false:
over the course of the study, the difference in VOT between voiced and voiceless grew, mainly through an increase in the VOT of the voiced class

A

false - an increase in VOT of the voiceless class

196
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

why were the changes in kids’ VOT previously seen as instantaneous?

A

the difference was too small for adults to hear

197
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

at what point could the adults tell the difference between the children’s productions?

A

when their VOT caught up with the adults’ average VOT for that sound

198
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

true or false:
this was one of the first studies to show the gradualness of acquiring phonetic mastery

A

true

199
Q

Macken and Barton 1980
- 4 children were followed for 8 months starting at age 1.5, recorded every 2 weeks while playing and answering questions - word-initial obstruent stops (b, d, g, p, t, k) were extracted and their VOT was measured

at what age did they find that the difference in voiced and voiceless stops is acquired?

A

around 2 years

200
Q

by what age have children generally mastered almost all the sounds of their language?

A

5 years old

201
Q

true or false:
children acquire language with instruction

A

false - no instruction

202
Q

true or false:
the earlier one is exposed to a language, the more likely it is they will attain a native level knowledge of it

A

true

203
Q

what makes up a foreign accent?

A

a pattern of pronunciation of a language by someone who applies the habits of their L1 to the speaking of L2

204
Q

what is an early bilingual?

A

someone who learned both languages early enough to be a native speaker of both

205
Q

what is a late bilingual?

A

someone who acquired more than one language late enough not to be a native speaker of it/them

206
Q

by what age can bilinguals be exposed to an L2 and speak it without a detectable foreign accent?

A

age 6

207
Q

true or false:
with first exposure to a language after 12, the speaker will generally have a foreign accent in the L2

A

false - age 13

208
Q

true or false:
learners can shed their L1 habits in their L2 even when they are regularly exposed to L1

A

false - they are less likely to shake those habits if they have regular exposure to L1

209
Q

what is being referenced when saying the “age of first exposure” to a language?

A

when the person moved to where the L2 is spoken, NOT when they started classes

210
Q

true or false:
after age 6 it is not possible to become fluent in another language

A

false - it is likely you will have an accent but fluency is possible with practice

211
Q

what is code-switching?

A

when a bilingual switches from one language to another under the control of the speaker

212
Q

true or false:
when a fluent bilingual code-switches, the languages become intertwined, with rules and patterns often shifting to the other language

A

false - the languages are autonomous and separable

213
Q

what is cross-linguistic priming?

A

exposure to an item in one language facilitates processing of a related item in another language

214
Q

Kim et al. 1997
- two groups of speakers, early and late bilinguals, both asked to silently describe typical events in their lives in L1 and L2 while fMRI monitored brain activity

true or false:
there was greater overlap in the areas of activity in early bilinguals than late

A

true

215
Q

Kim et al. 1997
- two groups of speakers, early and late bilinguals, both asked to silently describe typical events in their lives in L1 and L2 while fMRI monitored brain activity

what are the results found in reference to the prime areas of speech processing?

A

those areas are occupied early in life and are not available for languages learned later - late L2 acquirers have to use brain areas away from the L1 centers (on the scans, the early bilinguals’ two colors overlapped greatly, while the late bilinguals’ colors were completely separate and next to each other)

216
Q

what is the transfer effect?

A

when the deeply entrenched set of automatic habits for L1 is applied to L2

217
Q

true or false:
unfamiliar sounds in L2 are replaced with familiar sounds of L1

A

true - systematically replaced

218
Q

most dialects in Spanish don’t have a distinction between what?

A

tense and lax vowels

219
Q

when a Spanish speaker is speaking English, one replacement strategy might be to switch which vowels for which?

A

the lax vowels of English with the tense vowels of Spanish

220
Q

Spanish speakers of English are likely to replace English diphthongs with what?

A

their closest equivalent monophthongs, [e] and [o]

221
Q

the most common English vowel [ə] is often replaced by Spanish speakers with what?

A

[a] - the only central vowel in Spanish

222
Q

in Spanish voiceless plosives p, t, and k are what?

A

unaspirated in all positions

223
Q

when a Spanish speaker speaks English they will generally replace the voiceless plosives at the syllable-initial position with what?

A

unaspirated plosives

224
Q

Spanish speakers tend to replace the English r with what?

A

[ɾ] the alveolar tap

225
Q

Spanish speakers tend to replace English voiced fricatives with what?

A

voiceless Spanish ones

226
Q

true or false:
Spanish speakers often do consonant deletion or vowel insertion in English words with three consonants at the onset

A

true - Spanish can only have at most 2 consonants there

227
Q

true or false:
Spanish speakers often do consonant deletion when speaking English words that have 3 consonants in the final position

A

true - Spanish only allows for at most 2 consonants there

228
Q

true or false:
everyone starts replacing L2 sounds with L1 equivalents

A

true

229
Q

why do different speakers of an L2 have different accents?

A

they vary in where they are in the learning process and at what age they began

230
Q

Flege 1991
- compared VOT in word-initial /t/ among Spanish speakers, Spanish learners of English, and English monolinguals

true or false:
when speaking Spanish, Spanish monolinguals, early bilinguals, and late bilinguals all had VOT values in the same range

A

true

231
Q

Flege 1991
- compared VOT in word-initial /t/ among Spanish speakers, Spanish learners of English, and English monolinguals

true or false:
late bilinguals (who learned English after age 6) produced /t/ with a VOT in between the Spanish and English values

A

true

232
Q

Flege 1991
- compared VOT in word-initial /t/ among Spanish speakers, Spanish learners of English, and English monolinguals

true or false:
when speaking English, early Spanish-English bilinguals had the same VOT values as English monolinguals

A

true

233
Q

if a German speaker speaks English, what replacement would they make at the ends of words due to their L1 being German?

A

German has no voiced final obstruents so they would replace the voiced final obstruents of English with their voiceless counterparts (ex: Bob -> Bop)

234
Q

German has no dental fricatives, so German speakers replace the English ones with what?

A

stops or affricates

235
Q

what is the main factor of a foreign accent?

A

the sounds in L2 that have no counterpart in L1 will tend to be replaced by the closest sound in L1

236
Q

besides replacing sounds with similar ones, L2 speakers often will replace what?

A

they replace any sound that occurs in both L1 and L2 when it appears in a position where it couldn’t occur in L1

237
Q

why is a foreign accent so persistent?

A

over our lifetimes we learn production and perception processes that become automatic, which lets us speak quickly and keep up - mastering these skills becomes a liability when learning another language precisely because they are so automatic and ingrained

238
Q

true or false:
consciously realizing that a similar sound in L1 and L2 is actually different can change the unconscious automatic process of producing and perceiving it

A

false - it does NOT change the unconscious process

239
Q

why do early bilinguals not have a foreign accent?

A

they are exposed to both languages early enough to build separate categories for each and have no trouble keeping them separate

240
Q

true or false:
the hardest sounds for L2 learners in the long run are the new sounds unlike anything they’ve heard before

A

false - the hardest are the “false friends” that are close to those in L1 but not the same

241
Q

why are the most similar sounds hardest to master?

A

because they are similar enough to sounds we already know that we subconsciously treat it as fine to just replace them with the L1 versions

242
Q

Flege and Hillenbrand 1984
- speakers of varying French knowledge produced French sentences including the words “tous” and “tu” - [y] in “tu” has no English counterpart but [u] in “tous” is close - native French listeners had to identify which word they heard

true or false:
for the group with the least French experience, their “tu” was much more easily identified by the French listeners than their “tous”, meaning they pronounced it better

A

true

243
Q

Flege and Hillenbrand 1984
- speakers of varying French knowledge produced French sentences including the words “tous” and “tu” - [y] in “tu” has no English counterpart but [u] in “tous” is close - native French listeners had to identify which word they heard

true or false:
there was no significant difference in the ability of the French speakers to identify the “tu” of the most experienced French speakers and the least experienced

A

true - they were equally good at pronouncing the newer/weirder sound

244
Q

Flege and Hillenbrand 1984
- speakers of varying French knowledge produced French sentences including the words “tous” and “tu” - [y] in “tu” has no English counterpart but [u] in “tous” is close - native French listeners had to identify which word they heard

true or false:
the results were that the non-native speakers got closer in F2 values to the native French for the familiar [u] than for the new [y]

A

false - they were closer for the newer [y]

245
Q

what is speech technology?

A

any interface between humans and computers involving speech

246
Q

what is speech recognition?

A

automatic identification of spoken words

247
Q

what is speaker recognition?

A

automatic identification of the person who spoke

248
Q

what is speech synthesis?

A

the production of speech by machines

249
Q

why do companies want to use more speech recognition?

A

the more they can automate customer service and sales, the fewer human employees they need to pay

250
Q

true or false:
humans are more comfortable typing than speaking so they generally prefer a typing interface to a speech one

A

false - they prefer speaking to typing

251
Q

why does the government want to invest in speech recognition?

A

they want an automatic method for filtering recorded speech to locate particular references or voices

252
Q

how does speech recognition work?

A

digitized recordings (numerical versions of spectrograms) of speech samples are stored in memory, each labeled to identify what was said - when a new word is spoken it is digitized too and compared point by point to every sample in memory - the program selects the soundfile in memory with the smallest summed difference from the new soundfile
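
a minimal sketch of this point-by-point comparison, assuming the new word and every stored sample are already quantized spectrograms of the same shape (names and structure here are illustrative, not from any real recognizer):

    import numpy as np

    def recognize(new_sample, templates):
        # templates: dict mapping a word label to a stored quantized
        # spectrogram (rows = time points, columns = frequency bands)
        best_label, best_total = None, float("inf")
        for label, stored in templates.items():
            # point-by-point comparison: sum of absolute differences
            total = np.abs(new_sample - stored).sum()
            if total < best_total:
                best_label, best_total = label, total
        # the label whose soundfile has the smallest summed difference wins
        return best_label

the same-shape assumption is exactly what fails in practice, which is what the alignment and time-warping cards below address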

253
Q

in a quantized spectrogram, what do the lighter vs. darker colors represent?

A

the lighter are lower amplitude, the darker are higher amplitude

254
Q

what is the challenge of alignment in speech recognition?

A

it’s hard to know which points in the two soundfiles should be compared, because even two productions of the same word by the same speaker won’t sound the same or take the same amount of time

255
Q

how is the challenge of alignment between two files solved?

A

expanding or contracting the timescale of one file to find the best match (time warping)
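
the cards don’t name an algorithm, but the standard way to do this is dynamic time warping (DTW) - a minimal sketch under the same spectrogram assumptions as the sketch above:

    import numpy as np

    def dtw_distance(a, b):
        # a, b: quantized spectrograms (rows = time points, columns = bands);
        # cost[i, j] = smallest summed difference aligning a[:i] with b[:j]
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.abs(a[i - 1] - b[j - 1]).sum()  # frame-to-frame difference
                cost[i, j] = d + min(cost[i - 1, j - 1],  # advance both files
                                     cost[i - 1, j],      # stretch b's timescale
                                     cost[i, j - 1])      # stretch a's timescale
        return cost[n, m]

swapping this in for the raw point-by-point sum lets two productions of different lengths still be compared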

256
Q

what is the challenge of segmentation for speech recognition?

A

people don’t pause between words, so it’s not clear what interval in the soundfile should be matched against the files in memory

257
Q

how is the challenge of segmentation solved in speech recognition?

A

designers push for one-word answers, or the system has to try different segmentations and check which gives the best fit
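
a toy sketch of the try-every-segmentation idea, assuming a hypothetical score(chunk) helper that returns the best template distance for one stretch of frames (naive recursion for clarity - a real system would memoize):

    def best_segmentation(frames, score, min_len=5):
        # option 1: treat the whole stretch as a single word
        best_cost, best_split = score(frames), [frames]
        # option 2: try every cut point and keep whichever fits best
        for cut in range(min_len, len(frames) - min_len + 1):
            tail_cost, tail_split = best_segmentation(frames[cut:], score, min_len)
            cost = score(frames[:cut]) + tail_cost
            if cost < best_cost:
                best_cost, best_split = cost, [frames[:cut]] + tail_split
        return best_cost, best_split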

258
Q

what is the problem of vocabulary size for speech recognition?

A

the larger the vocabulary the system has to keep in memory, the more words it has to search through - recognition takes longer and incorrect responses become more likely

259
Q

what is the challenge of variability in speech recognition?

A

any given word is pronounced differently by different speakers - early dictation programs were speaker-dependent; to be speaker-independent, a system must store samples from a huge variety of speakers

260
Q

in order to be effective a speech recognition program needs to be what?

A

adaptive

261
Q

do successful speech recognition programs use top-down or bottom-up processing?

A

top-down processing

262
Q

true or false:
the challenges for speech tech are the same challenges faced by human listeners

A

true - segmentation, variability due to speaker, variability due to context, and speech errors

263
Q

true or false:
a program can distinguish between two voices that it’s never heard before

A

false - it can’t

264
Q

how does speech synthesis work?

A

from a digital recording, sound is converted into a series of numbers representing the amplitude at each instant in each frequency band
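
a minimal sketch of producing that series of numbers from a recording via a short-time Fourier transform (frame and hop sizes are arbitrary illustrative choices, not from the cards):

    import numpy as np

    def band_amplitudes(samples, frame_len=512, hop=256):
        # slide a window along the recording; each windowed frame is one
        # "instant", and the FFT gives the amplitude in each frequency band
        window = np.hanning(frame_len)
        frames = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            spectrum = np.fft.rfft(samples[start:start + frame_len] * window)
            frames.append(np.abs(spectrum))
        return np.array(frames)  # shape: (instants, frequency bands)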

265
Q

why do speech syntheses not sound like humans?

A

they typically get the intonation wrong, and do not accurately mimic the effects of coarticulation