Speech Perception & Comprehension Flashcards
SPEECH = VARIABLE
- every word takes a dif acoustic shape each time it’s uttered, due to:
1) speaker (vocal tract size/regional accent/socio-economic status)
2) articulation rate (4-5 syllables/sec in sentences)
3) prosody (music of speech ie. rhythm/melody/amplitude)
4) mode (voiced/whispered/creaky)
5) coarticulation (individual phonemes influenced by preceding/upcoming segments ie. regressive/progressive assimilation)
VISUALISING SOUND
- 2 main ways:
1) WAVEFORM - y-axis represents amplitude (w/ 0 on horizontal midline); x-axis represents time
2) SPECTROGRAM - derived from (short-time) Fourier transform; x-axis represents time
- y-axis = frequency
- colour = 3rd dimension = energy (ie. amplitude; brighter = stronger)
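- a minimal Python sketch of both plots (assuming numpy/scipy/matplotlib; synthetic tone complex stands in for real speech):
```python
# Sketch: waveform + spectrogram of a synthetic "vowel-like" signal.
# All values here are illustrative, not real speech measurements.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

fs = 16000                                # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)             # 500 ms of signal
# crude vowel: 120 Hz fundamental + two "formant-like" components
x = (np.sin(2 * np.pi * 120 * t)
     + 0.5 * np.sin(2 * np.pi * 700 * t)
     + 0.3 * np.sin(2 * np.pi * 1200 * t))

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)

# 1) WAVEFORM: x-axis = time, y-axis = amplitude around 0
ax1.plot(t, x)
ax1.set_ylabel("amplitude")

# 2) SPECTROGRAM: short-time Fourier transform;
#    x-axis = time, y-axis = frequency, colour = energy
f, tt, Sxx = spectrogram(x, fs=fs, nperseg=512)
ax2.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # dB scale
ax2.set_ylabel("frequency (Hz)")
ax2.set_xlabel("time (s)")
plt.show()
```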
SPEECH = QUASI-CONTINUOUS
- no unique/systematic way to flag word boundaries aka. rarely silence between 2 words
- short silences (~100ms) typically correspond to vocal tract closing to produce so-called plosive/STOP consonant (ie. before /k/ in “pocket”)
SPEECH = LEXICALLY AMBIGUOUS
- words = made of limited number of sounds/syllables aka. embedded words = everywhere inside other words
- ie. captain -> cap
- ambiguity also arises due to straddling words as soon as we put 2 words together
- ie. clean ocean -> notion
SPEECH = AUDIOVISUAL
- visual info given by lips/adjacent facial areas about articulation = integral to speech perception when available
MCGURK & MACDONALD’S ILLUSION (1976)
- visual signal should be weakly constraining for it to work aka. visual /ga/ = more ambiguous than visual /ba/
- /ga/ = can’t actually see speaker making closure at back of mouth (tongue back against velum)
- so visual cues = also compatible w/ /da/
- visual /ba/ = unambiguous as you see lips closing preventing illusion from occurring
- visual signal must be compatible w/both back/medial closure of vocal tract (/ga/ VS /da/); conflict w/front closure implied by auditory /ba/ attracts perception towards mid-point between front/back of mouth (/da/)
FUSION - /ga/ (vision) + /ba/ (audition) = /da/ (perception)
INFO FOR IDENTIFYING WORDS
PHONEMES
SUPRA-PHONEMIC INFO
PHONEMES
- building blocks of vocab
- smallest units in signal allowing meaning distinction (ie. bat/mat have 3 phonemes & differ by 1st one)
- limited number so words are created by combining them in unlimited ways specific to language
- English = 20 vowels & 24 consonants
SUPRA-PHONEMIC INFO
- prosody/music of speech (ie. rhythm/melody/energy) ie:
1) lexical stress/accentuation (ADmiral/admiRAtion)
2) tones (same string of phonemes can have dif meanings depending on pitch contour in some languages ie. “ma” in Mandarin (horse/mother/scold))
SUPRA-PHONEMIC INFO: DAHAN ET AL. (2001)
- carried by larger chunks > phonemes ie. syllables
- languages vary in terms of importance of supra-phonemic info for recognising words (ie. French < English < Mandarin)
- phonemic/prosodic info is needed for lexical distinctions BUT word recognition = also sensitive to subtle articulatory details ie. co-articulation cues
- the way in which vowel is pronounced/sounds depends on identity of following consonant
SPEECH = MENTAL CATEGORIES
- when presented w/exemplars along continuum of syllables between 2 end-points (ie. gi-ki) we perceive 1 whole section of continuum as 1 category (ie. gi) & the other section as a separate category (ie. ki) despite physical changes within each category
- aka. step-like shift indicating category boundary at some point in continuum
- we experience stimulus as either 1 or other BUT not as in-between aka. categorical perception
- most obvious in consonants (ie. rapid acoustic changes) > vowels/tonal info (steadier/continuous)
CATEGORICAL PERCEPTION IN DISCRIMINATION TASKS
- can also occur in discrimination tasks
- hearing dif between 2 adjacent exemplars in continuum is maximal at category boundary (ie. across categories) BUT at chance within category
- category boundary lies at roughly same location on continuum for all speakers of given language
CATEGORICAL PERCEPTION IN CONSONANT CONTRASTS
- cannot be easily demonstrated on all contrasts as you need to identify key parameters involved in contrast & latter must be easily manipulated
- ie. voicing distinction (pa/ba; ga/ka) = regulated by 1 acoustical parameter aka. Voice Onset Time (VOT) corresponding to noisy segment from consonant release burst up to start of periodicity (voicing) in vowel
- aka. voiced consonants (b/d/g) = shorter VOT > voiceless counterparts (p/t/k) in English
VOICE ONSET TIME (VOT)
- can be manipulated to create continuum from voiced consonant to voiceless counterpart (ie. gi VS ki) & see if perception follows progression along continuum linearly VS showing mental categories
- pps asked if 2 stimuli adjacent on continuum = same/dif acoustically -> maximal discrimination occurs at perceptual boundary & would be at chance for all other adjacent comparisons
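- a toy Python sketch of the predicted pattern (boundary location/slope = invented values, not empirical data): identification = step-like, discrimination of adjacent pairs peaks at boundary:
```python
# Sketch: step-like identification along a VOT continuum (gi -> ki).
# The 30 ms boundary and slope are illustrative, not measured values.
import numpy as np

vot = np.linspace(0, 60, 13)                  # VOT continuum in ms
boundary, slope = 30.0, 0.4                   # hypothetical category boundary
p_voiceless = 1 / (1 + np.exp(-slope * (vot - boundary)))  # P("ki") response

for v, p in zip(vot, p_voiceless):
    label = "ki" if p > 0.5 else "gi"
    print(f"VOT {v:4.1f} ms -> P(ki) = {p:.2f}  heard as /{label}/")

# Discrimination of adjacent pairs: predicted by difference in labelling,
# so it peaks across the boundary and sits near chance within a category.
discrim = np.abs(np.diff(p_voiceless))
print("best-discriminated adjacent pair index:", int(np.argmax(discrim)))
```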
WERKER & TEES (1984)
- examined ability of English infants to discriminate non-native (ie. Hindi/Salish) contrasts during 1st year of life
- cross-sectional/longitudinal approaches using conditioned head-turn paradigm
- newborns = equipped from birth to deal w/any possible phonetic contrast
- ability to discriminate non-native contrasts disappears w/exposure to language BUT native contrasts = maintained
- aka. infants transform language-general phonetic skills -> language-specific phonological abilities via “winnowing” (aka. narrowing down) initial set of “innate” discrimination abilities
SPEECH -> WORD MAPPING
DIRECTIONALITY OF LEXICAL ACCESS
ACTIVE COMPETITION BETWEEN WORDS
INTERACTIVITY FROM MEMORY -> PERCEPTION (& VICE VERSA)
DIRECTIONALITY OF LEXICAL ACCESS
- auditory memories for words “open up” only if initial sound = perceived
- left -> right processing aka. first few sounds carry most of info weight (word endings = easily “guessed”)
- contrasts w/parallel processing in visual/orthographic modality
- importance of Uniqueness Point
ALLOPENNA ET AL. (1998)
- used eye fixations to determine which words were being evoked by incoming speech (“and now click on the beaker”)
- more fixations on onset-overlapping words (ie. beetle) > rhyme-overlapping words (ie. speaker)
- so word memories = more easily evoked by initial sounds > endings (ie. directionality)
MARSLEN-WILSON & WELSH (1978)
THE COHORT MODEL
STEP 1) ACTIVATION
- first sound of word activates all words in memory beginning w/said sound aka. cohort
STEP 2) DE-ACTIVATION
- words no longer match signal as it unfolds = progressively rejected from cohort
- initial cohort gets smaller as more info arrives
STEP 3) UNIQUENESS POINT
- point at which only 1 candidate remains in cohort aka. word identification just occurred
- info afterwards barely has role in word recognition; system already committed itself
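- a toy Python sketch of the 3 steps (letters stand in for phonemes; the 5-word lexicon = invented):
```python
# Sketch of the Cohort model's three steps over a toy lexicon.
# Letters stand in for phonemes; real models use phonetic transcriptions.
LEXICON = ["captain", "capital", "cap", "candle", "dog"]

def cohort_trace(word):
    cohort = list(LEXICON)
    for i, sound in enumerate(word, start=1):
        prefix = word[:i]
        # Steps 1/2: only candidates still matching the unfolding signal
        # stay active; mismatching words drop out of the cohort.
        cohort = [w for w in cohort if w.startswith(prefix)]
        print(f"after '{prefix}': cohort = {cohort}")
        # Step 3: uniqueness point = first sound at which exactly one
        # candidate remains (short words embedded in longer ones, like
        # 'cap' inside 'captain', may never reach it before offset).
        if len(cohort) == 1 and cohort[0] == word:
            print(f"uniqueness point reached at sound {i}")
            return

cohort_trace("captain")
```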
COMPETITION BETWEEN LEXICAL CANDIDATES
- each phoneme in input can logically only belong to 1 word at a time
- so sound-overlapping memories (ie. succeed/seed) = necessarily enemies
INTERACTIVITY
- 2 processes (A/B) said to interact if B receives A’s output as input & is able to send result of its computations back to A before A = completed
- similar to “Larsen” acoustical effect (feedback loop between microphone & loudspeaker)
- 2 conditions must be met:
1) info must be allowed to travel both ways
2) info should be passed to next lvl before current lvl finishes own computation (ie. cascading NOT seriality)
THE GANONG EFFECT (1980)
- following lexical context dictating how preceding ambiguous phoneme should be heard (ie. ambiguous /g/-/k/ + “ift” -> heard as “gift”) = compatible w/notion that perception is not an autonomous process but 1 that long-term lexical memory influences
- Ganong effect motivated inclusion of top-down connections in TRACE (McClelland & Elman (1986))
- BUT could be that Ganong doesn’t reflect influence of memory on perception per se but simply combination of perception + memory influencing conscious decision one needs to make to complete labelling task
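- a toy sketch of the 2 interactivity conditions in a TRACE-like network (units/weights/rates = invented for illustration, NOT McClelland & Elman’s actual parameters):
```python
# Toy TRACE-like sketch: an ambiguous /g/-/k/ phoneme followed by "...ift".
# Bottom-up evidence is ambiguous; top-down feedback from the word layer
# ("gift" is a word, "kift" is not) pulls perception towards /g/.
# All weights, rates and unit sets are invented for illustration.
phoneme_act = {"g": 0.5, "k": 0.5}      # ambiguous bottom-up input
word_act = {"gift": 0.0}                # only "gift" exists in the lexicon

for step in range(10):
    # cascading: partial phoneme activation feeds the word layer immediately
    word_act["gift"] += 0.1 * phoneme_act["g"]
    # feedback: the active word sends support back down to its own phoneme
    phoneme_act["g"] += 0.1 * word_act["gift"]
    # lateral inhibition keeps the two phoneme interpretations in competition
    phoneme_act["k"] = max(phoneme_act["k"] - 0.05 * phoneme_act["g"], 0.0)

print(phoneme_act)   # /g/ ends up well above /k/: the Ganong effect
```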
SPEECH STREAM -> WORD SEGMENTATION
LEXICAL SOLUTIONS
PRELEXICAL CUES
LEXICAL SOLUTIONS
1) word offset anticipation
2) lateral inhibition between word memories
- solution proposed in Cohort model
- segmentation = by-product of word recognition
- lexical boundaries perceived as consequence of recognising words in speech
LS: WORD OFFSET ANTICIPATION
- useful for words that reach their uniqueness point at last sound (ie. catheDRAL in “cathedral renovated”)
- BUT many cases of short words embedded at start of longer words aka. clearly not ideal (ie. CAT at onset of CATerpillar; offset of “cat” can’t be anticipated from its own sounds)
LS: LATERAL INHIBITION BETWEEN WORD MEMORIES
- if overlapping words = enemies & adjacent words = friends -> segmentation outcome = optimal:
1) selected words don’t overlap by any of their sounds (aka. 1-to-1 mapping only)
2) no phonetic segment is left unaccounted for (aka. exhaustivity) - “shipinquiry” = ONLY “ship” + “inquiry” despite other words fully compatible w/portions of signal (ie. in/ink/shipping)
- BUT ship/inquiry = only 2 words w/which conditions 1 & 2 are met
- some memories could have activation lvl pushed below baseline via lateral inhibition from overlapping competitors; contrasts w/machines mis-parsing “recognise speech” (ie. as “wreck a nice beach”)
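- a Python sketch of the optimal segmentation outcome (conditions 1 & 2) as an exhaustive parse; shows the result competition should converge on, NOT the inhibition dynamics themselves (toy lexicon = invented):
```python
# Sketch: exhaustive, non-overlapping parses of "shipinquiry".
# Enumerates outcomes satisfying conditions 1 & 2 (1-to-1 mapping,
# no segment left unaccounted for); letters stand in for phonemes.
def parses(stream, lexicon):
    if stream == "":
        return [[]]                      # empty remainder: valid parse
    results = []
    for i in range(1, len(stream) + 1):
        head = stream[:i]
        if head in lexicon:              # head must be a known word
            for rest in parses(stream[i:], lexicon):
                results.append([head] + rest)
    return results

print(parses("shipinquiry", {"ship", "in", "inquiry", "shin"}))
# -> [['ship', 'inquiry']]: "in" matches a portion of the signal, but
#    any parse using it leaves "quiry" unaccounted for (exhaustivity)
```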
PRELEXICAL CUES (TRUBETZKOY (1939))
- listeners learn that certain properties of language = associated w/presence/absence of word boundaries
- these cues modulate lexical competition to favour certain segmentation outcomes
- cues can be proximal (located at word boundary)/distal (further away)
- 3 types:
1) allophonic cues
2) rhythmic cues
3) phonotactic cues
ALLOPHONIC CUES
- phonemes take on particular shape/quality depending on position relative to word/syllable boundaries
IE) ENGLISH - voiceless stop consonants = aspirated at syllable/word onset
- vowels can be preceded by glottal stop at word onset (ie. “grey t(h)ape” VS “great (ʔ)ape”)
- speakers universally lengthen word-initial sounds (ie. great aaaape VS great tttttape)
- syllables tend to be shorter when part of multisyllabic word compared to being a monosyllabic word in own right
RHYTHMIC CUES (CUTLER & BUTTERFIELD (1992))
- relate to beat of language & relative syllable weight
IE) ENGLISH - listeners take STRONG syllables as word onsets
- reflects that STRONG-weak = most common stress pattern in said language
PHONOTACTIC CUES
- each language has own rules on sequencing sounds inside words/syllables & on which sound occupies which position; rules need not be all-or-nothing (aka. restrictions) but can be based on probabilities (ie. position-specific frequencies)
IE) ENGLISH - /ð/ (“th” in “that”) = always followed by vowel within word aka. must be word boundary right after /ð/ when consonant follows (ie. “bathe more”)
IE) FINNISH - vowels = either all of same category (all back OR front) or neutral; changing back -> front (vice versa) in running speech signifies word boundary
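- a toy Python sketch of a phonotactic boundary cue (letters stand in for phonemes; mini-lexicon = invented): a sound pair never seen inside a known word flags a likely boundary:
```python
# Sketch: phonotactic boundary cue from word-internal diphone statistics.
# A letter pair unattested inside any known word signals a likely boundary,
# as with /ð/ + consonant in "bathe more".
LEXICON = ["bathe", "that", "more", "mother", "bat"]

def internal_diphones(lexicon):
    """Collect all letter pairs occurring inside known words."""
    pairs = set()
    for w in lexicon:
        pairs.update(w[i:i + 2] for i in range(len(w) - 1))
    return pairs

INTERNAL = internal_diphones(LEXICON)

def boundary_cues(stream):
    """Flag positions whose diphone never occurs word-internally."""
    return [i + 1 for i in range(len(stream) - 1)
            if stream[i:i + 2] not in INTERNAL]

print(boundary_cues("bathemore"))  # -> [5]: boundary after "bathe",
# because "em" never occurs inside any word of the toy lexicon
```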
SUMMARY
- word recognition = strongly directional; follows unfolding of signal
- lexical competitors = words matching signal from OWN onset; don’t have to be aligned w/word onset
- segmentation can be solved by:
1) recognising words in signal (lexical solution)
2) learning that some linguistic events correlate w/presence/absence of word boundaries (pre-lexical solution)
- phoneme perception = influenced by words we know (interactivity)