Speech Perception Flashcards
stages of speech processing
speech perception+speech comprehension
decode->segment->recognise->integrate
decode:
auditory input->
select speech from acoustic background+transform to abstract representations
segment:
word recognition
-activation of lexical candidates
-competition
-retrieval of lexical information
recognition:
utterance interpretation
-syntactic analysis
-thematic processing
integration:
into discourse model
phonetics
physical properties: loudness, duration, pitch
speech signal is distributed over time: rapidly changing, fast-fading
the speech organs
air flows from lungs, through vocal tract, out of mouth, nose
pressure fluctuations modulated by shape, constriction of vocal tract->sound waves
sources of sound:
-larynx: regular, periodic vibration of vocal folds
-> characteristic pitch
constriction of vocal tract by lips, palate, tongue
->phonemes
phonetics of vowels
mouth as a vibrating chamber
vowels: depend on position of tongue, especially up/down, front/back
-> changes shape of resonating chamber
phonetics of consonants
created by constricting the vocal tract in different ways
phoneme perceived depends on:
- place of articulation
- manner of articulation
- voicing (vibration of voice box)
speech spectrogram
loudness of all frequencies over time (frequency x time)
third dimension: formants
- prominent resonances
- specific frequencies amplified by the shape of mouth
formant transitions
- changes in formants over time, caused by constrictions of the vocal tract, produce different consonant phonemes
- change in relationship between 2nd and 3rd formant due to place of articulation for stop consonants (e.g. b, d, g)
challenges in speech perception
1. segmentation problem
2. variability problem
problem 1: speech is not segmented in separate words
written words have spaces between them to show boundaries
when listening to our native language in clear conditions, we identify words despite ambiguity, but the difficulty is revealed when:
- listening to a foreign language
- mishearing song lyrics (“mondegreens”)
segmentation is a major problem for understanding unfamiliar language and for automatic speech recognition systems
-> how do listeners overcome the ambiguity of the continuous speech stream in familiar language?
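A minimal sketch of why segmentation is ambiguous, assuming letters stand in for phonemes and using a made-up toy lexicon (`TOY_LEXICON` and `segmentations` are illustrative names, not from the course material). It lists every way an unspaced stream can be split into known words:

```python
def segmentations(stream, lexicon):
    """Return every way to split `stream` into a sequence of lexicon words."""
    if not stream:
        return [[]]  # one way to segment nothing: the empty sequence
    results = []
    for end in range(1, len(stream) + 1):
        word = stream[:end]
        if word in lexicon:
            # Keep this candidate word and segment whatever follows it.
            for rest in segmentations(stream[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon; a real mental lexicon holds tens of thousands of words.
TOY_LEXICON = {"no", "now", "here", "where", "nowhere"}

for parse in segmentations("nowhere", TOY_LEXICON):
    print(" ".join(parse))
```

The same stream supports three parses ("no where", "now here", "nowhere"), so the signal alone cannot pick one; extra cues such as stress or context are needed.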
“mondegreens”: why are misperceptions particularly common for song lyrics?
- two signals: music, words
- rhythm of music changes stress, durations
- tune of music changes intonation, durations
- articulation may be imprecise
- pragmatics/semantics of lyrics/poems can be unexpected
problem 2: speech is highly variable
Substantial variability in pronunciation of phonemes, syllables, words both between and within speakers
between speakers:
-gender, accent, language, age all affect acoustic properties of speech
within speakers:
1) linguistic:
- coarticulation: the same phoneme can sound different in different words, e.g. s in soon vs seen, because we’re moving the mouth towards different vowels
e.g. a recording of “leaf” played backwards does not sound like “feel”, and vice versa
2) non-linguistic:
- physical state, emotions
3) paralinguistic:
- speech rate: durations and precision of phonemes differ
- clarity: special effort to reach articulatory targets
effects of audience
adult-directed, infant-directed, pet-directed speech
relationship between 1st and 2nd formants -> defines vowels
vowels much more differentiated in speech to infants (hyperarticulation)
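One way to put a number on "more differentiated" is the area of the vowel triangle spanned by three corner vowels in F1/F2 space. This is only a sketch: the formant values below are made-up illustrative numbers in Hz, not measurements, and the names `ADULT_DIRECTED` and `INFANT_DIRECTED` are hypothetical.

```python
def triangle_area(p, q, r):
    """Area of the triangle with corners p, q, r, each an (F1, F2) pair in Hz."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # Shoelace formula for the area of a triangle.
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0


def vowel_space_area(vowels):
    """Area spanned by the corner vowels i, a, u in F1/F2 space."""
    return triangle_area(vowels["i"], vowels["a"], vowels["u"])


# Illustrative (invented) formant values: vowel -> (F1, F2) in Hz.
ADULT_DIRECTED = {"i": (300, 2300), "a": (750, 1200), "u": (320, 900)}
INFANT_DIRECTED = {"i": (280, 2700), "a": (900, 1300), "u": (300, 800)}

print(vowel_space_area(ADULT_DIRECTED))
print(vowel_space_area(INFANT_DIRECTED))
```

With these invented values the infant-directed triangle is larger, i.e. the corner vowels lie further apart, which is what hyperarticulation means here.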
how do people solve the problems
acoustic information (bottom-up):
- categorical perception
- prosody
- lexical stress
information in long-term memory (top-down):
-context effects: phonotactic, lexical, sentence
multi-modal information:
-lip movements
bottom-up processing: categorical perception
When listening to speech, we hear distinct phonemic categories even though the acoustic changes are gradual, i.e. the speech signal is perceptually categorised into phonemes despite the variability of the actual acoustic signal, especially for consonants
This efficient, automatic categorisation into phonemes reduces sensitivity to ambiguity caused by variability
Continuous changes to the 2nd and 3rd formants of synthetic speech yield ‘categorical’ changes in perception from /ba/ to /da/ to /ga/
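A rough sketch of a categorical identification function, assuming a hypothetical 8-step /ba/-/da/ continuum and a logistic curve. The `boundary` and `slope` values are invented for illustration, and the /da/-/ga/ boundary could be modelled the same way:

```python
import math


def prob_da(step, boundary=4.5, slope=3.0):
    """Probability of labelling a continuum step as /da/ rather than /ba/."""
    # A steep logistic curve: equal acoustic steps, unequal perceptual steps.
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))


def label(step):
    """Most likely phoneme label for a continuum step."""
    return "da" if prob_da(step) >= 0.5 else "ba"


for step in range(1, 9):
    print(step, label(step), round(prob_da(step), 3))
```

Steps 1 to 4 are all labelled /ba/ and steps 5 to 8 all /da/. The acoustic change from step 4 to 5 is no bigger than from step 1 to 2, but only the former crosses the boundary and changes the percept.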
prosody (melody, ups and downs) and stress
The melody of spoken language
Intonation contour: patterns of rising/falling pitch that help to chunk speech into meaningful units (eg phrases, clauses), convey aspects of meaning
e.g. rising pitch for questions
Rhythm of speech is influenced by the pattern of prominent vs non-prominent syllables (stress)
- In some languages stress is very regular (e.g. Spanish, Italian) -> strong cue to word boundaries for segmentation
- More variable in English: typically on the first syllable, but correlated with grammatical class
e.g. contract: CONtract (noun), conTRACT (verb)