Week 8 Speech + Music Flashcards
Vocalisation
- = acoustic energy arising from vocal tract
1. Air pushed from lungs provides molecular disturbance
2. Air crosses vocal folds
o Vocal folds longer in men
o Folds lose tautness with age
3. Resonates through larger cavities
4. Articulated by throat, tongue, lips, teeth, jaw
o Hundreds of fine movements per second
o Also goes through nasal cavity – resonance cavity
o Uses all these muscles to form that air into a particular spatial pattern and push it out of the mouth
o Face is over-represented in somatotopic map
Speech perception requires
- Hearing
o Functional auditory system
o Basic 3 levels of auditory processing
Spatial localisation
Signal-to-noise optimising
Recognition of sound as vocal energy
o Gives you an identifiable sound
- Speech processing – gives you meaning
o Semantic content
o Paralinguistic information – gives extra info beyond just the words
Person speaking
Affective state – gives affective mood information
Intentions – questions vs monotone statements, conveyed by fluctuations in intonation
Non-speech vs speech vocalisation
- Non-speech
o Screaming, laughing, bawling, grunting
o No words
o Things we see in other species that can vocalise
- Speech
o Words
o Semantics + prosody
o Only in humans
Phonemes
Semantic content
= distinct sounds used to create words in spoken language
- The fundamental unit of speech; if you change a phoneme, you can change the meaning of a word
- Written /b/ etc.
- Every language has its own phonemes
- ~ 12 phonemes/sec at normal rate of speech
- Each phoneme has a very specific pattern of acoustic energy
o Specific to phoneme. Specific to individual
o Formants – the frequencies at which peaks of acoustic energy occur
Each vowel-related phoneme has a characteristic formant pattern
Consonants provide formant transitions – rapid shifts in frequency
- A string of phonemes (a spoken word) gives a specific sound spectrogram
o Specific to word, specific to individual
o Fine temporal scale – said fast and processed fast
o Every person has a different spectrogram for the same word
- Tech applications based on this specificity
o If every individual says a word in a specific pattern of energy that someone else can't fake, this can be used as a security device
Voice recognition for security
The way you say your name is as unique as your fingerprint
It may sound the same as someone else's, but the acoustic energy will be different
Speech recognition for communication with machines/computer interfaces
Talk to a computer interface of some sort
Train the computer to recognise words based on the acoustic energy patterns associated with them
Problem: the computer only knows a certain number of patterns, so it doesn't always recognise what you're saying
Your pattern might not match one embedded in the system
Put as many patterns as possible in the system to compensate
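A minimal sketch of this template idea, assuming each vocabulary word is stored as a reference spectral energy pattern; the function names, the 16 kHz rate, and the time-averaging step are illustrative (real recognisers use richer features and alignment), but it shows why a pattern not embedded in the system goes unrecognised:

```python
# Minimal template-matching sketch: store one normalised "acoustic energy
# pattern" per known word, then pick the stored word closest to the input.
import numpy as np
from scipy.signal import spectrogram

def energy_pattern(samples, rate=16000):
    """Time-averaged, loudness-normalised spectral energy of one utterance."""
    _, _, sxx = spectrogram(samples, fs=rate, nperseg=256)
    pattern = sxx.mean(axis=1)                 # average energy per frequency bin
    return pattern / np.linalg.norm(pattern)   # normalise away overall loudness

def recognise(samples, templates, rate=16000):
    """Return the known word whose stored pattern best matches the input.
    `templates` maps word -> reference pattern; a word missing from it can
    never be recognised, which is the vocabulary-size problem noted above."""
    pattern = energy_pattern(samples, rate)
    return max(templates, key=lambda word: float(pattern @ templates[word]))
```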
Sound spectrograms
- Illustrate phoneme spectra, cadence, intonation
- Small time scale
o Speech has a fine temporal structure
o There are rapid fluctuations in acoustic energy
- How do we make sense of the rapid acoustic energy?
o The brain can bind phonemes and parse streams of acoustic energy into words
- ~5% of children are perceptually speech impaired (LBLI – language-based learning impairment)
o The brain isn't able to grasp binding and parsing
o Perceptual processing takes longer to work out the temporal structure of speech – talking slower helps
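For reference, a minimal sketch of how a sound spectrogram like the ones described here can be computed; the file name is hypothetical and a mono recording is assumed:

```python
# Compute and plot a sound spectrogram: time on x, frequency on y, energy as
# colour. Formants show up as horizontal energy bands; formant transitions as
# the rapid frequency shifts around consonants.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("speech.wav")   # hypothetical mono recording
freqs, times, sxx = spectrogram(samples, fs=rate, nperseg=512)

plt.pcolormesh(times, freqs, 10 * np.log10(sxx + 1e-12))  # energy in dB
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Sound spectrogram")
plt.show()
```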
Bottom-Up Processing
maybe each phoneme is indicated by a specific fibre activation pattern (pattern encoding)
- Level: auditory fibres in cochlea
o Tonotopic basilar membrane in the cochlea
High frequency at the base, low frequency at the apex
Each hair cell has a characteristic frequency it responds to best
o Maybe each spectrogram maps to a specific 'neurogram'
o As you say a word
Each phoneme delivers a different amount of energy at different frequencies
The activation pattern maps straight from the cochlea and travels upstream
There is a different pattern of activation for each sound – the brain puts it together
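The base-to-apex frequency map has a standard approximation in the hearing literature, the Greenwood function; a minimal sketch using the commonly cited human constants (which come from that literature, not these notes):

```python
# Greenwood-function sketch of cochlear tonotopy: best frequency as a function
# of position along the basilar membrane (x = 0 at the apex, x = 1 at the base).
# Constants A = 165.4, a = 2.1, k = 0.88 are the commonly cited human values.
import numpy as np

def best_frequency(x):
    """Approximate best frequency (Hz) at relative position x from the apex."""
    return 165.4 * (10 ** (2.1 * x) - 0.88)

for x in np.linspace(0, 1, 5):
    print(f"{x:.2f} of the way from apex to base -> ~{best_frequency(x):.0f} Hz")
# Prints ~20 Hz at the apex rising to ~20 kHz at the base: low frequencies
# at the apex, high frequencies at the base, as described above.
```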
Shortcomings (Speech processing – content)
o Coarticulation
Sometimes we say a phoneme but the energy differs depending on what comes after it
/d/ will activate different fibres on the cochlea's basilar membrane in different situations – different words
The acoustic energy is different, but the resulting perception is the same
E.g. /di/ and /du/ have different spectrograms for the /d/, but in both instances we hear /d/
o Phoneme variance
/t/ from different people has different spectrograms, but we understand each other's /t/
/t/ at 25 dB and /t/ at 60 dB – a whisper vs loud speech – have different spectrograms, but both are understood as /t/
The resulting perception is the same
This indicates perceptual constancy is occurring
This occurs at the word and sentence levels too
The stimuli are different but the resulting perceptions are the same
E.g. 'the' – 50 different spectrograms – all understood as 'the'
Perceptual constancy (Speech processing – content)
o Different incoming stimuli result in the same perceptual interpretation
o Examples of perceptual constancy
Phoneme constancy
Speech constancy
o Suggests a lot of top-down processing is occurring in our understanding of speech – we have built a knowledge base and rules about patterns to help perceive speech
Cortical, Top-Down Processing
speech perception needs some incoming bottom-up sensory information but cortical processing is critical
- Via experience, build a knowledge base to assist in interpreting and understanding sensory info coming in from the environment
o Apply knowledge base to the signal to make sense of it
- Level: auditory cortex
o A1 + secondary/tertiary bands
o A1 does not preferentially respond to speech; it responds to any noise
o Some association areas do preferentially respond to speech
- Wernicke’s area
o In left hemisphere
o Critical for hearing words
o Damage = intact hearing but can’t understand speech
Receptive aphasia – can hear the signal but not the semantics; no meaning is attached
- Broca’s area
o In left hemisphere
o Critical for speaking words
o Damage = intact hearing but can’t speak coherent words
Expressive aphasia
Can understand what is spoken but can't put sounds together into meaningful words
Paralinguistic info
- Right hemisphere
o Areas of the temporal lobe, prefrontal cortex, limbic system
o Strong activation to emotional intonation and non-speech utterances
o Strong activation during speaker identification
o Who is talking, are they in a good mood, are they asking questions?
- Prosody – intonation
o Signals intent – declarative/interrogative
o Signals mood
- Damage
o Dysprosody = content intact but can't understand intonation
o Phonagnosia = content intact but can't identify the speaker
o Autism – problems in superior temporal sulcus processing
Understand speech fine but miss nuances – sarcasm, comedy, anger
Intonation-based, not content-based
Speech perception requires
- Hearing
o Functional auditory system – pinna to A1
o Basic 3 levels of auditory processing – A1, parabelts, ventral/dorsal streams
- Speech processing
o Content – Wernicke’s, Broca’s – LH
o Paralinguistic info – person, affective state, intentions
If you don’t have these – miss subtleties of human interaction
RH laden
STS, PFC, limbic system
Phoneme recognition
speech processing
o Acoustic energy pattern
o Phonemic restoration effect (see the sketch after this list)
You do need an acoustic energy pattern coming in – although the pattern may be missing some of the phonemes you 'heard'
Use knowledge to fill in the missing phonemes
People might not pronounce all the phonemes in a word – the brain fills in the gaps (the restoration effect)
o Indexical characteristics
Knowledge about the person
Their accent – using that info to help understand what they’re saying
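A minimal sketch of the restoration idea as lexicon lookup, with a hypothetical word list standing in for the mental lexicon; a masked phoneme ('_') is filled in by whichever stored words fit the surrounding pattern:

```python
# Fill in a missing phoneme by matching the partly-heard word against a
# (hypothetical) mental lexicon; '_' marks the masked segment.
import re

LEXICON = {"legislature", "legislator", "ligature", "literature"}

def restore(heard):
    """Return every lexicon word consistent with the partly masked input."""
    pattern = re.compile("^" + heard.replace("_", ".") + "$")
    return [word for word in LEXICON if pattern.match(word)]

print(restore("legi_lature"))   # ['legislature'] – knowledge fills the gap
```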
Context
speech recognition
o Language-appropriate combinations of letters
o Topic of conversation
Use context to help figure out what they’re saying
Boundaries
speech recognition
o Language-appropriate combinations of syllables
There are no boundaries between the words you speak, but you hear separate words
The breaks are an auditory illusion – the acoustic energy is continuous, but you can parse the words
Need to parse in the right places to separate the info into meaningful words
E.g. 'so I got out of bed this morning' vs 'so I got out of bed this morn ing'
o Knowledge of vocabulary (sketched below)
Need knowledge of English and its vocabulary
We perceptually hear separate words based on knowledge of the language
The only time we usually break acoustically is to signal we're done in the flow of conversation
It sounds weird when people actually pause between words
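A minimal sketch of this parsing step as the classic word-break recursion, with a hypothetical vocabulary; the input stream has no boundaries, and the word breaks fall out of vocabulary knowledge alone:

```python
# Segment a continuous, boundary-free stream into known words using only
# vocabulary knowledge (top-down), via memoised recursive word-break.
from functools import lru_cache

VOCAB = {"so", "i", "got", "out", "of", "bed", "this", "morning", "morn", "in"}

def segment(stream):
    """Return one parse of `stream` into vocabulary words, or None."""
    @lru_cache(maxsize=None)
    def parse(i):
        if i == len(stream):
            return []                         # reached the end: valid parse
        for j in range(i + 1, len(stream) + 1):
            if stream[i:j] in VOCAB:          # candidate word boundary
                rest = parse(j)
                if rest is not None:
                    return [stream[i:j]] + rest
        return None                           # no boundary works from here
    return parse(0)

print(segment("soigotoutofbedthismorning"))
# ['so', 'i', 'got', 'out', 'of', 'bed', 'this', 'morning'] – the acoustic
# stream is continuous; the breaks come from knowledge of the vocabulary.
```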