Lecture 8 - Categorical Perception and Learning Flashcards
Statistical Learning
Through mere exposure, we seem to learn
what kinds of things go with other kinds of things.
We learn contingencies over time.
The line between associative and non-associative learning starts to blur.
Through perceptual learning, we seem to BUILD and STORE
specific stimulus distinctions.
• These stimulus features can be used to identify and
categorize different types of things.
- Once established, these feature categories become the basis for top-down perceptual processing (e.g. recognizing feathers on male and female chicks).
- We have a storehouse of different objects and can filter that down to the environment.
Example: Perceiving breaks between words
The segmentation problem
How do you find the breaks when there’s always a continuous signal?
- there are no physical breaks in the continuous acoustic signal of speech. [High computational complexity]
– Top-down processing, including knowledge a listener has about a language,
affects perception of the incoming speech stimulus (parse the speech as it’s coming in).
– Segmentation is affected by context, meaning, and our knowledge of word structure.
non-associative learning
helps us see how we respond to and distinguish stimuli in the environment and our responses
perceptual learning - we become better and better at telling things apart
associative learning
find different contingencies (two different stimuli: classical; response and outcome: operant)
just building contingencies between two things (learning language)
What kind of learning reviewed so far seems specifically
useful for speech segmentation?
Statistical learning
helps us know when the breaks are coming: knowing the probabilities of when certain syllables tend to follow other syllables
Saffran, Aslin & Newport (1996)
demonstrated that
infants can detect word boundaries using
differences in transitional probabilities. [innate tendency]
- we have the innate ability to track different contingencies
• A continuous stream of sounds becomes segmented.
…bidakupadotigolabubidakutupiro…
…bidaku/padoti/golabu/bidaku/tupiro…
• And this should apply to natural speech.
…lookattheprettybaby…
…look/at/the/pretty/baby…
High likelihood: PRE –> TTY
High likelihood: BA –> BY
Low likelihood: TTY –> BA
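The idea of tracking which syllables tend to follow which can be sketched in a few lines of Python. This is only an illustration, not the procedure from Saffran, Aslin & Newport (1996): it assumes the transitional probability of Y given X is estimated as (count of XY) / (count of X), and the toy stream, the word order, and the 0.9 threshold are all hypothetical choices.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next | current) for every adjacent syllable pair."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # how often each syllable starts a pair
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def segment(syllables, tps, threshold):
    """Place a word boundary wherever the transitional probability is below threshold."""
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tps[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words

# Hypothetical toy stream: three made-up words presented in a varying order.
lexicon = {"A": ["bi", "da", "ku"], "B": ["pa", "do", "ti"], "C": ["go", "la", "bu"]}
order = "ABCACBABCBACCAB"  # assumed presentation order, for illustration only
stream = [syll for word in order for syll in lexicon[word]]

tps = transitional_probabilities(stream)
words = segment(stream, tps, threshold=0.9)  # threshold is an assumption
print(words[:3])
```

Within a word, each syllable is always followed by the same syllable, so those transitions stay high; across word boundaries the next syllable varies, so the probability dips and a break is placed there.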
Perceiving features
In order to track probabilities, we need to first distinguish
basic features (e.g. syllables) of the stimulus.
have to be able to ID syllables and be able to build those categories up
Some feature detection seems to be innate.
constraints
• Frogs have ‘bug’ detectors: group of cells that detect the size and shape and movement pattern of bugs that induces them to flick out their tongues (Lettvin et al., 1959).
• The visual system has simple and complex edge detectors (straight lines, edges); these occur as early as you can train the system
(Hubel & Wiesel, 1959, 1962).
• Babies have phonetic discrimination for all language
sounds up to 10 months of age.
But all of these feature detectors seem to be shaped by both experience and ‘top-down’ influences.
we have all these innate abilities to detect things in the environment, but we can shape them with top-down knowledge
- experience dependent plasticity
- Critical periods (e.g. phonetic discrimination)
- Mere exposure and discrimination training
- we can form many many different types of representations
How do we (as babies) initially discriminate the different
phonemes (speech sounds) that make up syllables?
[Figure: Acoustic speech waveform –> Phonemes ([d]; [da], [di], [du]) –> Words (Don, dean, dune)]
babies can make discrimination from the acoustic signals that make up syllables
we pull out phonemes (smallest perceived sound from a sound signal)
phonemes can be attached to vowel sounds which creates a syllable and those syllables create words
Sound spectrograms
are often used to show changes in frequency
and intensity for speech.
– These are plotted by frequency (and amplitude) over time.
– Formants are the enhanced
(darker) bands of frequencies.
Consonants
are produced by a constriction of
the vocal tract (using the articulators).
Formant transitions
rapid changes in frequency preceding or following
consonants as you’re producing a sound
when you produce a “duh” or “buh”
This results in production of the basic unit of
speech sound – the phone.
phone
speech signal
the basic unit of
speech sound
phoneme
thing you understand
smallest unit of perceived speech stimulus that changes meaning of a word (bad vs pad). These are defined by your language.
if you change the phoneme you’re changing the meaning of the word that it’s attached to
The variability problem
there is no simple
correspondence between the acoustic signal (phones) and perceived phonemes.
- no one thing in the signal that you can “key in on”
Perceiving features in speech… is hard
Variability from context:
the acoustic signal associated with a phoneme
varies with acoustic context.
what the phoneme or phone is being attached to
coarticulation
Coarticulation:
overlap between
articulation of neighboring
phonemes causes variation in formant transitions. Yet, we still perceive the same /d/.
while you’re articulating one phone, it’s attached to other phones and you’re trying to articulate the next phone as well
you’re articulating all those things together at once
you’re always pairing that acoustic info with other acoustic info
Variability from different
speakers
– Speakers differ in pitch,
accent, speed in speaking, and pronunciation.
– This acoustic signal must be
transformed into familiar
phonemes and words.
How?
One way we deal with the
variability problem is through
categorical perception.
(one of the ways)
it leads us through the valley
– This occurs when a continuum of stimulus energies (a lot of acoustic signals coming at you) is perceived as a limited number of sound categories (you don’t hear a continuous stimulus; it’s broken down).
– This can be accomplished through the use of acoustic cues (sets different syllables and phonemes apart).
acoustic cue
example
– An example of this comes from experiments on voice onset time (VOT): time delay between when a sound starts and when voicing (vocal cord vibrating) begins.
• Stimuli are /ba/ (short VOT)
and /pa/ (long VOT)
CogLab #40
VOT
You (n = 224) heard 9 different synthetic speech stimuli with a range of VOTs from short (0 ms) to long (80 ms).
• Task: What do you hear? (pa
or ba – identification).
- dependent on the critical period: 10-12 months of exposure to these phonemes
- Thus, we experience perceptual constancy for the phonemes within a given range of VOT.
Perhaps, as babies, we perceive basic speech information by:
- Using innate (species-specific) perceptual abilities to identify phones by acoustic cues (e.g. VOT).
- Relying on mere exposure to allow these categories to become (and remain) clear.
- Once we have those categories we can track which sounds go together to form words using statistical learning.
- Later, improving performance when speaking using discrimination training (with operant conditioning).
- highly dependent on the environment: feedback that helps train the system
phonetic boundary.
As you increase VOT, listeners do not hear the incremental changes. Instead
they hear a sudden change from /da/ to /ta/
if two stimuli are on opposite sides of the phonetic boundary, then you hear two different things
great constraint
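The sudden switch at the phonetic boundary can be sketched as a steep identification function over VOT. This is a hedged illustration, not the CogLab data: the logistic shape, the 35 ms boundary, the slope, and the evenly spaced 10 ms steps are all assumptions chosen to show the pattern for the /ba/ vs /pa/ stimuli.

```python
import math

BOUNDARY_MS = 35.0  # assumed phonetic boundary (illustrative, not measured)
SLOPE = 0.5         # assumed steepness of the category switch

def p_pa(vot_ms):
    """Assumed logistic identification function: probability of reporting /pa/."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))

def identify(vot_ms):
    """Label a stimulus by whichever category is more likely."""
    return "pa" if p_pa(vot_ms) >= 0.5 else "ba"

def heard_as_different(vot_a, vot_b):
    """Two stimuli sound different only if they fall in different categories."""
    return identify(vot_a) != identify(vot_b)

vots = list(range(0, 81, 10))  # 9 stimuli from 0 to 80 ms (even spacing assumed)
labels = [identify(v) for v in vots]
for v, label in zip(vots, labels):
    print(f"{v:2d} ms -> {label}  P(pa) = {p_pa(v):.2f}")

# Same 20 ms physical difference, different perceptual outcome:
print(heard_as_different(10, 30))  # both on the short-VOT side
print(heard_as_different(30, 50))  # straddles the assumed boundary
```

Under these assumptions, stimuli on the same side of the boundary get the same label (perceptual constancy), while an equally sized VOT difference that crosses the boundary is heard as two different sounds.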
Is there a theoretical model that shows how this might be done (and is biologically plausible)?
McClelland & Rumelhart’s (1981) Interactive Activation Model
developed a connectionist
model which may account for
some patterns in language
learning.
- Originally developed for printed language (but can be used for acoustics as well).
- Start off: Feature detectors are activated when they match the stimulus. (Note: they can be spatially sensitive.)
- Each detector is sensitive to a line of a certain orientation: if that line is part of a letter (e.g. T), that letter node becomes active.
- Excitatory connection: excites the T node (and activates the "T" words).
- Inhibitory connection: for a letter that lacks the feature (e.g. L: "we're not an L"), don't activate the "L" words.
- They excite letter nodes when the detected feature is part of the represented object (otherwise inhibit).
- Letter nodes excite word nodes if they are a part of the word representation (otherwise inhibit).
- All letter stimuli are evaluated individually.
• R and K are equally likely letters
in the fourth position, based
purely on features. The D doesn’t
match at the feature level.
all the letters that are part of WORK activate those nodes because they’re highly likely (we can track that)
individual letters are primed or pre-activated
• The “WORK” node is already
activated and sends feedback to K to pre-activate it (priming?).
• We might explain this
behaviorally, noting that R has a low probability of following R.
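The WORK example can be sketched as a tiny feedback loop between letter nodes and word nodes. This is a deliberately simplified illustration, not McClelland & Rumelhart's actual equations or parameters: the activation values, the two-word lexicon, the feedback rate, and the number of steps are all hypothetical, and inhibitory connections are left out.

```python
def run_interactive_activation(letter_act, context, lexicon, feedback=0.5, steps=3):
    """Toy letter <-> word loop for one ambiguous final letter (excitation only)."""
    letter_act = dict(letter_act)  # copy so the caller's values are untouched
    for _ in range(steps):
        # Bottom-up: a word that fits the context is excited by its final letter.
        word_act = {
            word: letter_act.get(word[-1], 0.0)
            for word in lexicon
            if word.startswith(context)
        }
        # Top-down: each active word sends feedback to its own final letter.
        for word, act in word_act.items():
            last = word[-1]
            letter_act[last] = min(1.0, letter_act.get(last, 0.0) + feedback * act)
    return letter_act

# Hypothetical feature-level support for the fourth letter:
# R and K match the visible features equally well; D does not match.
features = {"R": 0.5, "K": 0.5, "D": 0.0}
lexicon = ["WORK", "WORD"]  # assumed mini-lexicon ("WORR" is not a word)

result = run_interactive_activation(features, context="WOR", lexicon=lexicon)
print(result)
```

In this sketch R and K start out tied, but only K belongs to a word node, so feedback from WORK raises K above R. D stays at zero because it never had feature-level support for a word node to amplify.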