Multisensory integration Flashcards
What content is covered in multisensory integration?
- How do different sensory modalities interact to alter what we perceive?
- What are the principles that govern interactions between modalities?
Here are some of the concepts we will cover:
- Initial sensory processing as modular followed by subsequent stages of integration
- Examples of perceptual dominance / ‘capture’ between sensory modalities, and initial thinking about temporal and spatial rules to describe how sensory signals are selected
- Combination of visual and auditory signals (e.g. McGurk effect)
- Sensory signals as probabilistic cues to the properties of the world and the prevalence of noise
- Statistically optimal inference using Maximum Likelihood Estimation: linear weighted combination and minimising variance
- Prior knowledge and the Bayesian framework for perception, cognition and action
Why does multisensory processing significantly improve perceptual estimates? (2 reasons)
Because (i) information from an individual sensory modality can be ambiguous by itself, but it is constrained by information from another modality (Sekuler et al.'s 1997 bouncing-ball illusion is a nice illustration); and (ii) all sensory processing is subject to noise at multiple levels from the stimulus upwards (e.g., photon noise, phosphenes / tinnitus, the stochastic nature of action potentials and the potential to lose information in the dynamic processing of the brain). This noise guarantees that our sensory estimates of the world are imprecise and can lead to inaccuracies, which increases uncertainty and variability at higher cognitive levels. The impact of this imprecision and uncertainty will vary by task, but it affects diverse behaviours from hitting a tennis ball to driving a car.
Which experiment showed that cross-modal perception develops in early childhood? - Spelke (1976, Cognitive Psychology)
Research (Spelke, 1976, Cognitive Psychology) has shown that the ability to match perceptual information across the senses is present from early childhood. Infants seem able to connect visual and auditory information from soon after birth. In one study, 4-month-old infants were seated in front of two screens, each showing a different film at the same time. One film was of a woman playing “peek-a-boo”, and the other was of a baton hitting a wooden block. Meanwhile, the soundtrack from one film was played from a loudspeaker positioned between the two screens. The researchers found that the infants preferred to look at the screen that matched the soundtrack. This preference for congruence across modalities is also found with videos of faces when the soundtrack is played out of sync: the infants become fussy and prefer to look at the video where the mouth movements are in sync with the soundtrack. If you have ever watched a film or TV programme where the speech and soundtrack are out of sync, then you know this preference persists into adulthood.
Explain: ‘Initial sensory processing as modular followed by subsequent stages of integration’
Mirroring the history of perceptual investigation, and our basic intuitions (e.g., “sight is different from sound”), we conceive of sensory processing as initially modular (i.e. separate ‘windows’) followed by subsequent stages of combination. This modularity has significant computational benefits in that the processing noise from one sensory modality is independent of other modalities. This means that noise can cancel out, making perceptual estimates better. (If the noise isn’t independent, we gain much less from multisensory processing). The basic organisation of the cerebral cortex suggests a division of labour (e.g. visual cortex, auditory cortex, somatosensory, etc) and modularity. [It is worth noting that there is considerable evidence for multisensory processing within individual key sensory areas (Ghazanfar & Schroeder, 2006 have a provocative paper on this topic)]. Nevertheless, initial sensory processing is broadly modular.
Give examples of one sensory modality being perceptually dominant over another.
There are several examples of situations in which vision dominates auditory signals (e.g. the ventriloquist effect, which relies on viewers inferring the source of the sound from the movements of the puppet’s mouth [not on ventriloquists ‘throwing’ their voices]), and others where sound dominates vision (e.g., your perception of the number of visual events is dominated by the number of auditory events, as in the beeps and flashes of Shams et al., 2000).
One widely held idea is that different sensory modalities have different dominances because they are intrinsically better at signalling certain types of information. Specifically, sound has much better temporal resolution than vision and should therefore dominate tasks with a fine temporal component. By contrast, vision has much better spatial resolution (on the order of ~100:1) and thus should ‘capture’ spatial tasks. This notion has intuitive appeal; however, as a general computational solution it is not robust. Moreover, it simply doesn’t make much computational sense, as it involves throwing out one source of information. There are circumstances in which it is sensible to do that, but generally the brain is much better served by using all the information it has at its disposal.
What does the McGurk effect indicate?
In general, then, visual and auditory information combine to influence perception. A good example is the McGurk effect, where a film of a person visually saying /ga/ with a soundtrack of a person saying /ba/ is heard as /da/. Participants thus perceive the most likely interpretation of the event in the world, one that corresponds to neither sensory modality in isolation.
Explain why combining multisensory signals minimises variability.
If this seems a little complicated, the basic idea is this: if you have two pieces of information, use them both, but trust the more reliable one more.
We can characterise the reliability in a modality by the spread of responses we get if we were to repeat a stimulus many times. This is referred to as the variance. In fact, we typically use a Gaussian model (i.e., a Normal Distribution) to describe sensory estimates. This is mathematically convenient (there are only two parameters: the mean and the variance) but is also quite reasonable as a Gaussian is generally useful in modelling the sum of many independent stochastic processes (e.g., neural signals).
Given two (independent) Gaussian distributions (one for each modality) for the estimate of a particular property (e.g., the azimuth of a target), the statistically optimal thing to do is to combine all the information. We do this by multiplying the two distributions together. Initially, this has a complex mathematical form, but it turns out that it reduces to something quite simple. The resultant estimate is a Gaussian that is centred between the two component estimates and has lower variance (i.e. higher reliability) than either modality alone. In particular, the mean of this combined estimator is a weighted average of the means of the individual cues, where the weights are determined by the reliability of each cue. Graphically, the combined distribution sits closer to the more reliable individual cue and has a higher peak (i.e., is less variable) than either of the individual components.
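To make this concrete, the combination rule can be written out as below. This is the standard result for two independent Gaussian cues; the symbols (\hat{s}_V and \hat{s}_A for the single-modality estimates, \sigma_V^2 and \sigma_A^2 for their variances) are notation chosen for this sketch rather than taken from the course materials.

```latex
% Standard MLE combination of two independent Gaussian cues.
% \hat{s}_V, \hat{s}_A are the single-cue estimates and \sigma_V^2, \sigma_A^2
% their variances (notation chosen for this sketch).
\hat{s}_{VA} = w_V \hat{s}_V + w_A \hat{s}_A,
\qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_A^2},
\qquad
w_A = \frac{1/\sigma_A^2}{1/\sigma_V^2 + 1/\sigma_A^2}

\sigma_{VA}^2 = \frac{\sigma_V^2\,\sigma_A^2}{\sigma_V^2 + \sigma_A^2}
\le \min\!\left(\sigma_V^2, \sigma_A^2\right)
```

Because the weights depend on reliability (the inverse of variance), the combined mean is pulled towards the more reliable cue, and the combined variance is always smaller than that of either cue alone.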
What is the Maximum Likelihood Estimation (MLE) model?
This Maximum Likelihood Estimation (MLE) model provides some principled rules for determining how that basic intuition might be implemented. In particular, it has nice features in relation to dominance vs. combination phenomena. Specifically, dominance is a situation in which one cue is much more reliable than the other (i.e., one distribution is far narrower than the other). In this case, the mean of the combined estimate is very close to the mean of one of the cues. Experimentally, it would be very difficult to separate this combined estimate from the single component, giving the appearance of dominance (even though signals are actually being combined). Thus, a single process can account for a range of different previously observed experimental results (from capture to combination).
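A minimal Python sketch of this point is below; the numbers are purely illustrative (not drawn from any experiment), and the combine function is a hypothetical helper written for this example.

```python
# Minimal sketch of MLE cue combination (illustrative numbers only, not from any study).
# Reliability = 1 / variance; the combined mean is a reliability-weighted average.

def combine(mean_a, var_a, mean_b, var_b):
    """Return the MLE-combined mean and variance of two independent Gaussian cues."""
    w_a = (1 / var_a) / (1 / var_a + 1 / var_b)
    w_b = 1 - w_a
    combined_mean = w_a * mean_a + w_b * mean_b
    combined_var = (var_a * var_b) / (var_a + var_b)
    return combined_mean, combined_var

# Comparable reliabilities: genuine combination, the mean sits between the cues.
print(combine(mean_a=0.0, var_a=1.0, mean_b=4.0, var_b=1.0))    # (2.0, 0.5)

# One cue far more reliable: the combined mean sits almost on top of it,
# which experimentally looks like 'dominance' or 'capture'.
print(combine(mean_a=0.0, var_a=0.01, mean_b=4.0, var_b=1.0))   # (~0.04, ~0.0099)
```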
Experimental evidence for the Maximum Likelihood Estimation model: Ernst and Banks (2002, Nature)
The classic study on multisensory cue integration was conducted by Ernst and Banks (2002, Nature). They studied the perception of object size from visual and haptic (i.e., felt with the hands) cues. Their task was akin to judging the thickness of a telephone directory based on (1) looking at it; (2) grasping it between thumb and forefinger; or (3) looking and grasping at the same time.
They measured how well participants judged size by having them compare a standard object against a range of comparison objects that varied in size from smaller than the standard to … larger than the standard. They measured reliability for judging size, i.e., how much bigger or smaller the comparison object had to be before participants could spot the difference. Participants were better (i.e. could spot a smaller change in size) with visual cues than haptic cues, and crucially performance with vision and haptics together was better than with either single cue alone.
They then introduced a slight discrepancy between the visual and haptic sizes. Under normal circumstances, judged size is closer to the visual cue in this conflicting case because the visual modality is more reliable. However, by adding visual ‘noise’ to the display (they randomised the positions of some of the elements that defined the object’s size) they could reduce the reliability of participants’ judgments based on vision alone. According to the MLE model, this should reduce the weight given to the visual cue. Indeed, they showed that as the reliability of the visual cue was reduced, participants gave more weight to the haptic size: weights changed from visual dominance when no noise was added (so visual information was very reliable) to haptic dominance when a lot of visual noise was added. That is, behaviour changed smoothly from ‘visual capture’ to ‘haptic capture’.
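As a rough illustration of that weight shift, the sketch below sweeps the visual variance while holding the haptic variance fixed; the values are invented for illustration and are not Ernst and Banks’s data.

```python
# Sketch of how the MLE visual weight falls as visual noise is added.
# The variances are made-up illustrations, not Ernst and Banks's measured values.

haptic_var = 1.0  # haptic reliability held fixed

for visual_var in (0.1, 0.5, 1.0, 2.0, 10.0):
    visual_weight = (1 / visual_var) / (1 / visual_var + 1 / haptic_var)
    print(f"visual variance {visual_var:5.1f} -> visual weight {visual_weight:.2f}")

# The weight runs from ~0.91 (close to 'visual capture') down to ~0.09
# (close to 'haptic capture'), passing through 0.50 when the variances match.
```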
Many studies provide compelling evidence that the brain uses a process very much like the ‘statistically optimal’ one. The basic paradigm has been replicated many times over the past decade, confirming the finding for different cue combinations, both within and between sensory modalities (e.g., combinations of vision, touch and audition for a range of different tasks).
Explain Bayes’ rule.
The underlying insight of the Bayesian framework is that when making an inference (e.g., listening to speech in noise), it is sensible to use prior knowledge of the statistical likelihood of experiencing a word at that position in a sentence. A classic example is the interpretation of the phrase ‘Forkandles’. This acoustic information is ambiguous: without knowing the context, you cannot tell whether it is meant to be ‘Fork Handles’ or ‘Four Candles’. Nevertheless, humans typically interpret speech by drawing on context from past experience.
Using the Bayesian framework, we can think about prior information in exactly the same way as we thought about individual sensory cues. It will have a probability distribution and can be combined with sensory data using the same rules we discussed for integrating different sensory modalities (i.e. multiplying probability distributions). This strategy is the statistically best thing to do: it means that we have the best chance of working out what is happening in the world around us. And that is what perception is all about.
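For reference, Bayes’ rule itself can be written as below; the symbols s (the state of the world, e.g. the intended word) and d (the sensory data) are notation chosen for this sketch.

```latex
% Bayes' rule, with s = the state of the world (e.g. the intended word)
% and d = the sensory data; the notation is chosen for this sketch.
P(s \mid d) = \frac{P(d \mid s)\,P(s)}{P(d)}
\;\propto\;
\underbrace{P(d \mid s)}_{\text{likelihood (sensory evidence)}}
\times
\underbrace{P(s)}_{\text{prior (past experience)}}
```

When the likelihood and the prior are both Gaussian, multiplying them gives exactly the reliability-weighted combination described for the MLE model above, with the prior acting as just another cue.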