Lectures Flashcards
different types of computational modelling have radically different…
assumptions about the nature of cognition
most forms of computational modelling…
involve some form of simulating a cognitive process
ie. input -> “model” -> behavioural output
models differ in their level of analysis
Marr’s levels:
- computational
- algorithmic
- implementational (neural)
how does computational modelling aid in understanding human behaviour?
by establishing a concrete definition of a cognitive process
origins of modelling
computer simulations have been popular since early years of psychology
the importance of computation was recognized at an early stage, e.g. Turing in 1950
Wiener (1948) and Shannon (1949) developed early mathematical theories of information and communication
Society for Computation in Psychology
Wiener and Shannon
Wiener (1948) and Shannon (1949)
conducted early work in mathematical theories of information and communication
Society for Computation in Psychology
formed in 1971
one of the early subgroups of cognitive psychology
prof is a member
2 types of analytical models
- recognition memory experiment
- signal detection theory
recognition memory experiment
presented with a list of words
presented with pictures of those words
tested on whether words are old or new
people sometimes falsely accept items that were never presented
signal detection theory
measures the ability to distinguish between two distinct patterns: signal and noise
the first pattern (the signal) is the one you’re supposed to pay attention to
the second pattern (the noise) is random variation that interferes with a person/machine’s ability to collect and process info
essentially looks at how easy/difficult it is for someone to process info and respond to it when they’re also exposed to background noise/distractions
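a minimal sketch of how that detection ability can be quantified (toy counts assumed, not from the lecture; d′ is the standard sensitivity measure in signal detection theory):

```python
# d-prime: how well someone separates signal (old items) from noise (new items)
from scipy.stats import norm

hits = 80               # "old" responses to old words (signal trials)
misses = 20
false_alarms = 15       # "old" responses to new words (noise trials)
correct_rejections = 85

hit_rate = hits / (hits + misses)                             # 0.80
fa_rate = false_alarms / (false_alarms + correct_rejections)  # 0.15

# d' = z(hit rate) - z(false-alarm rate); higher d' = easier discrimination
d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)
print(round(d_prime, 2))  # ~1.88
```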
the primary model type we’ll look at in this course…
simulation models
output of model isn’t deterministic
underlying randomness in the model (typically implemented with random number generators)
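a minimal sketch of what that randomness looks like in practice (an assumed toy evidence-accumulation model, not a model from the course):

```python
# a noisy random walk toward one of two response boundaries;
# the random number generator makes every run (and its timing) different
import random

def simulate_trial(drift=0.1, boundary=10.0, noise=1.0):
    evidence, steps = 0.0, 0
    while abs(evidence) < boundary:
        evidence += drift + random.gauss(0, noise)  # underlying randomness
        steps += 1
    return ("A" if evidence > 0 else "B", steps)    # choice + response time

print([simulate_trial() for _ in range(3)])  # non-deterministic output
```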
mind as computer
Pylyshyn 1984
mind takes in information from senses
integrates them and creates perceptual experience and behaviour
knowledge acquisition: empiricists vs Plato/Chomsky
empiricists (e.g. Locke): knowledge must be gained via experience
Plato and Chomsky: we are born with innate knowledge and learning mechanisms
poverty of the stimulus
we cannot possibly hear every form of language we produce before we learn it
we produce more language than we experience
and all possible language is even greater than the language we produce
the difference between ‘language experienced’ and ‘language produced’ is accounted for through…
innate knowledge
possible solution: Simon (1969)
discussing the path taken by an ant on a beach, Simon noted that the ant’s path is “irregular, complex, hard to describe. But its complexity is really a complexity in the surface of the beach, not a complexity in the ant.”
big data and natural language processing
collection of large text sources has changed how we think about studying language
possible to propose learning mechanism and train on realistic data
a model can be “born” into a realistic language environment
we then gain insights into cognition and language performance by examining how the model learns and functions
also is a powerful natural language processing tool
T/F: virtual environments are approaching real world complexity levels
true
language learning: bi-directional benefit
we benefit from using large, realistic text sources because we can train models on them
the models give us insight into cognition/language performance/learning
also become powerful natural language processing tools
corpus-driven modelling
identifies strong tendencies for words/grammatical constructions to pattern together in particular ways
while other theoretically possible combos rarely occur
corpus-driven modelling allows for…
connections between lexical experience and lexical behaviour
first corpus ever
Brown corpus of Kucera and Francis
1967
consisted of about 1 million words, sampled from different areas
examples of text-based resources now available for use for corpus-driven modelling
Grade 1-12 textbooks
Scientific journal articles
Newspaper articles
Wikipedia
TV and movie subtitles
Books
Urban dictionary
distributional models of semantics
usage-based model of meaning
based on assumption that statistical distribution of linguistic items in context plays key role in characterizing their semantic behaviour
distributional models build semantic representations by extracting co-occurrences from corpora
internal versus external theories of cognition
internal: involves attending internally to thoughts, memories and mental imagery
external: involves attending to stimuli in the external environment
brain, body, environment
organization of long term memory
long term memory
splits into:
explicit/declarative (conscious) and implicit (unconscious)
explicit/declarative splits into:
semantic memory (facts, concepts) and episodic memory (events, experiences)
implicit splits into:
priming and procedural memory (skills, tasks)
explicit/declarative memory splits into…
- semantic memory (facts, concepts)
- episodic memory (events, experiences)
implicit memory splits into…
- priming
- procedural memory (skills, tasks)
semantic memory
refers to what you know
facts, concepts
how is semantic memory tied to language?
not necessarily tied to language, but intimately connected
language is a general organizing principle of memory
lexical semantic memory
memory of word meanings
study of semantic memory examines…
storage and retrieval
modern theories of semantics
based in experience
environment serves as model/constraints
2 branches of “based in experience” theories of semantics
- grounded/embodied theories: our perceptual world (and our brains, which are embodied) is used as our main info source to understand the world around us
- text-based machine learning
frontal lobe
language processing
emotional regulation
executive functioning
planning
organizing
memory
impulse control
problem solving
selective focus
decision making
behavioural control
temporal lobe
episodic memory
(involved in comprehension, storage and retrieval of memory)
hearing ability
- first area that processes speech info, turns it into a linguistic code
memory acquisition
some visual perceptions
categorization of objects
comprehension
memory retrieval
perisylvian region
area of brain responsible for language
composed of:
- primary auditory cortex
- wernicke’s area
- angular gyrus
- arcuate fasciculus
- primary motor cortex
- broca’s area
wernicke’s area
constructs rep of meaning for linguistic info
damage from stroke to this area = fluent/receptive aphasia
- loss of ability to understand and create meaningful language
- speech stays grammatically correct but carries incorrect meaning
broca’s area
responsible for linguistic production
damage from stroke to this area = non-fluent/productive aphasia
- loss of ability to produce fluent language
- but can still understand language
wernicke’s location
posterior temporal lobe
many connections to primary auditory cortex
heavily connected to Broca’s area
wernicke’s = important for…
storage and retrieval of word representations, meanings, grammar
broca’s location
posterior inferior frontal region
next to primary motor cortex (responsible for muscles used to produce speech)
sometimes called the motor speech area
arcuate fasciculus
connection between Wernicke’s and Broca’s area
important for BOTH phonological and lexical-semantic processing
early theory of semantic memory - devised by Collins & Quillian
hierarchical networks
hierarchical networks
Collins & Quillian
suggest our info in memory is organized hierarchically - can be repped by a tree
- superordinate at the top
- as you continue down the network, get more subordinate info
what kind of info is at the bottom of the tree in hierarchical networks?
actual instances of a category
if information is stored in the brain in the way suggested by hierarchical networks, then there should be a corresponding connection between…
network distance and the amount of time it takes you to verify connections between these properties
direct connections will be verified faster
think about it like walking from point to point
living thing: example of hierarchical network
living thing - connects via the propositions “is” and “can” to the properties “living” and “grow”
living thing - connects via the proposition “is a” to either:
1. plant
2. animal
plant - connects via “is a” to:
1. tree
2. flower
these eventually link to specific examples
- pine, oak, rose, daisy
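a minimal sketch of this hierarchy in code (node names and property placements assumed for illustration); retrieval is a walk up the “is a” links, so properties stored further up take more steps:

```python
# properties live at the highest node they apply to
isa = {"canary": "bird", "bird": "animal", "animal": "living thing"}
properties = {"canary": {"can sing"}, "bird": {"can fly"},
              "animal": {"has skin"}, "living thing": {"can grow"}}

def links_to_verify(concept, prop):
    steps = 0
    while concept is not None:
        if prop in properties.get(concept, set()):
            return steps                      # fewer links -> faster verification
        concept, steps = isa.get(concept), steps + 1
    return None                               # property never found

print(links_to_verify("canary", "can sing"))  # 0 links: direct
print(links_to_verify("canary", "has skin"))  # 2 links: slower
```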
how did Collins and Quillian test if the timing of their network in validating closeness of associations actually applies to human processes?
gave people a sentence that was true or false
had them say whether it was true or false
ie. ‘a canary can sing’, ‘a canary can fly’, ‘a canary has skin’
- looking at properties stored progressively higher up in the network
turns out that properties stored higher up take longer to validate
are Collins & Quillian’s findings supported in all categories?
no, not validated in all categories
a good first step, but not exhaustive
2 pieces of theoretical refinement: Smith, Shoben & Rips
- proposed that items can be repped as a SET OF FEATURES: each concept is described by a set of features that define it
- meaning can be described as a position in a geometric space (vectors)
vectors
look at how similar and different certain vectors are
use trigonometry to calculate the angles between different vectors
once you have the numerical similarity between the vectors, you can plot how they are distributed in space
vector cosine
calculated using trigonometry that examines angles between different vectors
will come up with value between 1 and -1
1 = the same (very similar)
-1 = opposite
multidimensional scaling
uses the vector cosines to place words in a 2D space
visually shows their similarity
more similar items will be closer to each other within the space
helps visualize how we connect things in our minds
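a minimal sketch of both steps (toy feature vectors assumed; sklearn’s MDS is one possible implementation):

```python
import numpy as np
from sklearn.manifold import MDS

words = ["canary", "robin", "daisy"]
vectors = np.array([[1.0, 0.9, 0.1],   # toy features, e.g. "flies", "sings", "has petals"
                    [1.0, 0.4, 0.0],
                    [0.0, 0.1, 1.0]])

def cosine(a, b):                      # angle-based similarity, from 1 down to -1
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim = np.array([[cosine(a, b) for b in vectors] for a in vectors])
coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(1 - sim)
print(coords)  # similar words land closer together in the 2D space
```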
what are features?
classical approaches propose that they are properties of categories
ie. features of cars: “has wheels”, “used for transportation”, “has doors”, “has an engine”
uninterpretable features: multidimensional scaling models
multidimensional scaling models don’t carry interpretable features
the locations of things in space don’t map onto features like “has wheels” or “has a door”
can’t say that location x in a matrix means that word y has a door
how do machine learning models construct features?
from text
not typically based in perceptual environment
some are interpretable, others are not
in neural networks, all info is distributed across…
the WIDTH of the network
if you damage the network, all information decays together (not like you just lose a chunk of it)
topic models
probabilistically matches words to features/topics (assigns a probability that a word has a given feature or not)
ie. a probability value that a certain word is a living thing, or is red, or can move etc.
good for information organization, can categorize info well
topic models are good at ______ but aren’t really used as a _______
good at information organization/categorization
but aren’t really used as a theory of cognition
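a minimal sketch using sklearn’s LDA, one common topic-model algorithm (toy corpus assumed; the lecture doesn’t name a specific implementation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["birds fly and sing", "flowers grow in soil", "canaries sing and fly"]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
# rows = topics, columns = words: probability-style weights used to organize info
print(lda.components_ / lda.components_.sum(axis=1, keepdims=True))
```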
Rogers and McClelland worked on what kind of model
neural network
basic idea behind Rogers and McClelland’s neural network
based off of an interest in how children acquire language
take propositions (sentences)
give the model sentences derived from a hierarchical representation network
give the model a word (canary) and proposition (can)
then have an output layer with all sorts of possible options
want model to produce certain options, and not produce others
ie. want it to produce ‘sing’, ‘grow’, ‘fly’ but not ‘swim’
if the model gets something wrong, it uses back-propagation to adjust the weights so that next time it’s less likely to make the same mistake
can do this because it’s a supervised network (we know what we want the network to produce, so we know when it’s wrong)
by end of training cycle, model produces the correct output
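a minimal sketch of such a supervised network (toy items, relations, and target attributes assumed; two weight layers adjusted by back-propagation):

```python
import numpy as np

rng = np.random.default_rng(0)
# input units: [canary, fish, "can", "is"]; output units: [grow, fly, swim]
X = np.array([[1, 0, 1, 0],    # canary + "can"
              [0, 1, 1, 0]])   # fish + "can"
Y = np.array([[1, 1, 0],       # canary can grow, fly -- but not swim
              [1, 0, 1]])      # fish can grow, swim -- but not fly

W1 = rng.normal(0, 0.5, (4, 8))            # input -> hidden weights
W2 = rng.normal(0, 0.5, (8, 3))            # hidden -> output weights
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):                      # many trials, small learning rate
    H = sigmoid(X @ W1)
    out = sigmoid(H @ W2)
    err = Y - out                          # supervised: we know the target
    d_out = err * out * (1 - out)
    W2 += 0.1 * H.T @ d_out                # back-propagate the error...
    W1 += 0.1 * X.T @ ((d_out @ W2.T) * H * (1 - H))  # ...to earlier weights

print(out.round(2))  # approaches the target pattern by end of training
```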
models of Collins & Quillian versus models of Rogers & McClelland
Collins & Quillian:
- hierarchical networks
Rogers & McClelland:
- neural networks
supervised networks
we know what we want the network to produce
so we know when it is wrong
allows for back-propagation/error-driven learning
ie. back-propagation networks like Rogers & McClelland’s are supervised
back-propagation
error-driven learning
possible in supervised networks
when we know the output that we want the model to produce
at first, the network will produce “noise” (the wrong things)
but since we know what we want it to produce, we can CHANGE THE CONNECTION WEIGHTS
so that next time it’s incrementally more likely to produce the correct activations
do this hundreds of thousands, millions of times
eventually the network will produce the right activation
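in standard textbook notation (not from the slides), each weight takes a small step downhill on the error:

```latex
% gradient-descent weight update used in back-propagation
\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}
% \eta = learning rate: kept small so each trial changes the weights only slightly
```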
error-driven learning is really just…
reinforcement learning
each arrow in a network…
reps a diff weight/numerical value
which is adjusted depending on how incorrect the network is
do we want a high or low learning rate?
low learning rate
so that small changes are made to each input
means that a lot of learning trials are required
generally must be trained multiple times on same corpus
other term for backpropagation
the backward pass
what comes out as the output is essentially just the…
most activated node in the output layer (after activation has flowed through the hidden layers)
2 main approaches to neural networks
- localist network:
- each node reps only one entity
- people tend to think these are neurologically implausible
- distributed representation:
- info is spread across the nodes
- instead of being confined to one node
- preferred, because more similar to the brain’s function
issue with the whole ‘input = output from many many hidden layers’ thing
results in a kind of black box model
what exactly is happening in the hidden layers is unclear
can’t “get into the head of the model” - can’t map it onto what humans do in experimental tasks
led to Bayesian models (back-propagation networks that feed into other back-propagation networks; each layer is trained separately, so you don’t have to go all the way back to the first layer)
three names associated with back-propagation
Rumelhart, Hinton & Williams (1986)
the trajectory of learning followed by Rogers and McClelland’s model maps onto…
learning trajectories of children as they acquire language
in the beginning, model produces noise (outputs are all equally likely, close together in 2D space)
but with training, they begin to split apart and are weighted differently (just like how kids begin to learn words)
closed versus open models
closed models:
- restricts the model to working with the training materials
- assumes all of the knowledge about the world = contained in the training materials
- allows for clarity in resulting explanation
open models:
- uses millions of samples
- noise is eventually reduced through greater levels of experience
- better ecological validity than closed models
Rogers and McClelland models = based on what assumption? open or closed networks?
based on the SIMPLIFICATION ASSUMPTION
they are closed networks
“the more detail we incorporate, the harder the model is to understand”
- think of the growing complexity and non-interpretability of ChatGPT
simplification assumption
linked to closed models
suggests when you’re training a model you should give it simple training data
because complicated materials make it unclear as to whether the model is succeeding/failing because of the quality of the data
simple data provides researchers with clarity regarding how good the model was
closed models and ecological validity
closed models have low ecological validity
not reflective of tasks that humans actually perform
language is very noisy, lots of info all the time
so using simple training materials doesn’t reflect the task that humans face when they’re learning
open models require _____ information
more
BEAGLE model on 300 versus 300 000 propositions
300 propositions = closed model
- only takes 300 trials to learn propositions
- can cluster info right away
- not error-driven
- presents sentences as more structured than they are in reality
300 000 propositions = open model
- derived from a large corpus of language
- takes much longer to train, about 300 000 trials
why does it take the larger BEAGLE model longer to learn?
because the learning corpus and the actual corpus are different (open model)
the actual corpus has more noise and nuance
therefore takes longer to settle and to produce the correct output
because open models learn from actual sentences, it takes more examples of info to come up with the correct structure
Current NLP Machine Learning Wars
people keep building bigger models, competing against each other
BERT, RoBERTa, GPT-2, T5, Turing NLG, GPT-3
GPT-3 is winning
NLP
natural language processing
is ChatGPT a good model for the brain?
not really
it contains way more info than the human brain does
not really an applicable model with which to assess human cognition
LLM
large language model
e.g. ChatGPT (OpenAI), and models from Facebook/Meta and Google
perceptual symbol systems
proposed by Barsalou as a general theory of cognition
classic view: amodal symbols in cognition
amodal systems have NO CONNECTION to perceptual environment
amodal systems have no connection to…
perceptual environment
amodal symbol system transduces a partial perceptual experience into a completely new representation language that is INHERENTLY NON-PERCEPTUAL
3 problems with amodal approach
- neurological evidence:
- findings show that damage to sensory-motor cortex impairs processing of certain modality-based categories (ie. birds)
- failure of transduction:
- no system can elegantly go from perception to symbols
- symbol grounding problem:
- how does the system know what it’s computing?
an alternative to amodal systems
neural representations
neural representations
not a physical copy of the perceptual experience
instead a RECORD OF THE NEURAL ACTIVATION that arises during perception
similar to representations of imagery
likely stored in CONVERGENCE ZONES: integrate info in sensory-motor maps to represent knowledge
never completely transduced; perceptual traces are reconstructed
8 examples of semantic memory tasks
many diff behaviours are studied
- word similarity
- false memory
- free association
- semantic priming
- verbal fluency
- sentence comprehension
- discourse comprehension
- feature judgments
semantic memory models: word similarity
most common type of data used for these models
used in model development and model evaluation
give people two words and get them to RATE HOW SIMILAR THEY ARE on a scale
collect ratings from people and average them
compare this number to computational model that’s also learning these words
semantic memory models: verbal fluency
used in more applied situations
ie. diagnosing conditions like alzheimer’s or schizophrenia
give people a category and ask them to generate as many things as possible from that category
compare the model’s output to the output of humans - e.g. see if a person’s responses fit the profile built from patients with schizophrenia
models and dementia
models can examine how language use changes prior to diagnosis
because they’re based on data from people in the years leading up to their diagnosis
can quantitatively see how their memory systems are changing
models = a tool to understand how the mem systems of people with dementia change over time
representation types: network models
words are connected within a semantic network
(ie. ‘release’ connects to ‘capture’ connects to ‘pirate’ connects to ‘sailor’ connects to ‘anchor’)
generate representation of each item based on the nodes they’re connected to
how are network models typically derived?
from free association data
give people a word (like ‘car’) and get them to generate words they associate with it
this is how they generate the semantic networks/network models
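a minimal sketch of such a network (edges assumed, echoing the release-capture-pirate chain above), with networkx as one possible tool:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("release", "capture"), ("capture", "pirate"),
                  ("pirate", "sailor"), ("sailor", "anchor")])

# a word's representation is based on the nodes it connects to
print(list(G.neighbors("pirate")))                      # ['capture', 'sailor']
print(nx.shortest_path_length(G, "release", "anchor"))  # 4 links apart
```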
Turk problems
issue with network models
explains human behaviours using other human behaviours
Turk problems arise when the representational input is derived directly from human behavioural data
COMPLEXITY OF THE MODEL = HIDDEN WITHIN THE REPRESENTATION
who coined the Turk problems?
Jones, Hills, Todd
are back-propagation models feature models?
yes!
features are the activation values of the hidden layer
activation of hidden layer can be used as featural rep of a word
important changes occurring in the 1990s-2000s that helped progress Big Data and Natural Language Processing
pre-1990s - didn’t have large enough language corpora to train models on
but with the internet, larger texts were gathered
2000s - further movement to digitize existing/old texts
large corpora of text brought in a different domain of modelling
COLLECTION OF LARGE TEXT HAS CHANGED HOW WE THINK ABOUT STUDYING LANGUAGE
large corpora has changed how we think about studying language…
now possible to PROPOSE LEARNING MECHANISMS and to TRAIN ON REALISTIC DATA
model can be “born” into a realistic language environment
we gain insights into cognition and language performance by examining how it learns/functions
T/F: virtual environments are approaching real world complexity levels
true
NLPs not only help us understand cognition and language performance, but also…
are powerful natural language processing tools
quantification of the natural language environment: Herbert Simon’s take
Herbert Simon said “the apparent complexity of our behaviour over time is largely a reflection of the complexity of the environment in which we find ourselves”
behaviour is adaptive: we shape our cognition to the requirements of our environment
- cognitive system is built such that we can change our behaviours to match the needs of our environment
classic goal in the cognitive sciences
quantification of the natural language environment
quantification of the natural language environment: William Estes’ take
William Estes stated that theories of behaviour should shift “the burden of explanation from hypothesized processes in the organism to statistical properties of environmental events”
saying we should look at how people are learning from the environment/responding to it
he was particularly interested in mathematical properties
distributional models
these types of models learn the meanings of words from the distribution of how they’re used in language
aka embedding models
learn meaning of words from co-occurrence statistics
first major distributional model
Landauer & Dumais (1997)
Latent Semantic Analysis model
Landauer and Dumais wanted to move away from existing algorithms, which were simply cued with specific words and returned the documents with the most word overlap
they wanted a more MEANING-BASED approach
- reduce the problems caused by polysemy
- introduce recognition of synonymous meanings
LSA works by…
- examining a large corpus of text
- extracting information about how words are used
- information is based on frequency usage for particular words
- build a vector that reps the meaning of the word in terms of its similarity to other words
- decompose the matrix into smaller number of features
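a minimal LSA-style sketch (toy corpus assumed; real LSA trains on large corpora and applies weighting steps not shown here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the canary can sing and fly",
        "the shark can swim",
        "a robin can sing and fly"]
X = CountVectorizer().fit_transform(docs)  # word-usage frequencies per document

svd = TruncatedSVD(n_components=2)         # decompose into fewer latent features
word_vectors = svd.fit_transform(X.T)      # rows = words, cols = features
print(word_vectors.shape)                  # each word now has a small feature vector
```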
is LSA error-free?
yes, there’s no error signal in the model’s learning (unlike neural networks)
simply accumulate information in memory and use that to drive the model
not using predictive process to hone the model’s learning - it treats each lexical experience equally
LSA: supervised or unsupervised?
unsupervised
just learns the structure of the dataset
4 things we need for distributional models
- input:
- corpus for the model to learn from
- processing:
- learning algorithm by which info is gleaned from the input, extracted and stored in memory
- memory:
- feature space - representation of where we keep info about the word’s meaning
- output:
- the task/problem
distributional models: processing/learning mechanism details
neural embedding models take a sentence
they sequentially activate each word on its own
want the model to predict the words that surround that word in that context
predictions = in the output layer
see if the predictions are correct
back-propagate to increase accuracy
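a minimal sketch of this predict-the-surrounding-words setup, using gensim’s skip-gram Word2Vec as one standard implementation (toy corpus assumed; real models see millions of sentences):

```python
from gensim.models import Word2Vec

sentences = [["canary", "can", "sing", "and", "fly"],
             ["robin", "can", "sing", "and", "fly"],
             ["shark", "can", "swim"]]

# sg=1 selects skip-gram: each word predicts its neighbours within the window,
# and prediction errors are back-propagated into the embedding weights
model = Word2Vec(sentences, vector_size=20, window=2, sg=1, min_count=1)
print(model.wv.similarity("canary", "robin"))  # cosine of the learned vectors
```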
problem with chatGPT
it’s too complex
too many layers - we don’t really know what’s happening
it’s a “black box”
Firth quote about word co-occurrences
“you shall know a word by the company it keeps”
context (source of text) for distributional models
many different possibilities
paragraphs, documents, books, authors etc.
when processing a sentence, distributional models pre-process. how?
pre-processing modifies the sentences/inputs to improve processing
- stop list
- subsampling
stop list
stop list of high frequency function words
any word included on the stop list is removed from the sentence
subsampling
first a frequency distribution is run (custom to the corpus in question)
creates a probability distribution - words with very high frequencies are probabilistically skipped
if you don’t use stop lists or subsampling to get rid of certain words, then…
the model is quickly overwhelmed
every single word will be understood to be similar to “the”
are there parallel processes to stop lists/subsampling in real people?
yes
eye tracking studies show that when people read a page, they generally skip function words
which is better? stop list or subsampling?
subsampling
gives you more control over what the model is processing
and it’s controlled by parameters
more training flexibility
example of sentence before and after stop list/subsampling
if the solvent is insoluble the mixture can be decanted
solvent insoluble mixture decanted
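a minimal sketch of both pre-processing steps applied to the example above (stop list and frequencies assumed; the discard formula is the common word2vec-style one, which may differ from the course’s):

```python
import math, random

stop_list = {"if", "the", "is", "can", "be"}  # high-frequency function words
freqs = {"solvent": 1e-5, "insoluble": 1e-6,  # assumed corpus frequencies
         "mixture": 1e-5, "decanted": 1e-7}

def preprocess(tokens, t=1e-4):
    kept = [w for w in tokens if w not in stop_list]  # stop list: remove outright
    # subsampling: probabilistically skip words with very high frequency
    return [w for w in kept
            if random.random() > max(0.0, 1 - math.sqrt(t / freqs.get(w, t)))]

print(preprocess("if the solvent is insoluble the mixture can be decanted".split()))
# -> ['solvent', 'insoluble', 'mixture', 'decanted']
```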
after pre-processing…
the remaining words are examined
specifically their co-occurrences with each other word in the corpus
each pair found increments the count in the matrix (strength increases with each pair found)
done word by word: find all the pairs for one word first, then move onto the next word…
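a minimal sketch of that pair counting (toy contexts assumed):

```python
from collections import defaultdict
from itertools import combinations

contexts = [["solvent", "insoluble", "mixture", "decanted"],
            ["solvent", "mixture", "filtered"]]

counts = defaultdict(int)
for context in contexts:
    for w1, w2 in combinations(sorted(set(context)), 2):
        counts[(w1, w2)] += 1          # strength grows with each pair found

print(counts[("mixture", "solvent")])  # 2: co-occurred in both contexts
```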
fundamental component of the processing in these distributional models…
similarity between words
typical similarity metric in distributional semantics
cosine
use a vector cosine: gives a value between 1 (very similar) and -1 (opposite); 0 = no similarity
value reflects how closely aligned the vectors are in the feature space
highly aligned in terms of featural reps = high similarity value
to determine if our model actually captures any semantic info…
we examine its performance on a word similarity task:
- get people to rate how similar a pair of words is on a scale
- get a set of values pertaining to the relations between words
- TAKE THE COSINE SIMILARITY OF EACH WORD PAIR (between 1 and -1)
TAKE THE CORRELATION between the cosine values the model has produced and the similarity values that people produce
use these values to see how similar the model’s and people’s results are
ideally you want a strong positive correlation
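a minimal sketch of that final comparison (ratings and cosines assumed for illustration; Spearman correlation is one common choice):

```python
from scipy.stats import spearmanr

human_ratings = [6.8, 5.9, 1.2, 3.4]      # averaged ratings for four word pairs
model_cosines = [0.82, 0.75, 0.05, 0.31]  # model's cosine for the same pairs

rho, p = spearmanr(human_ratings, model_cosines)
print(rho)  # ideally a strong positive correlation
```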