lecture 2 Flashcards
areas of linguistics
- phonetics
- phonology
- morphology
- syntax
- semantics
- pragmatics
phonetics
sounds of human language
phonology
sound systems in human languages
morphology
formation and internal structure of words
syntax
formation and internal structure of sentences
semantics
meaning of sentences
pragmatics
study of the way sentences with their semantic meanings are used for particular communicative goals
NLP and text
much of NLP focuses on text only, leaving out many layers of natural language
e.g., phonetics/phonology
natural language is
- compositional
- arbitrary
- creative
- displaced
compositional
the meaning of a sentence is built from the meanings of the individual words (semantics) and the way they are combined (syntax)
[set of rules that define grammaticality] + [lexicon of words that relate to the world we want to talk about]
meaning of an expression = semantics + syntax
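a minimal sketch of this idea (the lexicon and the SVO rule below are invented for illustration, not the lecture's formalism):

```python
# semantics: a lexicon mapping word forms to meanings
lexicon = {
    "alice": "ALICE", "bob": "BOB",   # entities
    "sees": "SEE", "likes": "LIKE",   # relations
}

# syntax: an SVO combination rule that says how the word meanings compose
def interpret_svo(sentence):
    """Interpret 'subject verb object' as predicate(subject, object)."""
    subj, verb, obj = sentence.lower().split()
    return (lexicon[verb], lexicon[subj], lexicon[obj])

print(interpret_svo("Alice sees Bob"))   # ('SEE', 'ALICE', 'BOB')
print(interpret_svo("Bob sees Alice"))   # same words, different meaning: the rule combines them differently
```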
arbitrary
the link between form and meaning is arbitrary
creative
every language can create an infinite number of possible new words and sentences
displaced
we can talk about things that are not immediately present
human natural language
- there is a critical period for acquiring language
- children need to receive real input to acquire language
- language is interconnected with other cognitive abilities
structure & grammar
- structure dictates how we can use language
- we implicitly know complex rules about structure.
- a community of speakers shares a rough consensus on these implicit rules; a grammar attempts to describe them
descriptive linguistics
how language is studied
focuses on describing how language is used in practice, without making judgments about correctness.
aims to objectively analyze and document rules that speakers naturally follow
prescriptive linguistics
how language is taught
prescribes rules about how language should be used
often involves enforcing traditional rules and norms, which may not reflect actual usage
language rules in education (grammar)
the rules taught as part of language education often serve purposes beyond describing the language
they often reflect social, cultural, and political influences
grammaticality
a community of speakers shares a rough consensus on their implicit rules.
- all utterances we can generate from these rules are grammatical
- if we cannot produce an utterance using these rules, it's ungrammatical
- SVO order
- subject & object pronouns
- sentences can be grammatically correct without any meaning
- idiolects
grammaticality rules can accept meaningless utterances and block out utterances that would still communicate (see the toy grammar sketch below).
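a toy grammar along these lines, written with nltk's CFG as one possible encoding (the rules and vocabulary are invented for illustration):

```python
import nltk

# a few implicit rules made explicit: SVO order, distinct subject/object pronouns
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> ProSubj | Adj NP | N
    VP -> V ObjNP | V
    ObjNP -> ProObj | N
    ProSubj -> 'she'
    ProObj  -> 'her'
    Adj -> 'colorless' | 'green'
    N   -> 'ideas'
    V   -> 'sees' | 'sleep'
""")
parser = nltk.ChartParser(grammar)

def grammatical(sentence):
    """True if the rules can generate the sentence at all."""
    return any(parser.parse(sentence.split()))

print(grammatical("she sees her"))                 # True: SVO with correct pronoun cases
print(grammatical("her sees she"))                 # False: pronoun cases swapped
print(grammatical("colorless green ideas sleep"))  # True: grammatical yet meaningless
```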
why do we need rules?
- if we ignore rules because we know what was probably intended, we actually limit possibilities
- rules give us expressivity
NLP before self-supervised learning
the way to approach NLP was through understanding the human language system, and trying to imitate it (rule-based)
- probing
- reverse engineering
probing
small supervised classifiers (probes) trained to extract linguistic information from another model's internal representations
this helps understand how well different layers of an LLM capture various linguistic features
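a minimal probing sketch with scikit-learn; the arrays here are random placeholders standing in for real LM activations and gold POS labels (an assumption, so the sketch runs on its own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 64))   # placeholder: one layer's representation per token
pos_tags = rng.integers(0, 5, size=500)      # placeholder: gold POS label per token

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_tags, random_state=0)

probe = LogisticRegression(max_iter=1000)    # deliberately small and simple
probe.fit(X_train, y_train)

# with real activations, high accuracy suggests the probed layer encodes POS information
print(probe.score(X_test, y_test))
```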
reverse engineering language
- syntax: parse the input to understand its grammatical structure
- semantics: interpret meaning of the parsed input
- discourse: understand the broader context and relationships between sentences
the process involves using theories to inform the design of NLP models, ensuring they can parse, understand, and generate human language
testing an LLM's understanding of syntax
- jabberwocky sentences
- learning to apply grammatical rules from vast amounts of text data
- word order
- lexical generalization
jabberwocky sentences
testing whether language models can understand and represent syntactic structures, even with nonsensical or novel words
to see if a model’s latent space encodes structural information
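a rough sketch of how such test items can be built: keep the syntactic frame and the function words, and fill the content-word slots with pseudowords (the word lists and frame are invented for illustration):

```python
import random

pseudo_nouns = ["tove", "borogove", "wabe"]
pseudo_verbs = ["gyred", "gimbled", "outgrabe"]
pseudo_adjs  = ["slithy", "mimsy", "frumious"]

def jabberwocky(frame):
    """Fill a POS-tagged frame with nonsense words, preserving the structure."""
    slots = {"N": pseudo_nouns, "V": pseudo_verbs, "ADJ": pseudo_adjs}
    return " ".join(random.choice(slots[w]) if w in slots else w
                    for w in frame.split())

# the frame keeps SVO structure while every content word is novel
print(jabberwocky("the ADJ N V the N"))   # e.g. 'the slithy tove gyred the wabe'
```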
word order
word order determines which word is the subject and which is the object
a model’s latent space can represent syntactic roles based on the positions of words in the space
LLMs are often insensitive to word order, which affects their ability to grasp syntactic roles
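a made-up toy example of what is at stake: a representation that throws away word order (here a bag-of-words average over invented vectors) cannot tell subject from object:

```python
import numpy as np

vec = {"dog":   np.array([1.0, 0.0]),
       "bites": np.array([0.0, 1.0]),
       "man":   np.array([0.5, 0.5])}

def order_free(sentence):
    """Average the word vectors; positional information is discarded."""
    return np.mean([vec[w] for w in sentence.split()], axis=0)

print(np.allclose(order_free("dog bites man"),
                  order_free("man bites dog")))   # True: who bit whom is lost
```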
lexical generalization
task: semantic interpretation
testing how models handle novel words within familiar syntactic structures
–> i.e., whether models can generalize learned structures
COGS measure
generalization is hard for seq2seq models, not as hard for models with structure built in
–> structure-aware models are therefore better
COGS measure
tests models on their ability to generalize compositions of words and structures, by ensuring words/structures are different between the training and test set
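a toy COGS-style split (sentences and logical forms invented for illustration): a word is seen only as a subject during training and only as an object at test time, so a model must generalize the word to a new structural position:

```python
train = [
    ("the hedgehog ate the cake", "eat(hedgehog, cake)"),
    ("the girl saw the cake",     "see(girl, cake)"),
]
test = [
    ("the girl saw the hedgehog", "see(girl, hedgehog)"),   # unseen word/position pairing
]

# sanity check: 'hedgehog' never occurs in object position in the training data
assert all(not form.rstrip(")").endswith("hedgehog") for _, form in train)
```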
why is NLP hard
- ambiguity
- sparse data due to Zipf's law
- variation
- expressivity
- context dependence
- unknown representation
- spoken & grounded
ambiguity
language has ambiguity at many levels
–> word senses, POS, syntactic structure, quantifier scope, multiple meanings
solution:
1. non-probabilistic methods: return all possible analyses
2. probabilistic models: return the most probable analysis
–> ‘best’ is only good if our probabilities are accurate
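a toy illustration of the two strategies, using the classic PP-attachment ambiguity (the parses and probabilities are made up):

```python
analyses = {
    "[I [saw [the man [with the telescope]]]]": 0.3,   # the man has the telescope
    "[I [saw [the man]] [with the telescope]]": 0.7,   # I used the telescope to see him
}

all_parses = list(analyses)                    # non-probabilistic: return every analysis
best_parse = max(analyses, key=analyses.get)   # probabilistic: return the most probable one
print(best_parse)                              # only as good as the learned probabilities
```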
statistical NLP
- typically more robust than earlier rule-based methods
- relevant statistics/probabilities are learned from data
- normally requires lots of data about any particular phenomenon
sparse data due to Zipf's law
- rank-frequency distribution is an inverse relation
- f * r = k (frequency times rank is roughly constant)
- a small number of words are very common, while the majority of words are rare, making it difficult for models to learn effectively from limited instances
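a quick way to see this on any plain-text corpus ('corpus.txt' is a placeholder filename):

```python
from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(words).most_common()

# Zipf's law: frequency times rank stays roughly constant (f * r = k)
for rank in (1, 10, 100, 1000):
    if rank <= len(counts):
        word, f = counts[rank - 1]
        print(f"rank {rank:>4}  {word:<15}  f*r = {f * rank}")

# the long tail behind the sparse-data problem: words that occur only once
print(sum(1 for _, f in counts if f == 1), "hapax legomena out of", len(counts), "word types")
```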
variation
- POS taggers are trained on formal language, which makes them hard to use on informal language such as social media text
- different contexts, vocabulary, grammatical structures, and varieties in language can reduce the tagger’s effectiveness
expressivity
- one form can have different meanings
–> e.g., bank
- the same meaning can have different forms
–> e.g., ‘some kids popped by’ vs ‘some children visited’
context dependence + unknown representation
correct interpretation is context-dependent and often requires world knowledge
–> e.g., interpretation of ‘he dropped the ball’
the role of meaning in linguistic structure
the meaning of words contributes to the overall structure and coherence of language
how models are trained
- input with rich semantic embedding
- add positions to embeddings
- masked self-attention
- feed-forward layers
- linear transformations
- softmax
- probabilities
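a stripped-down numpy sketch of this pipeline (single head, random weights, no layer norm; the sizes are arbitrary assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

vocab, seq_len, d = 10, 4, 8
rng = np.random.default_rng(0)
E   = rng.normal(size=(vocab, d))              # token embedding matrix
pos = rng.normal(size=(seq_len, d))            # position embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
W_out = rng.normal(size=(d, vocab))            # final linear projection to the vocabulary

tokens = np.array([1, 5, 2, 7])

# input with rich semantic embedding + add positions to embeddings
x = E[tokens] + pos

# masked self-attention: each position attends only to itself and the past
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores[np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)] = -np.inf
x = x + softmax(scores) @ v                    # residual connection

# feed-forward layers
x = x + np.maximum(0, x @ W1) @ W2

# linear transformation, softmax, probabilities
probs = softmax(x @ W_out)
print(probs.shape, probs[-1].sum())            # (4, 10), next-token distribution sums to 1
```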
differential object marking
different languages handle syntactic representation of objects within a sentence in different ways
understanding differential object marking is crucial for NLP systems to accurately process and generate language representations across diverse languages
LMs are aware of these gradations; animacy influences this grammatical distinction
–> animate entities are more likely to be subjects (agents)
maybe not all structure-word combinations are possible
making less plausible situations more explicit is a common feature of grammatical structure
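one common way to probe this is with minimal pairs: compare how a language model scores a sentence with and without the object marker, for animate vs inanimate objects. the sketch below is only an illustration; the model name is a placeholder and the Spanish examples (with the personal ‘a’) are invented:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-spanish-causal-lm"   # placeholder: substitute any Spanish-capable causal LM
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

def avg_logprob(sentence):
    """Average per-token log-probability of the sentence under the LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)       # loss = mean negative log-likelihood
    return -out.loss.item()

# animate object: the marked form should score higher if the LM tracks the gradation
print(avg_logprob("Ayer vi a la doctora."), avg_logprob("Ayer vi la doctora."))
# inanimate object: the unmarked form should score higher
print(avg_logprob("Ayer vi la película."), avg_logprob("Ayer vi a la película."))
```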
meaning is not always compositional
- idioms and metaphors: meaning cannot be directly inferred from the words themselves
- we’re constantly using constructions that we couldn’t get from just a syntactic + semantic parse
i.e., language understanding requires more than just analyzing individual words and their immediate syntax
balancing surface-level memorization and deeper understanding within NLP models
high-dimensional spaces are much better at capturing specific subtleties than any rules we could come up with