lecture 2 Flashcards

1
Q

areas of linguistics

A
  1. phonetics
  2. phonology
  3. morphology
  4. syntax
  5. semantics
  6. pragmatics
2
Q

phonetics

A

sounds of human language

3
Q

phonology

A

sound systems in human languages

4
Q

morphology

A

formation and internal structure of words

5
Q

syntax

A

formation and internal structure of sentences

6
Q

semantics

A

meaning of sentences

7
Q

pragmatics

A

the study of how sentences, with their semantic meanings, are used for particular communicative goals

8
Q

NLP and text

A

much of NLP focuses on text only, leaving out many layers of natural language

e.g., phonetics/phonology

9
Q

natural language is

A
  1. compositional
  2. arbitrary
  3. creative
  4. displaced
10
Q

compositional

A

the meaning of a sentence is the sum of the meanings of its individual words (semantics) and how they are combined (syntax)

[set of rules that define grammaticality] + [lexicon of words that relate to the world we want to talk about]

meaning of an expression = semantics + syntax
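
as a toy illustration (the lexicon entries and the single combination rule below are made up for the example), a short Python sketch of how a sentence meaning can be built from word meanings plus the rule that combines them:

```python
# toy lexicon: the "semantics" of individual words (made-up meanings)
lexicon = {
    "dogs": "DOG",
    "cats": "CAT",
    "chase": lambda subj, obj: ("CHASE", subj, obj),
}

# one combination rule: the "syntax" that says how word meanings combine
def interpret_svo(subject, verb, obj):
    return lexicon[verb](lexicon[subject], lexicon[obj])

print(interpret_svo("dogs", "chase", "cats"))  # ('CHASE', 'DOG', 'CAT')
print(interpret_svo("cats", "chase", "dogs"))  # same words, different combination, different meaning
```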

11
Q

arbitrary

A

the link between form and meaning is arbitrary

12
Q

creative

A

every language can create an infinite number of possible new words and sentences

13
Q

displaced

A

we can talk about things that are not immediately present

14
Q

human natural language

A
  1. there is a critical period for acquiring language
  2. children need to receive real input to acquire language
  3. language is interconnected with other cognitive abilities
15
Q

structure & grammar

A
  1. structure dictates how we can use language
  2. we implicitly know complex rules about structure.
  3. a community of speakers shares a rough consensus on its implicit rules. a grammar attempts to describe these rules.
16
Q

descriptive linguistics

A

how language is studied

focuses on describing how language is used in practice, without making judgments about correctness.

aims to objectively analyze and document rules that speakers naturally follow

17
Q

prescriptive linguistics

A

how language is taught

prescribes rules about how language should be used

often involves enforcing traditional rules and norms, which may not reflect actual usage

18
Q

language rules in education (grammar)

A

the rules taught as part of language education often serve purposes beyond describing the language

they often reflect social, cultural, and political influences

19
Q

grammaticality

A

a community of speakers shares a rough consensus on its implicit rules.

  • all utterances we can generate from these rules are grammatical
  • if we cannot produce an utterance using these rules, it's ungrammatical
  1. SVO order
  2. subject & object pronouns
  3. sentences can be grammatically correct without any meaning
  4. idiolects
20
Q

grammaticality rules accept useless utterances and block out communicative utterances.

why do we need rules?

A
  • if we ignore rules because we know what was probably intended, we actually limit possibilities
  • rules give us expressivity
21
Q

NLP before self-supervised learning

A

the way to approach NLP was through understanding the human language system, and trying to imitate it (rule-based)

  • probing
  • reverse engineering
22
Q

probing

A

small diagnostic models trained to extract linguistic information from another model's internal representations

this helps us understand how well different layers of an LLM capture various linguistic features
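
a minimal sketch of the common setup where the probe is a small classifier fit on activations from one layer of a language model; the arrays below are random placeholders standing in for real activations and gold labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# placeholder data standing in for real extracted activations and gold labels
hidden_states = np.random.randn(1000, 768)        # activations from one LM layer
pos_labels = np.random.randint(0, 12, size=1000)  # e.g., part-of-speech tag ids

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)          # the small probe model
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```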

23
Q

reverse engineering language

A
  1. syntax: parse the input to understand its grammatical structure
  2. semantics: interpret meaning of the parsed input
  3. discourse: understand the broader context and relationships between sentences

the process involves using theories to inform the design of NLP models, ensuring they can parse, understand, and generate human language
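
as a small sketch of the syntax step (assuming spaCy and its small English model are installed), a dependency parse exposes the grammatical structure that the semantic and discourse steps would build on:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The linguist parsed the sentence quickly.")
for token in doc:
    print(token.text, token.dep_, token.head.text)  # word, grammatical relation, head word
```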

24
Q

testing an LLM's understanding of syntax

A
  1. jabberwocky sentences
  2. learning to apply grammatical rules from vast amounts of text data
  3. word order
  4. lexical generalization
25
Q

jabberwocky sentences

A

testing whether language models can understand and represent syntactic structures, even with nonsensical or novel words

to see if a model’s latent space encodes structural information
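
a hedged sketch of such a test (the model choice and the sentence are only illustrative): because the content words are nonsense, any reasonable prediction for the masked slot has to come from the syntactic frame rather than from lexical meaning:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("the slithy toves did [MASK] and gimble in the wabe")[:3]:
    print(pred["token_str"], round(pred["score"], 3))  # top candidates for the masked slot
```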

26
Q

word order

A

word order determines what is the subject and what is the object of a sentence

a model’s latent space can represent syntactic roles based on the positions of words in the space

LLMs often don't care about word order, which affects their ability to grasp syntactic roles
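
one way to poke at this, sketched below with an illustrative model choice (gpt2): compare a causal LM's average token loss on a sentence and on the same words scrambled; a model sensitive to word order should find the scrambled version much less likely (higher loss):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_loss(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()  # average next-token loss

print(avg_loss("the dog chased the cat"))
print(avg_loss("cat the chased dog the"))  # same words, scrambled order
```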

27
Q

lexical generalization

A

task: semantic interpretation

testing how models handle novel words within familiar syntactic structures
–> i.e., whether models can generalize learned structures

COGS measure

generalization is hard for seq2seq models, not as hard for models with structure built in
–> structure-aware models are therefore better

28
Q

COGS measure

A

tests models on their ability to generalize compositions of words and structures, by ensuring words/structures are different between the training and test set

29
Q

why is NLP hard

A
  1. ambiguity
  2. sparse data due to zipf’s law
  3. variation
  4. expressivity
  5. context dependence
  6. unknown representation
  7. spoken & grounded
30
Q

ambiguity

A

language has ambiguity at many levels
–> word senses, POS, syntactic structure, quantifier scope, multiple meanings

solution:
1. non-probabilistic methods: return all possible analyses
2. probabilistic models: return the best possible analysis
–> ‘best’ is only good if our probabilities are accurate
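
a small sketch of the non-probabilistic 'return all analyses' option, using an NLTK chart parser with a toy grammar; the classic PP-attachment ambiguity yields two parse trees:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP | 'I'
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)  # two trees: the PP attaches to the VP or to the NP
```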

31
Q

statistical NLP

A
  1. typically more robust than earlier rule-based methods
  2. relevant statistics/probabilities are learned from data
  3. normally requires lots of data about any particular phenomenon
32
Q

sparse data due to zipf’s law

A
  • rank-frequency distribution is an inverse relation
  • f * r = k
  • a small number of words are very common, while the majority of words are rare, making it difficult for models to learn effectively from limited instances (illustrated in the sketch below)
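
a quick illustration of the rank-frequency relation on a toy word list (a real corpus shows the pattern far more convincingly):

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(corpus)
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(rank, word, freq, rank * freq)  # rank * frequency stays roughly constant
```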
33
Q

variation

A
  • POS taggers are trained on formal language, which makes them hard to use on informal language such as that seen on social media
  • different contexts, vocabulary, grammatical structures, and varieties in language can reduce a tagger's effectiveness (see the sketch below)
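
a small sketch (assuming NLTK with its tokenizer and tagger data installed): running the same tagger on formal and on social-media-style text makes it easy to inspect where nonstandard tokens throw it off:

```python
import nltk

formal = "The children visited their grandmother yesterday."
informal = "omg the kidz totally popped by lol"
for text in (formal, informal):
    print(nltk.pos_tag(nltk.word_tokenize(text)))  # compare tags on standard vs nonstandard tokens
```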
34
Q

expressivity

A
  • one form can have different meanings
    –> e.g., bank
  • the same meaning can have different forms
    –> e.g., ‘some kids popped by’ vs ‘some children visited’
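
the 'one form, many meanings' half is easy to see with WordNet (assuming the NLTK WordNet data is downloaded): 'bank' resolves to several distinct senses:

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:5]:
    print(synset.name(), "-", synset.definition())  # several unrelated senses of the same form
```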
35
Q

context dependence + unknown representation

A

correct interpretation is context-dependent and often requires world knowledge

–> e.g., interpretation of ‘he dropped the ball’

36
Q

the role of meaning in linguistic structure

A

the meaning of words contributes to the overall structure and coherence of language

37
Q

how models are trained

A
  1. input with rich semantic embeddings
  2. add positional information to the embeddings
  3. masked self-attention
  4. feed-forward layers
  5. linear transformation
  6. softmax
  7. probabilities

a minimal sketch of these steps is given below
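
a minimal numpy sketch of one forward pass through these steps; all weights and dimensions are random placeholders, and a real model adds residual connections, layer norm, multiple heads, and many stacked layers:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, vocab = 4, 8, 20

x = np.random.randn(seq_len, d_model)              # 1. token embeddings
x = x + 0.01 * np.random.randn(seq_len, d_model)   # 2. add positional information

# 3. masked self-attention: each position attends only to itself and earlier positions
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d_model)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -1e9
x = softmax(scores) @ v

# 4. feed-forward layer
W1 = np.random.randn(d_model, 4 * d_model)
W2 = np.random.randn(4 * d_model, d_model)
x = np.maximum(0, x @ W1) @ W2

# 5.-7. linear transformation to vocabulary scores, then softmax gives probabilities
W_out = np.random.randn(d_model, vocab)
probs = softmax(x @ W_out)
print(probs.shape, probs.sum(axis=-1))             # (4, 20), each row sums to 1
```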
38
Q

differential object marking

A

different languages handle syntactic representation of objects within a sentence in different ways

understanding differential object marking is crucial for NLP systems to accurately process and generate language representations across diverse languages

LMs are aware of these gradations, but animacy influences this grammatical distinction
–> animate entities are more likely to be subjects (agents)

39
Q

maybe not all structure-word combinations are possible

A

making less plausible situations more explicit is a common feature of grammatical structure

40
Q

meaning is not always compositional

A
  1. idioms and metaphors: meaning cannot be directly inferred from the words themselves
  2. we’re constantly using constructions that we couldn’t get from just a syntactic + semantic parse

i.e., language understanding requires more than just analyzing individual words and their immediate syntax

41
Q

balancing surface-level memorization and deeper understanding within NLP models

A

high dimensional spaces are much better at capturing specific subtleties than any rules we could come up with