Phrase-Structure Parsing Flashcards
Syntactic analysis: constituents and how to identify them
A constituent, also called a phrase (NP, VP, PP, …), is a group of words that functions as a single unit within a hierarchical structure.
The constituent structure of a sentence is identified using constituency tests:
- General substitution: replace the test phrase with some other phrase of the same type; if the result is still grammatical, the sequence is likely a constituent
- Coordination: several constituents of the same type can be composed together using coordination
- Topicalization: if a sequence can be moved to a different location in the sentence without affecting grammaticality, it is likely to be a constituent
Types of constituents
There are several types of constituents, each characterised by the context in which they can occur and their internal structure.
TYPES:
- Sentence, abbreviated S, is a constituent representing a complete proposition or clause.
- Noun phrase, abbreviated NP
- Verb phrase, abbreviated VP
- Prepositional phrase, abbreviated PP
- Adjective phrase, abbreviated AP
Phrase structure, on what it depends
Phrase structures are tree-like representations used to describe a given language’s syntax, with:
- leaf nodes representing sentence words
- internal nodes representing word groupings called phrases
They use PoS tags (N, V, P, A, Det, …) and phrase tags (S, NP, VP, PP, AP, …)
Phrase structures depend on:
- the linear order of words in the sentence
- the groupings of words into constituents (phrases)
- the hierarchical relation between constituents
example: “Alice eats strawberries with chocolate”
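The tree for the example can be sketched as a nested structure. This is a minimal sketch, assuming one of the two plausible readings (the PP attached inside the object NP); the tuple encoding and the `bracket` helper are illustrative, not from any particular library.

```python
# Phrase-structure tree as nested tuples: (label, children...); leaves are words.
# Attachment chosen here: the PP modifies the object NP (illustrative choice).
tree = ("S",
        ("NP", ("N", "Alice")),
        ("VP",
         ("V", "eats"),
         ("NP",
          ("N", "strawberries"),
          ("PP", ("P", "with"), ("NP", ("N", "chocolate"))))))

def bracket(node):
    """Render a tree as a labelled bracketing, Penn-Treebank style."""
    if isinstance(node, str):          # leaf node: a sentence word
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"

print(bracket(tree))
# (S (NP (N Alice)) (VP (V eats) (NP (N strawberries) (PP (P with) (NP (N chocolate))))))
```

The printed bracketing makes both the word groupings (constituents) and their hierarchical relations explicit.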
Notions of head, argument and modifier
The head is the word in the phrase that is grammatically the most important.
The head identifies the phrase type: N is the head of an NP, V is the head of a VP, and so forth.
The head selects the arguments and modifiers appearing in the phrase:
- Arguments are inherent to the meaning of the phrase; they appear in a fixed number depending on the head’s semantics
- Modifiers are optional phrases which merely supplement the head with additional information; they can appear in any number
PP-attachment
PP attachment in NLP (Natural Language Processing) refers to the problem of determining the correct attachment site of a prepositional phrase in a sentence: for example, a PP following an object noun may modify either the noun or the verb.
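The ambiguity can be made concrete with the running example "Alice eats strawberries with chocolate". This sketch shows the two competing VP structures; the labels and tuple encoding are illustrative assumptions, not taken from any particular treebank.

```python
# Reading 1: the PP attaches to the object NP (strawberries that have chocolate).
noun_attach = ("VP", ("V", "eats"),
               ("NP", ("N", "strawberries"),
                ("PP", ("P", "with"), ("NP", ("N", "chocolate")))))

# Reading 2: the PP attaches to the VP (the eating is done with chocolate).
verb_attach = ("VP", ("V", "eats"),
               ("NP", ("N", "strawberries")),
               ("PP", ("P", "with"), ("NP", ("N", "chocolate"))))

def leaves(node):
    """Left-to-right sequence of leaves (the sentence words)."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

# Same word sequence, different constituent structure -> structural ambiguity.
print(leaves(noun_attach) == leaves(verb_attach))  # True
```

Both trees yield the identical string, which is exactly why a parser needs extra information (lexical or statistical) to choose between them.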
write examples at slide 20-23 pdf 8…
Wh-movement
In natural language processing (NLP), wh-movement is an important concept for understanding how questions are formed and how information is extracted from text.
Long-distance syntactic movement, also called wh-movement, can be represented in phrase structure by means of so-called traces.
write example at slide 24 pdf 8…
Treebanks
A phrase structure treebank is a parsed text corpus that annotates the syntactic structure of each sentence, resolving ambiguity.
Treebanks are used to train phrase structure grammars, that is, grammars that can model phrase structures.
The Penn Treebank was the first large-scale treebank. Published in the early 1990s, it revolutionised NLP.
Probabilistic CFG
A probabilistic context-free grammar (PCFG) defines a probability distribution over the set of parse trees generated by the grammar.
Suppose a sentence has several parse trees. If rule probabilities are estimated appropriately, we can resolve the ambiguity by selecting the parse tree with the highest probability.
Proper probabilistic CFG
For each rule A -> α, we specify the probability that the rule applies to A to produce α.
A PCFG is proper if for each A we have sum over α P(A -> α) = 1
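The properness condition is easy to check mechanically. A minimal sketch, assuming a toy grammar with illustrative rule probabilities:

```python
from collections import defaultdict

# Toy PCFG as (lhs, rhs, probability) triples; the probabilities are made up.
rules = [
    ("S",  ("NP", "VP"), 1.0),
    ("NP", ("Det", "N"), 0.7),
    ("NP", ("N",),       0.3),
    ("VP", ("V", "NP"),  0.6),
    ("VP", ("V",),       0.4),
]

def is_proper(rules, tol=1e-9):
    """Properness: for every lhs A, the probabilities of A's rules sum to 1."""
    totals = defaultdict(float)
    for lhs, _rhs, p in rules:
        totals[lhs] += p
    return all(abs(total - 1.0) < tol for total in totals.values())

print(is_proper(rules))  # True
```

Dropping any single rule would break the condition for its left-hand side, and `is_proper` would report it.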
Define the probability of a tree t
P(t) = prod (A->α) (P(A->α))^f(t, A->α)
where f(t, A->α) is the number of times the rule is used in t.
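The product formula can be computed directly by walking the tree and multiplying one probability per rule application. A sketch under assumptions: the tuple tree encoding and the rule probabilities below are illustrative.

```python
from math import prod

# Illustrative rule probabilities, keyed by (lhs, rhs).
probs = {
    ("S",  ("NP", "VP")):      1.0,
    ("NP", ("N",)):            0.3,
    ("VP", ("V", "NP")):       0.6,
    ("N",  ("Alice",)):        0.1,
    ("N",  ("strawberries",)): 0.05,
    ("V",  ("eats",)):         0.2,
}

def rule_uses(node):
    """Yield the (lhs, rhs) pair of every rule applied in the tree."""
    if isinstance(node, str):
        return
    label, *children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        yield from rule_uses(child)

def tree_prob(tree):
    """P(t) = product over rule applications of the rule's probability."""
    return prod(probs[r] for r in rule_uses(tree))

t = ("S", ("NP", ("N", "Alice")),
          ("VP", ("V", "eats"), ("NP", ("N", "strawberries"))))
print(tree_prob(t))  # 1.0 * 0.3 * 0.1 * 0.6 * 0.2 * 0.3 * 0.05
```

Note that a rule used more than once (here NP -> N) contributes its probability once per use, which is exactly the exponent f(t, A->α) in the formula.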
Probability of a string w generated by a CFG
Formally, let T(w) be the set of all parse trees of w. Then:
P(w) = sum (t in T(w)) P(t)
where P(t) is the probability of the tree t.
Consistent probabilistic CFG
A PCFG is consistent if sum over w P(w) = 1.
Consistency also means sum over t P(t) = 1 for parse trees t generated by the grammar.
Surprisingly enough, a proper PCFG is not always consistent: probability mass can be lost in infinitely deep parse trees that never produce a finite string.
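A standard illustration of this loss of mass (an assumed toy example, not from the source) is the proper grammar S -> S S with probability q and S -> a with probability 1 - q. The total probability of finite derivations is the least fixed point of x = (1 - q) + q·x², which for q > 1/2 equals (1 - q)/q < 1:

```python
def finite_mass(q, iters=10_000):
    """Least fixed point of x = (1 - q) + q * x**2, found by iteration from 0.
    This is the total probability the grammar assigns to finite trees."""
    x = 0.0
    for _ in range(iters):
        x = (1 - q) + q * x * x
    return x

print(round(finite_mass(0.4), 6))  # 1.0      -> consistent
print(round(finite_mass(0.6), 6))  # 0.666667 -> inconsistent, mass lost
```

Both grammars are proper (each S-rule pair sums to 1), yet only the one that favours termination is consistent.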
Lexicalised CFGs vs normal CFGs
WHY:
In our previous CFGs, each phrase records the head type but not the lexical content of the head. This makes the model insensitive to lexical selection, resulting in a loss in accuracy.
SO:
A lexicalised CFG includes information about the lexical content of a sentence, typically by annotating each nonterminal with the head word of the phrase it covers.
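Head-word annotation can be sketched as a percolation pass over the tree: each phrase takes its head word from a designated head child. The head table and tuple encoding below are simplified assumptions for illustration.

```python
# Simplified (assumed) head table: phrase label -> the child label that
# supplies the head word. Real head-finding rules are considerably richer.
HEAD_CHILD = {"S": "VP", "NP": "N", "VP": "V", "PP": "P"}

def lexicalise(node):
    """Return (annotated_tree, head_word), with every label rewritten
    as label(head)."""
    if len(node) == 2 and isinstance(node[1], str):   # preterminal (PoS, word)
        label, word = node
        return (f"{label}({word})", word), word
    label, *children = node
    lex_children, heads = [], {}
    for child in children:
        lex_child, head = lexicalise(child)
        lex_children.append(lex_child)
        heads[child[0]] = head
    head = heads[HEAD_CHILD[label]]                   # percolate head upward
    return (f"{label}({head})",) + tuple(lex_children), head

tree = ("S", ("NP", ("N", "Alice")),
             ("VP", ("V", "eats"), ("NP", ("N", "strawberries"))))
lex_tree, _ = lexicalise(tree)
print(lex_tree[0])  # S(eats)
```

After lexicalisation, rules mention words as well as categories (e.g. VP(eats) -> V(eats) NP(strawberries)), so the grammar can be sensitive to lexical selection, at the cost of many more distinct nonterminals.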