Dependency Parsing Flashcards
Dependency parsing
Dependency parsing is a Natural Language Processing (NLP) task that analyzes the grammatical structure of a sentence by identifying head-dependent relations between its words.
Grammatical functions, 2 types
Grammatical functions provide the linguistic basis for the head-dependency relations in dependency structures.
Grammatical functions can be broken into two types:
- syntactic roles with respect to a predicate (often a verb); in this case the dependent is called an argument (e.g. subject, direct object, indirect object)
- functions that describe the ways in which words can modify their heads; in this case the dependent is called a modifier (e.g. nominal modifier, determiner, adverbial modifier, case marker)
Universal Dependencies (UD)
The Universal Dependencies (UD) project is an inventory of dependency relations that are linguistically motivated and cross-linguistically applicable.
Projective and non-projective trees
An arc from a head to a dependent is said to be projective if there is a path from the head to every word that lies between the head and the dependent in the sentence. (think of the example at slide 13 pdf 10)
A dependency tree is said to be projective if all the arcs are projective. Otherwise, the tree is non-projective.
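A small sketch that checks this definition directly (input format and function names are my own, not from the slides): an arc is projective iff every word strictly between the head and the dependent is reachable from the head.

```python
# Minimal projectivity check, assuming the tree is given as a dict mapping each
# dependent position to its head position (1-based positions, 0 = artificial root).
def is_projective(heads):
    def reachable_from(h):
        # collect all positions that lie below h in the tree
        result, frontier = set(), [h]
        while frontier:
            node = frontier.pop()
            for dep, head in heads.items():
                if head == node and dep not in result:
                    result.add(dep)
                    frontier.append(dep)
        return result

    for dep, head in heads.items():
        if head == 0:
            continue
        lo, hi = sorted((head, dep))
        covered = reachable_from(head) | {head}
        # every word strictly between head and dependent must be reachable from the head
        if any(w not in covered for w in range(lo + 1, hi)):
            return False
    return True

# projective chain: each word's head is the next word, word 3 is the root
print(is_projective({1: 2, 2: 3, 3: 0}))  # True
```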
Relation between phrase-structure trees and dependency trees
Phrase-structure trees and dependency trees are two types of tree structures that are commonly used to represent the grammatical structure of sentences.
- Phrase-structure trees, also known as constituency trees, represent the hierarchical organization of words into phrases
- Dependency trees, on the other hand, represent the grammatical structure directly as head-dependent relations between words
The 2 subtasks of the conversion process from phrase structure to dependency structure
- in a phrase structure tree, mark the head element at each node; this can be done using hand-written, deterministic rules
- construct a dependency from the head of each node to the head of each child node that does not inherit the head (a code sketch of this step follows below)
example at slide 18 pdf 10…
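A sketch of the second subtask, assuming the first one has already marked the head child of every internal node (class and field names are mine):

```python
# Step 2 of the conversion: for every internal node, add a dependency from its
# head word to the head word of each child that does not pass its head up.
class Node:
    def __init__(self, label, children=None, head_child=None, word=None):
        self.label = label            # e.g. "S", "NP", "VP"
        self.children = children or []
        self.head_child = head_child  # index of the child marked as head (step 1)
        self.word = word              # set for leaves only

def head_word(node):
    """Return the lexical head of a (sub)tree by following the marked head children."""
    if not node.children:
        return node.word
    return head_word(node.children[node.head_child])

def extract_dependencies(node, deps):
    """Append (head_word, dependent_word) pairs for children that do not inherit the head."""
    if not node.children:
        return
    h = head_word(node)
    for i, child in enumerate(node.children):
        if i != node.head_child:
            deps.append((h, head_word(child)))
        extract_dependencies(child, deps)

# tiny example: S -> NP("John") VP("sleeps"), with VP marked as the head child of S
leaf = lambda w: Node(w, word=w)
tree = Node("S",
            children=[Node("NP", [leaf("John")], head_child=0),
                      Node("VP", [leaf("sleeps")], head_child=0)],
            head_child=1)
deps = []
extract_dependencies(tree, deps)
print(deps)  # [('sleeps', 'John')]
```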
Transition-based dependency: configuration of the parser
A configuration of the parser consists of a stack (used as working memory), an input buffer (initialized with the input string), and a set of dependencies constructed so far (partial tree).
The parser is non-deterministic, and an oracle is used to drive the search.
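A minimal way to represent such a configuration in code (a sketch; field names are mine):

```python
from dataclasses import dataclass, field

@dataclass
class Configuration:
    stack: list = field(default_factory=list)    # working memory (word positions)
    buffer: list = field(default_factory=list)   # remaining input words
    deps: set = field(default_factory=set)       # (head, dependent) arcs built so far

def initial_configuration(words):
    # the buffer starts with the whole input string; stack and arc set start empty
    return Configuration(stack=[], buffer=list(words), deps=set())
```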
Arc-standard parser
We consider the arc-standard model (for projective parsing), using three transition operators:
- shift: remove the first buffer element and push it onto the stack
- leftArc: pop the second top-most stack element and attach it as a dependent of the top-most element
- rightArc: pop the top-most stack element and attach it as a dependent of the second top-most element
The parser uses a greedy strategy: the oracle provides a single choice at each step, and alternative options are dropped. The oracle takes as input a parser configuration and returns a transition operator (shift, leftArc, rightArc).
example slide 24 pdf 10
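A self-contained sketch of the three operators and the greedy loop (precondition checks, e.g. stack size ≥ 2 for the arc operators, are omitted; names are mine):

```python
# Arc-standard transitions on a configuration (stack, buffer, deps).
def shift(stack, buffer, deps):
    stack.append(buffer.pop(0))        # move the first buffer element onto the stack

def left_arc(stack, buffer, deps):
    dependent = stack.pop(-2)          # second top-most stack element
    deps.add((stack[-1], dependent))   # attach it as dependent of the top-most element

def right_arc(stack, buffer, deps):
    dependent = stack.pop()            # top-most stack element
    deps.add((stack[-1], dependent))   # attach it as dependent of the new top element

# greedy parsing loop: the oracle returns exactly one of the three operators per step
def parse(words, oracle):
    stack, buffer, deps = [], list(words), set()
    while buffer or len(stack) > 1:
        op = oracle(stack, buffer, deps)
        op(stack, buffer, deps)
    return deps
```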
Spurious ambiguity in the Arc-standard parser
Several transition sequences may lead to the same parse tree.
Generating training data for the Arc-standard parser
Given a configuration c and a reference tree t, a training instance is produced as follows (a code sketch of this oracle follows below):
- choose leftArc if on c it produces a dependency in t
- otherwise, choose rightArc if:
  - on c it produces a dependency in t
  - all of the dependents of the word at the top of the stack in c have already been assigned
- otherwise, choose shift
The set of training instances is then the sequence of (configuration, transition) pairs obtained by running this procedure against the reference trees.
We give precedence to left attachment, which resolves the spurious ambiguity (rightArc transitions are performed as late as possible).
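A sketch of this rule-based (static) oracle; gold is the set of (head, dependent) arcs of the reference tree t, and built is the set of arcs already constructed in c (names are mine):

```python
def oracle_transition(stack, buffer, built, gold):
    """Return the gold transition for configuration (stack, buffer, built arcs)."""
    if len(stack) >= 2:
        top, second = stack[-1], stack[-2]
        # prefer leftArc whenever it produces an arc of the reference tree
        if (top, second) in gold:
            return "leftArc"
        # rightArc only if the arc is in the tree AND the top word has already
        # received all of its own dependents (otherwise they could never attach)
        if (second, top) in gold:
            deps_of_top = {d for (h, d) in gold if h == top}
            if deps_of_top <= {d for (h, d) in built}:
                return "rightArc"
    return "shift"
```

Running this oracle at every step while parsing a treebank sentence yields the (configuration, transition) training pairs.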
Dependency parsing: feature functions and feature templates
Feature extraction for machine learning models is done by defining some feature function f(c, operator) that provides a vector of feature values.
Feature functions are usually specified by means of feature templates, defined as pairs of the form location.property.
Possible locations are:
- the stack s
- the buffer b
- the set of dependency relations r
Useful properties are:
- the word form w
- the lemma l
- the POS tag t
To avoid data sparseness, we focus on:
- the top levels of the stack
- the words near the front of the buffer
- the dependency relations already associated with the above elements
e.g. s1.w is the word form of the topmost stack element
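For concreteness, a sketch that instantiates a few such templates (s1.w, s1.t, s2.w, s2.t, b1.w, b1.t) as sparse "template=value" features; the function name and input format are my own:

```python
def extract_features(stack, buffer, words, pos_tags):
    """Instantiate a few first-order templates as 'template=value' strings."""
    feats = []
    if len(stack) >= 1:
        s1 = stack[-1]
        feats += [f"s1.w={words[s1]}", f"s1.t={pos_tags[s1]}"]
    if len(stack) >= 2:
        s2 = stack[-2]
        feats += [f"s2.w={words[s2]}", f"s2.t={pos_tags[s2]}"]
    if buffer:
        b1 = buffer[0]
        feats += [f"b1.w={words[b1]}", f"b1.t={pos_tags[b1]}"]
    return feats
```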
Second order feature templates
Since leftArc and rightArc operate on the top two elements of the stack, feature templates that combine properties from these two positions are also very useful; these are called second-order feature templates.
give an example (hint: f3107(c, op) = I(…))
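A purely illustrative (made-up) instance: f3107(c, op) = 1 if s1.t = VERB and s2.t = NOUN and op = leftArc, and 0 otherwise, i.e. the indicator of one particular combination of the POS tags of the top two stack elements together with the chosen transition.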
How a BiLSTM is used for computing the best parsing configuration c
Let w = w1w2 … wn be the input sentence, and let t1t2 …tn be the corresponding POS tags.
We define vectors: xi = e(wi) o e(ti) where o is the concatenation operator.
A BiLSTM is composed of two LSTMs, LstmL and LstmR, reading the sentence in opposite directions:
BiLstm(x1:n, i) = LstmL(x1:i) o LstmR(xn:i) = vi
During training:
- the BiLSTM encodings vi are fed into further network layers
- the back-propagation algorithm is used to optimize the parameters of the BiLSTM
The above training procedure causes the BiLSTM function to extract from the input the relevant information for the task at hand.
Our feature function φ(c) is the concatenation of the BiLSTM vectors for:
- the top 3 items on the stack
- the first vector on the buffer
φ(c) = vs3 o vs2 o vs1 o vb1
Transition scoring is then defined by a multi-layer perceptron (MLP), where τ is a transition of our parser:
Score(φ(c), τ) = MLP(φ(c))[τ]
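A minimal PyTorch-style sketch of this architecture (class names, dimensions and layer sizes are my own choices, not from the slides):

```python
import torch
import torch.nn as nn

class BiLSTMParserFeatures(nn.Module):
    def __init__(self, vocab_size, tag_size, emb_dim=100, tag_dim=25,
                 hidden_dim=125, num_transitions=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.tag_emb = nn.Embedding(tag_size, tag_dim)
        # x_i = e(w_i) o e(t_i); the BiLSTM reads the sentence in both directions
        self.bilstm = nn.LSTM(emb_dim + tag_dim, hidden_dim, bidirectional=True)
        # phi(c) = v_s3 o v_s2 o v_s1 o v_b1 -> 4 vectors of size 2*hidden_dim
        self.mlp = nn.Sequential(
            nn.Linear(4 * 2 * hidden_dim, 100),
            nn.Tanh(),
            nn.Linear(100, num_transitions),   # one score per transition tau
        )

    def forward(self, word_ids, tag_ids, s3, s2, s1, b1):
        # word_ids, tag_ids: 1-D tensors over the sentence;
        # s3, s2, s1: positions of the top 3 stack items; b1: front of the buffer
        x = torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)
        v, _ = self.bilstm(x.unsqueeze(1))     # (n, 1, 2*hidden_dim)
        v = v.squeeze(1)
        phi = torch.cat([v[s3], v[s2], v[s1], v[b1]], dim=-1)
        return self.mlp(phi)                   # Score(phi(c), tau) = MLP(phi(c))[tau]
```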
We use a margin-based objective, aiming to maximize the margin between:
- the highest scoring correct action
- the highest scoring incorrect action
Let A be the set of possible transitions and G be the set of correct (gold) transitions at the current step.
The hinge loss at each parsing configuration c is defined as:
max{0, 1 − maxτ∈G Score(φ(c), τ) + maxτ∈A\G Score(φ(c), τ)}
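A small sketch of computing this loss from a vector of transition scores (function and argument names are mine; gold_indices marks which transitions are correct):

```python
import torch

def hinge_loss(scores, gold_indices):
    """scores: 1-D tensor of transition scores; gold_indices: indices of correct transitions."""
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[gold_indices] = True
    best_correct = scores[mask].max()      # highest-scoring correct action
    best_incorrect = scores[~mask].max()   # highest-scoring incorrect action
    return torch.clamp(1 - best_correct + best_incorrect, min=0)
```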
Graph-based dependency parsing
Graph-based dependency parsing casts parsing as a global optimisation problem over a set of dependency trees.
Let Ψ(x,y) be a scoring function. Given a sentence x and a set of candidate trees Y(x) for x, we want to find a highest-scoring tree ŷ.
write the formula (hint: use argmax)
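In symbols: ŷ = argmax over y ∈ Y(x) of Ψ(x, y).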
How LSTM networks work
An LSTM maps a sequence of input vectors:
- x1:t = x1, …, xt
to a sequence of output vectors:
- h1:t = h1, …, ht
where dim(xi) = din and dim(hi) = dout
We schematically write this by means of the recurrent relation:
- LSTM(x1:t) = LSTM(ht-1, xt) = ht
Vector ht is conditioned on all the input vectors x1:t.
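A tiny PyTorch illustration of this mapping (sizes are arbitrary):

```python
import torch
import torch.nn as nn

din, dout, t = 8, 16, 5
lstm = nn.LSTM(input_size=din, hidden_size=dout)   # one-directional LSTM
x = torch.randn(t, 1, din)                          # x_1:t, batch size 1
h, (h_t, c_t) = lstm(x)                             # one output vector per input vector
print(h.shape)    # torch.Size([5, 1, 16])  -> h_1:t, dim(h_i) = dout
print(h_t.shape)  # torch.Size([1, 1, 16])  -> h_t, conditioned on all of x_1:t
```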