Dependency Parsing Flashcards
Dependency parsing
Dependency parsing is a Natural Language Processing (NLP) task that analyzes the grammatical structure of a sentence by identifying head-dependent relations between its words.
Grammatical functions, 2 types
Grammatical functions provide the linguistic basis for the head-dependency relations in dependency structures.
Grammatical functions can be broken into two types:
- syntactic roles with respect to a predicate (often a verb); in this case the dependent is called an argument (e.g. subject, direct object, indirect object)
- functions that describe the ways in which words can modify their heads; in this case the dependent is called a modifier (e.g. nominal modifier, determiner, adverbial modifier, case marker)
Universal Dependencies (UD)
The Universal Dependencies (UD) project is an inventory of dependency relations that are linguistically motivated and cross-linguistically applicable.
Projective and non-projective trees
An arc from a head to a dependent is said to be projective if there is a path from the head to every word that lies between the head and the dependent in the sentence. (think of the example at slide 13 pdf 10)
A dependency tree is said to be projective if all the arcs are projective. Otherwise, the tree is non-projective.
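A small sketch that checks this definition directly (input format and function names are my own, not from the slides): an arc is projective iff every word strictly between the head and the dependent is reachable from the head.

```python
# Minimal projectivity check, assuming the tree is given as a dict mapping each
# dependent position to its head position (1-based positions, 0 = artificial root).
def is_projective(heads):
    def reachable_from(h):
        # collect all positions that lie below h in the tree
        result, frontier = set(), [h]
        while frontier:
            node = frontier.pop()
            for dep, head in heads.items():
                if head == node and dep not in result:
                    result.add(dep)
                    frontier.append(dep)
        return result

    for dep, head in heads.items():
        if head == 0:
            continue
        lo, hi = sorted((head, dep))
        covered = reachable_from(head) | {head}
        # every word strictly between head and dependent must be reachable from the head
        if any(w not in covered for w in range(lo + 1, hi)):
            return False
    return True

# projective chain: each word's head is the next word, word 3 is the root
print(is_projective({1: 2, 2: 3, 3: 0}))  # True
```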
Relation between phrase-structure trees and dependency trees
Phrase-structure trees and dependency trees are two types of tree structures that are commonly used to represent the grammatical structure of sentences.
- Phrase-structure trees, also known as constituency trees, represent the hierarchical organization of words into phrases
- Dependency trees, on the other hand, represent the grammatical structure directly as head-dependent relations between words
The 2 subtasks of the conversion process from phrase structure to dependency structure
- in a phrase structure tree, mark the head element at each node; this can be done using hand-written, deterministic rules
- construct a dependency from the head of each node to the head of each child node that does not inherit the head (a code sketch of this step follows below)
example at slide 18 pdf 10…
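A sketch of the second subtask, assuming the first one has already marked the head child of every internal node (class and field names are mine):

```python
# Step 2 of the conversion: for every internal node, add a dependency from its
# head word to the head word of each child that does not pass its head up.
class Node:
    def __init__(self, label, children=None, head_child=None, word=None):
        self.label = label            # e.g. "S", "NP", "VP"
        self.children = children or []
        self.head_child = head_child  # index of the child marked as head (step 1)
        self.word = word              # set for leaves only

def head_word(node):
    """Return the lexical head of a (sub)tree by following the marked head children."""
    if not node.children:
        return node.word
    return head_word(node.children[node.head_child])

def extract_dependencies(node, deps):
    """Append (head_word, dependent_word) pairs for children that do not inherit the head."""
    if not node.children:
        return
    h = head_word(node)
    for i, child in enumerate(node.children):
        if i != node.head_child:
            deps.append((h, head_word(child)))
        extract_dependencies(child, deps)

# tiny example: S -> NP("John") VP("sleeps"), with VP marked as the head child of S
leaf = lambda w: Node(w, word=w)
tree = Node("S",
            children=[Node("NP", [leaf("John")], head_child=0),
                      Node("VP", [leaf("sleeps")], head_child=0)],
            head_child=1)
deps = []
extract_dependencies(tree, deps)
print(deps)  # [('sleeps', 'John')]
```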
Transition-based dependency: configuration of the parser
A configuration of the parser consists of a stack (used as working memory), an input buffer (initialized with the input string), and a set of dependencies constructed so far (partial tree).
The parser is non-deterministic, and an oracle is used to drive the search.
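A minimal way to represent such a configuration in code (a sketch; field names are mine):

```python
from dataclasses import dataclass, field

@dataclass
class Configuration:
    stack: list = field(default_factory=list)    # working memory (word positions)
    buffer: list = field(default_factory=list)   # remaining input words
    deps: set = field(default_factory=set)       # (head, dependent) arcs built so far

def initial_configuration(words):
    # the buffer starts with the whole input string; stack and arc set start empty
    return Configuration(stack=[], buffer=list(words), deps=set())
```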
Arc-standard parser
We consider the arc-standard model (for projective parsing), using three transition operators:
- shift: remove the first buffer element and push it onto the stack
- leftArc: pop the second top-most stack element and attach it as a dependent of the top-most element
- rightArc: pop the top-most stack element and attach it as a dependent of the second top-most element
The parser uses a greedy strategy: the oracle provides a single choice at each step, and alternative options are dropped. The oracle takes as input a parser configuration and returns a transition operator (shift, leftArc, rightArc).
example slide 24 pdf 10
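A self-contained sketch of the three operators and the greedy loop (precondition checks, e.g. stack size ≥ 2 for the arc operators, are omitted; names are mine):

```python
# Arc-standard transitions on a configuration (stack, buffer, deps).
def shift(stack, buffer, deps):
    stack.append(buffer.pop(0))        # move the first buffer element onto the stack

def left_arc(stack, buffer, deps):
    dependent = stack.pop(-2)          # second top-most stack element
    deps.add((stack[-1], dependent))   # attach it as dependent of the top-most element

def right_arc(stack, buffer, deps):
    dependent = stack.pop()            # top-most stack element
    deps.add((stack[-1], dependent))   # attach it as dependent of the new top element

# greedy parsing loop: the oracle returns exactly one of the three operators per step
def parse(words, oracle):
    stack, buffer, deps = [], list(words), set()
    while buffer or len(stack) > 1:
        op = oracle(stack, buffer, deps)
        op(stack, buffer, deps)
    return deps
```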
Spurious ambiguity in the Arc-standard parser
Several transition sequences may lead to the same parse tree.
Generating training data for the Arc-standard parser
Given a configuration c and a reference tree t, a training instance is produced as follows (a code sketch of this oracle follows below):
- choose leftArc if on c it produces a dependency in t
- otherwise, choose rightArc if:
  - on c it produces a dependency in t
  - all of the dependents of the word at the top of the stack in c have already been assigned
- otherwise, choose shift
The set of training instances is then the sequence of (configuration, transition) pairs obtained by running this procedure against the reference trees.
We give precedence to left attachment, which resolves the spurious ambiguity (rightArc transitions are performed as late as possible).
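A sketch of this rule-based (static) oracle; gold is the set of (head, dependent) arcs of the reference tree t, and built is the set of arcs already constructed in c (names are mine):

```python
def oracle_transition(stack, buffer, built, gold):
    """Return the gold transition for configuration (stack, buffer, built arcs)."""
    if len(stack) >= 2:
        top, second = stack[-1], stack[-2]
        # prefer leftArc whenever it produces an arc of the reference tree
        if (top, second) in gold:
            return "leftArc"
        # rightArc only if the arc is in the tree AND the top word has already
        # received all of its own dependents (otherwise they could never attach)
        if (second, top) in gold:
            deps_of_top = {d for (h, d) in gold if h == top}
            if deps_of_top <= {d for (h, d) in built}:
                return "rightArc"
    return "shift"
```

Running this oracle at every step while parsing a treebank sentence yields the (configuration, transition) training pairs.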
Dependency parsing: feature functions and feature templates
Feature extraction for machine learning models is done by defining some feature function f(c, operator) that provides a vector of feature values.
Feature functions are usually specified by means of feature templates, defined as pairs of the form location.property.
Possible locations are:
- the stack s
- the buffer b
- the set of dependency relations r
Useful properties are:
- the word form w
- the lemma l
- the POS tag t
To avoid data sparseness, we focus on:
- the top levels of the stack
- the words near the front of the buffer
- the dependency relations already associated with the above elements
e.g. s1.w is the word form of the topmost stack element
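For concreteness, a sketch that instantiates a few such templates (s1.w, s1.t, s2.w, s2.t, b1.w, b1.t) as sparse "template=value" features; the function name and input format are my own:

```python
def extract_features(stack, buffer, words, pos_tags):
    """Instantiate a few first-order templates as 'template=value' strings."""
    feats = []
    if len(stack) >= 1:
        s1 = stack[-1]
        feats += [f"s1.w={words[s1]}", f"s1.t={pos_tags[s1]}"]
    if len(stack) >= 2:
        s2 = stack[-2]
        feats += [f"s2.w={words[s2]}", f"s2.t={pos_tags[s2]}"]
    if buffer:
        b1 = buffer[0]
        feats += [f"b1.w={words[b1]}", f"b1.t={pos_tags[b1]}"]
    return feats
```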
Second order feature templates
Since leftArc and rightArc operate on the top two elements of the stack, feature templates that combine properties from these two positions are also very useful; these are called second-order feature templates.
give an example (hint: f3107(c, op) = I(…))
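A purely illustrative (made-up) instance: f3107(c, op) = 1 if s1.t = VERB and s2.t = NOUN and op = leftArc, and 0 otherwise, i.e. the indicator of one particular combination of the POS tags of the top two stack elements together with the chosen transition.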
How a BiLSTM is used for computing the best parsing configuration c
Let w = w1w2 … wn be the input sentence, and let t1t2 …tn be the corresponding POS tags.
We define vectors: xi = e(wi) o e(ti) where o is the concatenation operator.
A BiLSTM is composed of two LSTMs, LstmL and LstmR, reading the sentence in opposite directions:
BiLstm(x1:n, i) = LstmL(x1:i) o LstmR(xn:i) = vi
During training:
- the BiLSTM encodings vi are fed into further network layers
- the back-propagation algorithm is used to optimize the parameters of the BiLSTM
The above training procedure causes the BiLSTM function to extract from the input the relevant information for the task at hand.
Our feature function φ(c) is the concatenation of the BiLSTM vectors for:
- the top 3 items on the stack
- the first vector on the buffer
φ(c) = vs3 o vs2 o vs1 o vb1
Transition scoring is then defined by a multi-layer perceptron (MLP), where τ is a transition of our parser:
Score(φ(c), τ) = MLP(φ(c))[τ]
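A minimal PyTorch-style sketch of this architecture (class names, dimensions and layer sizes are my own choices, not from the slides):

```python
import torch
import torch.nn as nn

class BiLSTMParserFeatures(nn.Module):
    def __init__(self, vocab_size, tag_size, emb_dim=100, tag_dim=25,
                 hidden_dim=125, num_transitions=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.tag_emb = nn.Embedding(tag_size, tag_dim)
        # x_i = e(w_i) o e(t_i); the BiLSTM reads the sentence in both directions
        self.bilstm = nn.LSTM(emb_dim + tag_dim, hidden_dim, bidirectional=True)
        # phi(c) = v_s3 o v_s2 o v_s1 o v_b1 -> 4 vectors of size 2*hidden_dim
        self.mlp = nn.Sequential(
            nn.Linear(4 * 2 * hidden_dim, 100),
            nn.Tanh(),
            nn.Linear(100, num_transitions),   # one score per transition tau
        )

    def forward(self, word_ids, tag_ids, s3, s2, s1, b1):
        # word_ids, tag_ids: 1-D tensors over the sentence;
        # s3, s2, s1: positions of the top 3 stack items; b1: front of the buffer
        x = torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)
        v, _ = self.bilstm(x.unsqueeze(1))     # (n, 1, 2*hidden_dim)
        v = v.squeeze(1)
        phi = torch.cat([v[s3], v[s2], v[s1], v[b1]], dim=-1)
        return self.mlp(phi)                   # Score(phi(c), tau) = MLP(phi(c))[tau]
```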
We use a margin-based objective, aiming to maximize the margin between:
- the highest scoring correct action
- the highest scoring incorrect action
Let A be the set of possible transitions and G be the set of correct (gold) transitions at the current step.
The hinge loss at each parsing configuration c is defined as:
max{0, 1 − maxτ∈G Score(φ(c), τ) + maxτ∈A\G Score(φ(c), τ)}
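A small sketch of computing this loss from a vector of transition scores (function and argument names are mine; gold_indices marks which transitions are correct):

```python
import torch

def hinge_loss(scores, gold_indices):
    """scores: 1-D tensor of transition scores; gold_indices: indices of correct transitions."""
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[gold_indices] = True
    best_correct = scores[mask].max()      # highest-scoring correct action
    best_incorrect = scores[~mask].max()   # highest-scoring incorrect action
    return torch.clamp(1 - best_correct + best_incorrect, min=0)
```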
Graph-based dependency parsing
Graph-based dependency parsing casts parsing as a global optimisation problem over a set of dependency trees.
Let Ψ(x,y) be a scoring function. Given a sentence x and a set of candidate trees Y(x) for x, we want to find a highest-scoring tree ŷ.
write the formula (hint: use argmax)
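In symbols: ŷ = argmax over y ∈ Y(x) of Ψ(x, y).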
How LSTM networks work
An LSTM maps a sequence of input vectors:
- x1:t = x1, …, xt
to a sequence of output vectors:
- h1:t = h1, …, ht
where dim(xi) = din and dim(hi) = dout
We schematically write this by means of the recurrent relation:
- LSTM(x1:t) = LSTM(ht-1, xt) = ht
Vector ht is conditioned on all the input vectors x1:t.
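A tiny PyTorch illustration of this mapping (sizes are arbitrary):

```python
import torch
import torch.nn as nn

din, dout, t = 8, 16, 5
lstm = nn.LSTM(input_size=din, hidden_size=dout)   # one-directional LSTM
x = torch.randn(t, 1, din)                          # x_1:t, batch size 1
h, (h_t, c_t) = lstm(x)                             # one output vector per input vector
print(h.shape)    # torch.Size([5, 1, 16])  -> h_1:t, dim(h_i) = dout
print(h_t.shape)  # torch.Size([1, 1, 16])  -> h_t, conditioned on all of x_1:t
```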