Dependency Parsing Flashcards

1
Q

Dependency parsing

A

Dependency parsing is a process in Natural Language Processing (NLP) that aims to analyze the grammatical structure of a sentence by identifying the relationships between its words.

2
Q

Grammatical functions, 2 types

A

Grammatical functions provide the linguistic basis for the head-dependency relations in dependency structures.

Grammatical functions can be broken into two types:

  • syntactic roles with respect to a predicate (often a verb); in this case the dependent is called an argument (e.g. subject, direct object, indirect object)
  • functions that describe ways in which words can modify their heads; in this case the dependent is called a modifier (e.g. noun modifier, determiner, adverb, case)
3
Q

Universal Dependencies (UD)

A

The Universal Dependencies (UD) project is an inventory of dependency relations that are linguistically motivated and cross-linguistically applicable.

4
Q

Projective and non-projective trees

A

An arc from a head to a dependent is said to be projective if there is a path from the head to every word that lies between the head and the dependent in the sentence. (think of the example at slide 13 pdf 10)

A dependency tree is said to be projective if all the arcs are projective. Otherwise, the tree is non-projective.
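
As a sketch of this definition, the check below assumes the tree is given as a dict mapping each word position (1-based) to its head position, with 0 standing for the root; the names are illustrative, not from the slides.

  def dominates(heads, h, k):
      # True if h is an ancestor of k (following head links upward);
      # h == 0 stands for the artificial root, which dominates every word.
      while k != 0:
          if k == h:
              return True
          k = heads[k]
      return h == 0

  def is_projective(heads):
      # A tree is projective if, for every arc (h, d), h dominates every
      # word lying strictly between h and d in the sentence.
      for d, h in heads.items():
          for k in range(min(h, d) + 1, max(h, d)):
              if not dominates(heads, h, k):
                  return False
      return True

  # e.g. is_projective({1: 2, 2: 0, 3: 5, 4: 5, 5: 2})  ->  True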

5
Q

Relation between phrase-structure trees and dependency trees

A

Phrase-structure trees and dependency trees are two types of tree structures that are commonly used to represent the grammatical structure of sentences.

  • Phrase-structure trees, also known as constituency trees, represent the hierarchical organization of words into phrases
  • Dependency trees, on the other hand, represent the grammatical relationships between words in terms of dependencies
6
Q

The 2 subtasks of the conversion process from phrase structure to dependency structure

A
  1. in a phrase-structure tree, mark the head element at each node; this can be done using hand-written, deterministic rules
  2. construct a dependency from the head of each node to the head of each child node that does not inherit the head (i.e. each non-head child)

example at slide 18 pdf 10…
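
A minimal sketch of the two subtasks, assuming the phrase-structure tree is given as nested (label, children) tuples with leaves of the form (POS, word); head_child stands in for the hand-written head rules and is only a placeholder here.

  def head_child(label, children):
      # Placeholder for the deterministic head rules of step 1
      # (a real rule table would, e.g., pick the verb inside a VP).
      return 0

  def to_dependencies(tree, deps):
      """Return the lexical head of `tree`, appending (head, dependent) arcs to deps."""
      label, children = tree
      if isinstance(children, str):          # leaf: (POS tag, word)
          return children
      heads = [to_dependencies(child, deps) for child in children]
      h = head_child(label, children)        # step 1: mark the head child
      for i, other in enumerate(heads):      # step 2: heads of non-head
          if i != h:                         #         children depend on it
              deps.append((heads[h], other))
      return heads[h]

  # deps = []; to_dependencies(parse_tree, deps) fills deps with the arcs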

7
Q

Transition-based dependency: configuration of the parser

A

A configuration of the parser consists of a stack (used as working memory), an input buffer (initialized with the input string), and a set of dependencies constructed so far (partial tree).

The parser is non-deterministic, and an oracle is used to drive the search.
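
A minimal sketch of such a configuration (names are illustrative, not the slides' notation); one common convention also places a ROOT symbol on the stack at the start.

  from dataclasses import dataclass, field
  from typing import List, Set, Tuple

  @dataclass
  class Configuration:
      stack: List[str] = field(default_factory=list)            # working memory
      buffer: List[str] = field(default_factory=list)           # remaining input
      arcs: Set[Tuple[str, str]] = field(default_factory=set)   # (head, dependent)

  def initial_configuration(sentence):
      # stack holds only ROOT, the buffer holds the whole input, no arcs yet
      return Configuration(stack=["ROOT"], buffer=list(sentence), arcs=set())

  # e.g. initial_configuration(["book", "me", "the", "flight"])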

8
Q

Arc-standard parser

A

We consider the arc-standard model (for projective parsing), using three transition operators:

  • shift: remove the first buffer element and push it onto the stack
  • leftArc: pop second top-most stack element and attach it as a dependent to the top-most element
  • rightArc: pop top-most stack element and attach it as a dependent to the second top-most element

The parser uses a greedy strategy: the oracle provides a single choice at each step; alternative options are dropped. The oracle takes as input a parser configuration and returns a transition operator (shift, leftArc, rightArc).

example slide 24 pdf 10
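
A minimal sketch of the three operators, assuming a configuration is just a (stack, buffer, arcs) triple with arcs stored as (head, dependent) pairs; each operator returns the new configuration.

  def shift(stack, buffer, arcs):
      # remove the first buffer element and push it onto the stack
      return stack + [buffer[0]], buffer[1:], arcs

  def left_arc(stack, buffer, arcs):
      # the second top-most stack element becomes a dependent of the top-most
      head, dep = stack[-1], stack[-2]
      return stack[:-2] + [head], buffer, arcs | {(head, dep)}

  def right_arc(stack, buffer, arcs):
      # the top-most stack element becomes a dependent of the second top-most
      head, dep = stack[-2], stack[-1]
      return stack[:-1], buffer, arcs | {(head, dep)}

  # e.g. parsing "book me":
  # c = (["ROOT"], ["book", "me"], set())
  # c = shift(*c); c = shift(*c); c = right_arc(*c)   # adds the arc book -> me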

9
Q

Spurious ambiguity in the Arc-standard parser

A

Several transition sequences may lead to the same parse tree.

10
Q

Generating training data for the Arc-standard parser

A

Given a configuration c and a reference tree t, a training instance is produced as follows:

  1. choose leftArc if on c it produces a dependency in t
  2. otherwise, choose rightArc if:
    - on c it produces a dependency in t
    - all of the dependents of the word at the top of the stack in c have already been assigned
  3. otherwise, choose shift

A set of training instances is thus a sequence of transition operators, each paired with the configuration c from which it was derived (with respect to the reference tree t).

We give precedence to left attachments, thus resolving spurious ambiguity (rightArc transitions are performed last).
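
A sketch of this oracle rule, assuming the reference tree t is given as a set of (head, dependent) arcs and the configuration as (stack, buffer, arcs); the names are illustrative.

  def oracle(stack, buffer, arcs, gold_arcs):
      if len(stack) >= 2:
          top, second = stack[-1], stack[-2]
          # 1. leftArc if it produces a dependency in t
          if (top, second) in gold_arcs:
              return "leftArc"
          # 2. rightArc if it produces a dependency in t AND every dependent
          #    of the top of the stack has already been assigned
          if (second, top) in gold_arcs and all(
                  (top, d) in arcs for (h, d) in gold_arcs if h == top):
              return "rightArc"
      # 3. otherwise shift
      return "shift"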

11
Q

Dependency parsing: feature functions and feature templates

A

Feature extraction for machine learning models is done by defining some feature function f(c, operator) that provides a vector of feature values.

Feature functions are usually specified by means of feature templates, defined as pairs of the form location.property.

Possible locations are:

  • the stack s
  • the buffer b
  • the set of dependency relations r

Useful properties are:

  • the word form w
  • the lemma l
  • the POS tag t

To avoid data sparseness, we focus on:

  • the top levels of the stack
  • the words near the front of the buffer
  • the dependency relations already associated with the above elements

e.g. s1.w is the word form of the topmost stack element
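
A minimal sketch of template-based extraction, assuming each stack/buffer item carries its word form (w), lemma (l) and POS tag (t); the templates chosen below are illustrative, not the slides' exact set.

  def extract_features(stack, buffer, operator):
      feats = []                      # each string names one binary feature
      if len(stack) >= 1:
          feats.append(f"s1.w={stack[-1]['w']}+op={operator}")
          feats.append(f"s1.t={stack[-1]['t']}+op={operator}")
      if len(stack) >= 2:
          feats.append(f"s2.t={stack[-2]['t']}+op={operator}")
      if buffer:
          feats.append(f"b1.w={buffer[0]['w']}+op={operator}")
          feats.append(f"b1.t={buffer[0]['t']}+op={operator}")
      return feats

  # e.g. extract_features([{"w": "book", "l": "book", "t": "VERB"}],
  #                       [{"w": "flight", "l": "flight", "t": "NOUN"}],
  #                       "shift")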

12
Q

Second order feature templates

A

Since leftArc and rightArc operate on the top two elements of the stack, feature templates that combine properties from these two positions are also very useful; these are called second-order feature templates.

give an example (hint: f3107(c, op) = I(…))
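
The slides' f3107 is not reproduced here; a hypothetical second-order indicator feature of the same shape could be:

f(c, op) = I(s1.t = VERB, s2.t = NOUN, op = leftArc)

i.e. the feature has value 1 exactly when the top of the stack is tagged VERB, the second element is tagged NOUN, and the proposed transition is leftArc, and value 0 otherwise.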

13
Q

How a BiLSTM is used for computing the best parsing configuration c

A

Let w = w1w2 … wn be the input sentence, and let t1t2 …tn be the corresponding POS tags.

We define vectors: xi = e(wi) o e(ti) where o is the concatenation operator.

A BiLSTM is composed of two LSTMs, a forward LstmF and a backward LstmB, reading the sentence in opposite directions:

BiLstm(x1:n, i) = LstmF(x1:i) o LstmB(xn:i) = vi

During training:

  • the BiLSTM encodings vi are fed into further network layers
  • the back-propagation algorithm is used to optimize the parameters of the BiLSTM

The above training procedure causes the BiLSTM function to extract from the input the relevant information for the task at hand.

Our feature function φ(c) is the concatenation of the BiLSTM vectors for:

  • the top 3 items on the stack
  • the first item on the buffer

φ(c) = vs3 o vs2 o vs1 o vb1

Transition scoring is then defined by a multi-layer perceptron (MLP), where τ is a transition of our parser:

Score(φ(c), τ) = MLP(φ(c))[τ]

We use a margin-based objective, aiming to maximize the margin between:

  • the highest scoring correct action
  • the highest scoring incorrect action

Let A be the set of possible transitions and G be the set of correct (gold) transitions at the current step.

The hinge loss at each parsing configuration c is defined as:

max{0, 1 - max_{τ ∈ G} Score(φ(c), τ) + max_{τ' ∈ A\G} Score(φ(c), τ')}

i.e. one plus the highest score of an incorrect transition minus the highest score of a correct transition, clipped at zero.
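
A minimal PyTorch-style sketch of this architecture (embedding and layer sizes are illustrative assumptions; handling of stacks with fewer than three items is omitted).

  import torch
  import torch.nn as nn

  class BiLstmScorer(nn.Module):
      def __init__(self, n_words, n_tags, n_transitions, emb=100, hidden=125):
          super().__init__()
          self.word_emb = nn.Embedding(n_words, emb)
          self.tag_emb = nn.Embedding(n_tags, emb)
          # xi = e(wi) o e(ti) is fed to a bidirectional LSTM
          self.bilstm = nn.LSTM(2 * emb, hidden, bidirectional=True,
                                batch_first=True)
          # MLP over phi(c) = vs3 o vs2 o vs1 o vb1 (four BiLSTM vectors)
          self.mlp = nn.Sequential(nn.Linear(4 * 2 * hidden, 100), nn.Tanh(),
                                   nn.Linear(100, n_transitions))

      def encode(self, word_ids, tag_ids):
          x = torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)
          v, _ = self.bilstm(x.unsqueeze(0))   # shape (1, n, 2 * hidden)
          return v.squeeze(0)                  # vi for each sentence position i

      def score(self, v, s3, s2, s1, b1):
          # s3, s2, s1, b1: sentence positions of the top three stack items
          # and the first buffer item in configuration c
          phi = torch.cat([v[s3], v[s2], v[s1], v[b1]])
          return self.mlp(phi)                 # one score per transition

  def hinge_loss(scores, gold, non_gold):
      # gold, non_gold: lists of transition indices in G and in A\G
      # max{0, 1 - best correct score + best incorrect score}
      return torch.clamp(1 - scores[gold].max() + scores[non_gold].max(), min=0)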

14
Q

Graph-based dependency parsing

A

Graph-based dependency parsing casts parsing as a global optimisation problem over a set of dependency trees.

Let Ψ(x,y) be a scoring function. Given a sentence x and a set of candidate trees Y(x) for x, we want to find a highest-scoring tree ŷ.

write the formula (hint: use argmax)
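
Written out with the notation above, the highest-scoring tree is

ŷ = argmax_{y ∈ Y(x)} Ψ(x, y)

i.e. the candidate tree in Y(x) that maximizes the score Ψ.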

15
Q

How LSTM networks work

A

An LSTM maps a sequence of input vectors:

  • x1:t = x1, …, xt

to a sequence of output vectors:

  • h1:t = h1, …, ht

where dim(xi) = din and dim(hi) = dout

We schematically write this by means of the recurrent relation:

  • LSTM(x1:t) = LSTM(ht-1, xt) = ht

Vector ht is conditioned on all the input vectors x1:t.
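
A minimal PyTorch sketch of this mapping (sizes are illustrative): a sequence of t vectors of dimension din goes in, a sequence of t vectors of dimension dout comes out.

  import torch
  import torch.nn as nn

  din, dout, t = 50, 64, 7
  lstm = nn.LSTM(input_size=din, hidden_size=dout, batch_first=True)

  x = torch.randn(1, t, din)     # x1:t as a batch of one sequence
  h, _ = lstm(x)                 # h has shape (1, t, dout)

  # h[0, -1] is ht, conditioned on all of x1:t via ht = LSTM(ht-1, xt)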
