Rules Flashcards

1
Q

NLP using rules:

A
  • Hand-crafted decision trees: Series of decision rules that classify or segment input text spans
  • Finite-State transducers: Series of rules that rewrite matching input spans into output spans
  • Template-based generation: predefined string templates filled with information to create new text
2
Q

Rule-based vs statistical methods

A
  • For most analysis and synthesis tasks, the best results are nowadays achieved with statistical/neural techniques
  • Particularly in industry, rule-based techniques remain in use because they are often well-controllable and explainable
  • All rule-based methods have a statistical counterpart in some way
3
Q

Hand-crafted decision trees

A

Why “Hand-crafted”?

  • The decision trees considered here are created solely based on human expert knowledge
  • In machine learning, decision trees are created automatically based on statistics derived from data
4
Q

Hand-crafted vs statistical decision trees
When to use?
For which tasks to use?

A
  • Hand-crafted: The decision criteria and the ordering of the resulting decision rules are defined manually
  • Statistical: The best decision criteria and the best ordering (according to the data) are determined automatically

When to use?
- Decision tree structures get complicated fast
- The number of decision criteria to consider should be small
- The decision criteria should not be too interdependent
- Rule of thumb: few criteria with clear connections to outcomes

For which tasks to use?
- Theoretically, there is no real restriction; practically, they are mostly used for shallow lexical or syntactic analyses
- Rule of thumb: the surface form of a text is enough for the decisions

5
Q

Tokenization and sentence splitting

A

Tokenization:

  • The text analysis that segments a span of text into its single tokens
  • Input: Usually a plain text, possibly segmented into sentences
  • Output: A list of tokens, not including whitespace between tokens

Sentence splitting:

  • The text analysis that segments a text into its single sentences
  • Input: Usually plain text, possibly segmented into tokens
  • Output: A list of sentences, not including space between sentences

What first?

  • Knowing token boundaries helps identify sentence boundaries
  • Knowing sentence boundaries helps to identify token boundaries

The default is to tokenize first, but both schedules exist

Sentence splitting with a decision tree:

  1. process an input text character by character
  2. decide for each character whether it is the last character in a sentence
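The two steps above can be sketched as a tiny hand-crafted decision tree in Python. The rules and the abbreviation list are illustrative assumptions, far smaller than what a real sentence splitter would need:

```python
# Minimal hand-crafted decision tree for sentence splitting (illustrative sketch).
# Each character is checked against a few manually ordered decision rules.

ABBREVIATIONS = {"dr", "mr", "mrs", "prof"}  # toy abbreviation list

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch not in ".!?":                       # Rule 1: sentence-final punctuation?
            continue
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if not nxt.isspace():                     # Rule 2: followed by whitespace?
            continue
        words = text[start:i].split()
        last_word = words[-1].lower() if words else ""
        if ch == "." and last_word in ABBREVIATIONS:   # Rule 3: known abbreviation?
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    if text[start:].strip():                      # trailing text without final punctuation
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late!"))
# ['Dr. Smith arrived.', 'He was late!']
```

Note how already these three rules interact: removing the abbreviation check changes the behavior of the punctuation rule, illustrating why such trees get hard to maintain.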
6
Q

Potential decision criteria for Tokenization and sentence splitting

A
  • End of sentence
  • Whitespace
  • Comma
  • Hyphen
  • Period
  • Letters and digits
7
Q

Issues with decision criteria

A
  • The connection of criteria to outcomes is often not straightforward
  • For numeric decision criteria, thresholds may be needed
  • Often, a weighting of different decision criteria is important
  • It is unclear how to find all relevant criteria
8
Q

Issues with decision trees

A

Decision trees get complex fast, even for few decision criteria
- The mutual effects of decision rules are hard to foresee
- Adding new decision criteria may change a tree drastically

9
Q

Benefits and limitations of decision trees

A

Benefits

  • Precise rules can be specified with human expert knowledge
  • The behavior of hand-crafted decision trees is well-controllable
  • Decision trees are considered to be easily interpretable

Limitations

  • The bigger the trees get, the harder it is to adjust them
  • Setting them up manually is practically infeasible for complex tasks
  • Including weightings is all but straightforward for decision trees
10
Q

Finite-State Transducers

A

FSA: A state machine that reads a string from a regular language; it represents the set of all strings belonging to the language

FST: Extends an FSA in that it reads one string and writes another; it represents a relation between two sets of strings

11
Q

Ways of employing an FST:

A
  • Translator/Rewriter: Read a string i and output another string o
  • Recognizer: Take a pair of strings i:o as input and output “accept” or “reject”
  • Generator: Output pairs of strings i:o over the alphabet
  • Set relator: Compute a relation between sets of strings I and O, such that i belongs to I and o belongs to O
12
Q

Morphological analysis as rewriting

A
  • Input: the fully inflected surface form of a word
  • Output: the stem + the part-of-speech + the number
  • This can be done with an FST that reads a word and writes the output

Knowledge needed:
- Lexicon: Stems with affixes, together with morphological information
- Morphotactics: a model that explains which morpheme classes can follow others inside a word
- Orthographic rules: a model of the changes that may occur in a word, particularly when two morphemes combine
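The rewriting above can be sketched as a toy FST whose lexicon contains a single stem. The transition table and tag format are simplified assumptions, not a real morphological analyzer:

```python
# A tiny FST, modeled as a transition table (state, input symbol) -> (next state, output).
# It analyzes "cat" and "cats" into stem + part-of-speech + number tags (toy morphology).

TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+pl"),   # plural suffix rewritten into morphological tags
}
FINAL_OUTPUT = {3: "+N+sg", 4: ""}  # extra output emitted on accepting states

def transduce(word):
    state, output = 0, ""
    for ch in word:
        if (state, ch) not in TRANSITIONS:
            return None                      # reject: no matching transition
        state, out = TRANSITIONS[(state, ch)]
        output += out
    if state not in FINAL_OUTPUT:
        return None                          # reject: not an accepting state
    return output + FINAL_OUTPUT[state]

print(transduce("cats"))   # cat+N+pl
print(transduce("cat"))    # cat+N+sg
print(transduce("dog"))    # None
```

In a real analyzer, the lexicon contributes thousands of such stem paths, the morphotactics define which suffix states may follow them, and orthographic rules are composed in as further transducers.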

13
Q

Word Normalization

A

- The conversion of all words in a text into some defined canonical form
- Used in NLP to identify different forms of the same word

Common character-level word normalizations:
- Case folding: Converting all letters to lower-case
- Removal of special characters: keep only letters and digits
- Removal of diacritical marks: Keep only plain letters without diacritics

Morphological normalization
- Identification of a single canonical representative for morphologically related wordforms
- Reduces inflections (and partly also derivations) to a common base
- Two alternative techniques: stemming and lemmatization
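The three character-level normalizations can be sketched with the standard library alone (a minimal sketch; real pipelines apply these selectively, since e.g. case can be meaningful):

```python
# Character-level word normalization: case folding, diacritics removal,
# and special-character removal, using only the Python standard library.
import unicodedata

def normalize(word):
    word = word.lower()                                     # case folding
    word = unicodedata.normalize("NFD", word)               # split letters from diacritics
    word = "".join(c for c in word
                   if unicodedata.category(c) != "Mn")      # drop diacritical marks
    word = "".join(c for c in word if c.isalnum())          # keep only letters and digits
    return word

print(normalize("Café-Au-Lait!"))   # cafeaulait
```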

14
Q

Stemming with FST

A

with affix elimination:

  • Stem a word with rule-based elimination of prefixes and suffixes
    • connects, connecting, connection → connect
    • embodied, body, bodies → bod
  • The elimination may be based on prefix and suffix forms only
15
Q

Porter stemmer

A
  • Based on a series of cascaded rewrite rules
  • Can be implemented as a lexicon-free FST

Steps:
1. Rewrite the longest possible match of a given token with a set of defined character sequence patterns
2. Repeat Step 1 until no pattern matches the token anymore

Signature
- Input: A string s (representing a word)
- Output: The identified stem of s
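The cascaded-rewrite idea can be sketched as follows. This is a heavily reduced illustration in the style of the Porter stemmer, not its real rule set (which has many more steps and measure-based conditions):

```python
# Cascaded suffix-rewrite rules in the style of the Porter stemmer
# (reduced sketch, not the real rule set). Rules are applied step by step;
# within each step, the first (longest) matching suffix wins.

STEPS = [
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],      # plural-like endings
    [("ational", "ate"), ("tional", "tion"), ("ization", "ize")], # derivational endings
    [("ing", ""), ("ed", "")],                                    # inflectional endings
]

def stem(word):
    for rules in STEPS:
        for suffix, replacement in rules:   # ordered longest-first within a step
            if word.endswith(suffix):
                word = word[: -len(suffix)] + replacement
                break                       # only one rule fires per step
    return word

print(stem("ponies"))        # poni
print(stem("caresses"))      # caress
print(stem("relational"))    # relate
```

Since the rules only inspect character sequences, no lexicon is needed, which is exactly what makes the approach both efficient and prone to the errors listed on the next card.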

16
Q

Issues of Porter stemmer

A
  • Difficult to modify, that is, the effects of changes are hardly predictable
  • Tends to overgeneralize:
    • Policy → police
    • University → universe
    • Organization → organ
  • Does not capture clear generalizations:
    • European and Europe
    • Matrices and matrix
    • machine and machinery
  • Generates some stems that are difficult to interpret:
    • Iteration → Iter
    • General → gener

Observations:

  • The application of rules is trivial, the knowledge is in the rules
  • The rules are specific to English, but adaptation to other languages is possible
  • The lack of lexicon has limitations
17
Q

Benefits and Limitations of FST

A

Benefits of FST

  • As for decision trees, precise rules can be specified by human experts
  • The behavior of FSTs for simple rewriting tasks is well-controllable
  • They also tend to be computationally efficient

Limitations:

  • FSTs tend to overgeneralize or to have low coverage
  • For more complex tasks, FSTs easily get very complicated
  • They are rather restricted to tasks where analyzing the surface form is enough
18
Q

Template-based generation

A

Template-based generation

  • Automatic or semi-automatic synthesis of texts based on sentence and discourse templates
  • Input: Goal of what to generate, information represented in some way
  • Output: A natural language text, conveying the information

Case Study

  • Below, we exemplify the generation of the description of a given hotel for a given customer group
19
Q

Data-to-text (Template-based generation)

A
  • Template-based generation is a data-to-text problem, i.e., structured data is to be encoded in unstructured text
  • The data may be given, or is selected as part of the generation process
  • Template based generation follows the Standard NLG Process
    1. Content determination: What to say
    2. Discourse planning: When to say what
    3. Sentence aggregation: What to say together
    4. Lexicalization: How to say what to say
    5. Referring expression generation: Decide how to refer to it
    6. Linguistic realization: How to say all together
20
Q

Content determination:

A
  • Task: decide what information should be communicated in a text
  • Process: Retrieve and filter information from some knowledge base
  • Result: Entities, attributes, values and relations
21
Q

Discourse planning:

A
  • Task: Organize the whole text in a coherent way
  • Process: Order and structure information using discourse knowledge
  • Result: A sequence or tree structure of discourse relations
22
Q

Sentence Aggregation

A
  • Task: Organize individual information in a fluent and readable way
  • Process: Aggregate the information to be communicated into sentences
  • Result: a structured representation of each sentence
23
Q

Lexicalization:

A
  • Task: Encode the information to be conveyed in natural language
  • Process: Select words and phrases to express the information
  • Result: A first representation in natural language
24
Q

Referring expression generation

A
  • Task: Replace identifiers of information in a natural, yet clear way
  • Process: Select adequate coreferences where connections are clear
  • Result: A refined natural language representation
25
Q

Linguistic realization

A
  • Task: Generate a morphologically and syntactically correct text
  • Process: Fill templates and adjust text according to rules of grammar
  • Result: the final output text
26
Q

Benefits and Limitations of Template-based generation

A

Benefits

  • Very sophisticated language patterns can be specified
  • As for the techniques above, the behavior is well-controllable
  • Templates enable near-perfect effectiveness in focused tasks

Limitations

  • They are usually domain-specific and presuppose what can be said
  • They allow for low linguistic variation only, limiting applicability
  • They require much manual labor, limiting scalability
27
Q

Template

A
  • Templates define constraints and points of variation for any text instance to be generated
  • Most common types: sentence templates and discourse templates
28
Q

Sentence template

A
  • Representation of a sentence as boilerplate text and parameters
  • Parameters: to be filled by instance-specific concepts and values
  • Boilerplate text: more or less unchanged in any text
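A minimal sentence-template sketch, using the hotel example from the case study (the parameter names and the filled-in values are made up for illustration):

```python
# Sentence template: boilerplate text with named parameters, filled with
# instance-specific values. The attribute names are hypothetical examples.
TEMPLATE = "The {category}-star hotel {name} is located {location} and offers {feature}."

def realize(values):
    return TEMPLATE.format(**values)

print(realize({
    "category": "4",
    "name": "Hotel Example",
    "location": "in the city center",
    "feature": "free breakfast",
}))
# The 4-star hotel Hotel Example is located in the city center and offers free breakfast.
```

Everything outside the braces is boilerplate; only the parameters vary per instance, which is why the output is well-controlled but shows little linguistic variation.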
29
Q

Discourse template

A
  • Hierarchical or sequential representation of the organization of a text
  • Based on discourse relations, series of sentence templates, or similar
30
Q

For what to employ templates?

A
  • Recurring texts with conventional form and structure
  • Situations where natural language is preferred over structured data
  • Precise requirements on how texts should look
  • Writing assistance for humans in recurring tasks
31
Q

Applications in practice

A
  • Answer questions of predefined types, such as those in Jeopardy
  • Formulate learned rules, such as those of decision trees
  • Explain medical information, such as patient diagnoses
  • Produce texts of predefined forms, such as job offers
  • Report on recurring events, such as soccer games
  • Describe products and services, such as hotels