Rules Flashcards
NLP using rules:
- Hand-crafted decision trees: Series of decision rules that classify or segment input text spans
- Finite-State transducers: Series of rules that rewrite matching input spans into output spans
- Template-based generation: predefined string templates filled with information to create new text
Rule-based vs statistical methods
- For most analysis and synthesis tasks, the best results are nowadays achieved with statistical/neural techniques
- Particularly in industry, rule-based techniques remain in use because they are often well-controllable and explainable
- All rule-based methods have a statistical counterpart in some way
Hand crafted Decision trees
Why “Hand-crafted”?
- The decision trees considered here are created solely based on human expert knowledge
- In machine learning, decision trees are created automatically based on statistics derived from data
Hand-crafted vs statistical decision trees
- Hand-crafted: the decision criteria and their ordering (i.e., the resulting decision rules) are defined manually
- Statistical: the best decision criteria and the best ordering (according to the data) are determined automatically
When to use?
- Decision tree structures get complicated fast
- The number of decision criteria to consider should be small
- The decision criteria should not be too interdependent
- Rule of thumb: few criteria with clear connections to outcomes
For which tasks to use?
- Theoretically, there is no real restriction; practically, they are mostly used for shallow lexical or syntactic analyses
- Rule of thumb: the surface form of a text is enough for the decisions
Tokenization and sentence splitting
Tokenization:
- The text analysis that segments a span of text into its single tokens
- Input: Usually plain text, possibly segmented into sentences
- Output: A list of tokens, not including whitespace between tokens
Sentence splitting:
- The text analysis that segments a text into its single sentences
- Input: Usually plain text, possibly segmented into tokens
- Output: A list of sentences, not including space between sentences
What first?
- Knowing token boundaries helps identify sentence boundaries
- Knowing sentence boundaries helps to identify token boundaries
- The default is to tokenize first, but both orderings exist in practice
Sentence splitting with a decision tree:
- process an input text character by character
- decide for each character whether it is the last character in a sentence
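The character-by-character decision process can be sketched as a small set of hand-crafted rules. This is a minimal illustrative sketch, not a production splitter; the specific rules (sentence-end punctuation, the following-whitespace check, the single-capital-initial heuristic) are assumptions chosen for the example.

```python
def is_sentence_end(text, i):
    """Hand-crafted decision rules: is text[i] the last character of a sentence?"""
    ch = text[i]
    if ch not in ".!?":                # rule 1: only these characters can end a sentence
        return False
    nxt = text[i + 1] if i + 1 < len(text) else " "
    if not nxt.isspace():              # rule 2: "3.14" or "e.g" mid-token do not end a sentence
        return False
    # rule 3: a period after a lone capital letter is likely an initial ("J. Smith")
    if ch == "." and i >= 1 and text[i - 1].isupper() and (i < 2 or not text[i - 2].isalpha()):
        return False
    return True

def split_sentences(text):
    """Process the text character by character, cutting where a rule fires."""
    sentences, start = [], 0
    for i in range(len(text)):
        if is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```

Note how the rules form an implicit decision tree: each `if` is one decision node, and adding a new criterion (e.g., for abbreviations) would require re-checking its interaction with every existing rule.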
Potential decision criteria for tokenization and sentence splitting
- End of sentence
- Whitespace
- Comma
- Hyphen
- Period
- Letters and digits
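For tokenization, several of these criteria can be encoded as ordered pattern rules. The sketch below is one possible encoding (the pattern set and its ordering are assumptions, not a complete tokenizer): numbers keep an internal period, words keep internal hyphens, and remaining punctuation becomes its own token.

```python
import re

# One illustrative rule per criterion from the list above; order matters,
# since earlier alternatives win the longest match at each position.
TOKEN_PATTERN = re.compile(r"""
    \d+(?:\.\d+)?        # digits, keeping a decimal period inside the token
  | \w+(?:-\w+)*         # letters/digits, keeping internal hyphens ("rule-based")
  | [.,!?;:]             # period, comma, and other punctuation as own tokens
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens, whitespace excluded."""
    return TOKEN_PATTERN.findall(text)
```

Because the number pattern precedes the word pattern, "3.14" stays one token instead of being split at the period, which is exactly the kind of criterion interaction a hand-crafted rule set has to get right.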
Issues with decision criteria
- The connection of criteria to outcomes is often not straightforward
- For numeric decision criteria, thresholds may be needed
- Often, a weighting of different decision criteria is important
- It is unclear how to find all relevant criteria
Issues with decision trees
- Decision trees get complex fast, even for a few decision criteria
- The mutual effects of decision rules are hard to foresee
- Adding new decision criteria may change a tree drastically
Benefits and limitations of decision trees
Benefits
- Precise rules can be specified with human expert knowledge
- The behavior of hand-crafted decision trees is well-controllable
- Decision trees are considered to be easily interpretable
Limitations
- The bigger the trees get, the harder it is to adjust them
- Setting them up manually is practically infeasible for complex tasks
- Including weightings is all but straightforward for decision trees
Finite-State Transducers
FSA: a state machine that reads a string from a regular language; it represents the set of all strings belonging to the language
FST: extends an FSA in that it reads one string and writes another; it represents the set of all relations between two sets of strings
Ways of employing an FST:
- Translator/Rewriter: Read a string i and output another string o
- Recognizer: Take a pair of strings i:o as input and output “accept” or “reject”
- Generator: Output pairs of strings i:o over the alphabet
- Set relator: Compute relations between sets of strings I and O, such that i belongs to I and o belongs to O
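The translator and recognizer modes can be shown with a deliberately tiny FST. This toy machine (the transition table and the a/b-swapping relation are invented for the example) has a single state, but the same table shape extends to any number of states.

```python
# A minimal FST as a transition table: (state, input symbol) -> (next state, output).
# This toy example relates each string over {a, b} to the string with a and b swapped.
START_STATE = 0
FINAL_STATES = {0}
TRANSITIONS = {
    (0, "a"): (0, "b"),
    (0, "b"): (0, "a"),
}

def transduce(s):
    """Translator mode: read input string s, write the output string (None = reject)."""
    state, out = START_STATE, []
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return None                      # no transition: the input is not accepted
        state, o = TRANSITIONS[(state, ch)]
        out.append(o)
    return "".join(out) if state in FINAL_STATES else None

def recognize(i, o):
    """Recognizer mode: accept the pair i:o iff the FST relates i to o."""
    return transduce(i) == o
```

Generator and set-relator modes would enumerate pairs from the same table rather than following one input string through it.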
Morphological analysis as rewriting
- Input: the fully inflected surface form of a word
- Output: the stem + the part-of-speech + the number
- This can be done with an FST that reads the word and writes the output
Knowledge needed:
- Lexicon: Stems with affixes, together with morphological information
- Morphotactics: a model that explains which morpheme classes can follow others inside a word
- Orthographic rules: a model of the changes that may occur in a word, particularly when two morphemes combine
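The three knowledge sources can be combined in a small rewriting sketch. Everything here is illustrative: the lexicon entries are invented, the morphotactics cover only English noun plurals, and real analyzers compile this knowledge into an FST rather than Python control flow.

```python
# Lexicon: stems with morphological information (illustrative entries only).
LEXICON = {"cat": "N", "fox": "N", "walk": "V"}

def analyze(word):
    """Rewrite an inflected surface form into 'stem +POS +Number' (toy noun analyzer)."""
    # Morphotactics: a noun stem may be followed by the plural suffix -es or -s;
    # trying -es before -s mirrors an orthographic rule (e-insertion: fox+s -> foxes).
    for suffix, number in (("es", "Pl"), ("s", "Pl"), ("", "Sg")):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] if suffix else word
            if LEXICON.get(stem) == "N":
                return f"{stem} +N +{number}"
    return None                    # no analysis found in this tiny lexicon
```

A full analyzer would also handle verbs, irregular forms, and further orthographic changes, but the division of labour (lexicon lookup, ordering of morphemes, spelling adjustments) is the same.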
Word Normalization
- The conversion of all words in a text into some defined canonical form
- Used in NLP to identify different forms of the same word
Common character-level word normalizations:
- Case folding: Converting all letters to lower-case
- Removal of special characters: keep only letters and digits
- Removal of diacritical marks: Keep only plain letters without diacritics
Morphological normalization
- Identification of a single canonical representative for morphologically related wordforms
- Reduces inflections (and partly also derivations) to a common base
- Two alternative techniques: stemming and lemmatization
Stemming with FST
with affix elimination:
- Stem a word with rule-based elimination of prefixes and suffixes
- connects, connecting, connection → connect
- embodied, body, bodies → bod
- The elimination may be based on prefix and suffix forms only
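A lexicon-free suffix eliminator can be sketched in a few lines. The suffix list and the minimum-stem-length guard are assumptions for illustration; note that working on suffix forms alone produces exactly the overstemming from the example above ("bodies" becomes "bod").

```python
# Rule-based suffix elimination: longest suffix first, no lexicon lookup.
SUFFIXES = ("ies", "ing", "ion", "ed", "es", "s")

def strip_suffix(word):
    """Remove the first (longest-first) matching suffix, keeping a minimal stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```

Prefix elimination would work the same way with `startswith`; combining both is what maps "embodied" toward "bod" in the example.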
Porter stemmer
- Based on a series of cascaded rewrite rules
- Can be implemented as a lexicon-free FST
Steps:
1. Rewrite the longest possible match of a given token with a set of defined character-sequence patterns
2. Repeat Step 1 until no pattern matches the token anymore
Signature
- Input: A string s (representing a word)
- Output: The identified stem of s
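The cascaded-rewrite idea behind the Porter stemmer can be sketched as ordered rule groups. This is a small illustrative subset, not the full Porter stemmer: only a handful of its rules are included, and each group fires at most once, mimicking Porter's within-step longest-match behaviour.

```python
import re

# Each step is a group of rewrite rules; within a step, only the first
# (longest) matching rule fires, and the steps cascade in order.
STEPS = [
    [  # step 1: plural endings (order encodes longest-match: sses before s)
        (re.compile(r"sses$"), "ss"),          # caresses -> caress
        (re.compile(r"ies$"), "i"),            # ponies   -> poni
        (re.compile(r"(?<=[a-z]{2})s$"), ""),  # cats     -> cat
    ],
    [  # step 2: derivational endings
        (re.compile(r"ational$"), "ate"),      # relational  -> relate
        (re.compile(r"tional$"), "tion"),      # conditional -> condition
    ],
]

def porter_like_stem(word):
    """Signature from above: read a string s, output its identified stem."""
    for rules in STEPS:
        for pattern, repl in rules:
            new = pattern.sub(repl, word)
            if new != word:
                word = new
                break  # only the first matching rule in a step fires
    return word
```

Since each rule only inspects character sequences, the whole cascade can be compiled into a lexicon-free FST, as the notes state.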