Rules Flashcards
NLP using rules:
- Hand-crafted decision trees: Series of decision rules that classify or segment input text spans
- Finite-State transducers: Series of rules that rewrite matching input spans into output spans
- Template-based generation: predefined string templates filled with information to create new text
Rule-based vs statistical methods
- For most analysis and synthesis tasks, the best results are nowadays achieved with statistical/neural techniques
- Particularly in industry, rule-based techniques are still widely used because they are often well-controllable and explainable
- All rule-based methods have a statistical counterpart in some way
Hand crafted Decision trees
Why “Hand-crafted”?
- The decision trees considered here are created solely from human expert knowledge
- In machine learning, decision trees are created automatically based on statistics derived from data
Hand-crafted vs statistical decision trees
- Hand-crafted: The decision criteria and the ordering of the resulting decision rules are defined manually
- Statistical: The best decision criteria and the best ordering (according to the data) are determined automatically
When to use?
- Decision tree structures get complicated fast
- The number of decision criteria to consider should be small
- The decision criteria should not be too interdependent
- Rule of thumb: few criteria with clear connections to outcomes
For which tasks to use?
- Theoretically, there is no real restriction, but in practice they are mostly used for shallow lexical or syntactic analyses
- Rule of thumb: the surface form of a text is enough for the decisions
Tokenization and sentence splitting
Tokenization:
- The text analysis that segments a span of text into its single tokens
- Input: Usually a plain text, possibly segmented into sentences
- Output: A list of tokens, not including whitespace between tokens
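As an illustration, a minimal rule-based tokenizer can be sketched with a single regular expression (a deliberate simplification: abbreviations, contractions, and URLs are not handled):

```python
import re

def tokenize(text):
    """Minimal rule-based tokenizer: returns word tokens and single
    punctuation marks, dropping all whitespace between tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)
```

For example, tokenize("Hello, world!") yields ["Hello", ",", "world", "!"]. Note that a contraction like "didn't" is split into three tokens, one of the cases a real tokenizer must treat specially.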
Sentence splitting:
- The text analysis that segments a text into its single sentences
- Input: Usually plain text, possibly segmented into tokens
- Output: A list of sentences, not including space between sentences
What first?
- Knowing token boundaries helps identify sentence boundaries
- Knowing sentence boundaries helps to identify token boundaries
- The default is to tokenize first, but both schedules exist
Sentence splitting with a decision tree:
- process an input text character by character
- decide for each character whether it is the last character in a sentence
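Such a hand-crafted decision tree can be sketched as nested conditions on each character; the criteria used here (end-of-sentence punctuation, following whitespace, capitalization of the next word) are illustrative choices:

```python
def is_sentence_end(text, i):
    """Hand-crafted decision tree: is text[i] the last character of a sentence?"""
    if text[i] not in ".!?":             # criterion 1: end-of-sentence punctuation
        return False
    if i + 1 == len(text):               # criterion 2: end of the text
        return True
    if not text[i + 1].isspace():        # criterion 3: followed by whitespace
        return False
    j = i + 1                            # criterion 4: next word starts upper-case
    while j < len(text) and text[j].isspace():
        j += 1
    return j < len(text) and text[j].isupper()

def split_sentences(text):
    """Apply the decision tree to each character position."""
    sentences, start = [], 0
    for i in range(len(text)):
        if is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```

For example, split_sentences("It works. Really well!") yields ["It works.", "Really well!"]. Note that this tree already misclassifies abbreviations such as "Dr. Smith", illustrating the issues with decision criteria discussed below.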
Potential decision criteria for Tokenization and sentence splitting
- End of sentence
- Whitespace
- Comma
- Hyphen
- Period
- Letters and digits
Issues with decision criteria
- The connection of criteria to outcomes is often not straightforward
- For numeric decision criteria, thresholds may be needed
- Often, a weighting of different decision criteria is important
- It is unclear how to find all relevant criteria
Issues with decision trees
Decision trees get complex fast, even with only a few decision criteria
- The mutual effects of decision rules are hard to foresee
- Adding new decision criteria may change a tree drastically
Benefits and limitations of decision trees
Benefits
- Precise rules can be specified with human expert knowledge
- The behavior of hand-crafted decision trees is well-controllable
- Decision trees are considered to be easily interpretable
Limitations
- The bigger the trees get, the harder it is to adjust them
- Setting them up manually is practically infeasible for complex tasks
- Including weightings is far from straightforward for decision trees
Finite-State Transducers
FSA: A state machine that reads a string from a regular language; it represents the set of all strings belonging to the language
FST: Extends an FSA in that it reads one string and writes another; it represents a relation between two sets of strings
Ways of employing an FST:
- Translator/Rewriter: read a string i and output another string o
- Recognizer: Take a pair of strings i:o as input; output “accept” or “reject”
- Generator: Output pairs of strings i:o over the alphabet
- Set relator: Compute the relation between sets of strings I and O, such that i belongs to I and o belongs to O
Morphological analysis as rewriting
- Input: the fully inflected surface form of a word
- Output: the stem + the part-of-speech + the number
- this can be done with an FST that reads a word and writes the output
Knowledge needed:
- Lexicon: Stems with affixes, together with morphological information
- Morphotactics: a model that explains which morpheme classes can follow others inside a word
- Orthographic rules: a model of the changes that may occur in a word, particularly when two morphemes combine
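A full FST encodes this knowledge as states and transitions; the sketch below (with an invented three-word lexicon) emulates the same surface-to-analysis relation procedurally, combining lexicon, simple noun morphotactics, and one orthographic rule:

```python
# Hypothetical mini-lexicon: stems with their part-of-speech
LEXICON = {"cat": "N", "fox": "N", "dog": "N"}

def analyze(word):
    """Morphological analysis as rewriting: surface form -> stem+POS+number.
    Emulates the relation an FST over lexicon, morphotactics, and
    orthographic rules would compute."""
    # Morphotactics: a noun stem may be followed by a plural suffix.
    # Orthographic rule: the plural suffix is "-es" after sibilants, else "-s".
    for suffix, tags in (("es", "+N+PL"), ("s", "+N+PL"), ("", "+N+SG")):
        stem = word[: len(word) - len(suffix)] if suffix else word
        if word.endswith(suffix) and LEXICON.get(stem) == "N":
            return stem + tags
    return None  # word not covered by the lexicon
```

For example, analyze("foxes") yields "fox+N+PL" and analyze("cat") yields "cat+N+SG".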
Word Normalization
- The conversion of all words in a text into some defined canonical form
- Used in NLP to identify different forms of the same word
Common character-level word normalizations:
- Case folding: Converting all letters to lower-case
- Removal of special characters: keep only letters and digits
- Removal of diacritical marks: Keep only plain letters without diacritics
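These three normalizations can be combined in a few lines (a sketch using Python’s standard unicodedata module):

```python
import unicodedata

def normalize(word):
    """Character-level word normalization."""
    word = word.casefold()                     # case folding
    word = unicodedata.normalize("NFD", word)  # split letters from combining marks
    word = "".join(c for c in word
                   if unicodedata.category(c) != "Mn")  # drop diacritical marks
    return "".join(c for c in word if c.isalnum())      # keep letters and digits only
```

For example, normalize("Café-Bar") yields "cafebar".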
Morphological normalization
- Identification of a single canonical representative for morphologically related wordforms
- Reduces inflections (and partly also derivations) to a common base
- Two alternative techniques: stemming and lemmatization
Stemming with FST
with affix elimination:
- Stem a word with rule-based elimination of prefixes and suffixes
- connects, connecting, connection → connect
- embodied, body, bodies → bod
- the elimination may be based on prefix and suffix forms only
Porter stemmer
- Based on a series of cascaded rewrite rules
- Can be implemented as a lexicon-free FST
Steps:
1. Rewrite longest possible match of a given token with a set of defined character sequence patterns
2. Repeat Step 1 until no pattern matches the token anymore
Signature
- Input: A string s (representing a word)
- Output: The identified stem of s
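The cascading scheme can be sketched as follows; the rules shown are a tiny invented subset in the spirit of Porter’s suffix rewrites, not the actual rule set (which also conditions each rule on the “measure” of the stem):

```python
import re

# Illustrative suffix-rewrite rules (invented subset, longest patterns first)
RULES = [
    (re.compile(r"tions?$"), "t"),  # connection(s) -> connect
    (re.compile(r"ing$"), ""),      # connecting -> connect
    (re.compile(r"s$"), ""),        # connects -> connect
]

def stem(word):
    """Apply cascaded rewrite rules until no pattern matches anymore."""
    changed = True
    while changed:
        changed = False
        for pattern, replacement in RULES:
            rewritten = pattern.sub(replacement, word)
            if rewritten != word:
                word = rewritten
                changed = True
                break  # restart the cascade from the first rule
    return word
```

For example, stem("connections") and stem("connecting") both yield "connect". A rule like ing$ -> "" also overstems "king" to "k", previewing the issues below.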
Issues of Porter stemmer
- Difficult to modify, that is, the effects of changes are hardly predictable
- Tends to overgeneralize:
- Policy → police
- University → universe
- Organization → organ
- Does not capture clear generalizations:
- European and Europe
- Matrices and matrix
- machine and machinery
- Generates some stems that are difficult to interpret:
- Iteration → Iter
- General → gener
Observations:
- The application of rules is trivial, the knowledge is in the rules
- The rules are specific to English, but adaptation to other languages is possible
- The lack of lexicon has limitations
Benefits and Limitations of FST
Benefits of FST
- As for decision trees, precise rules can be specified by human experts
- The behavior of FSTs for simple rewriting tasks is well-controllable
- They also tend to be computationally efficient
Limitations:
- FSTs tend to overgeneralize or to have low coverage
- For more complex tasks, FSTs easily get very complicated
- They are largely restricted to tasks where analyzing the surface form is enough
Template-based generation
Template-based generation
- Automatic or semi-automatic synthesis of texts based on sentence and discourse templates
- Input: Goal of what to generate, information represented in some way
- Output: A natural language text, conveying the information
Case Study
- Below, we exemplify generation with the description of a given hotel for a given customer group
Data-to-text (Template-based generation)
- Template-based generation is a data-to-text problem, i.e., structured data is to be encoded in unstructured text
- The data may be given, or is selected as part of the generation process
- Template-based generation follows the standard NLG process
- Content determination: What to say
- Discourse planning: When to say what
- Sentence aggregation: What to say together
- Lexicalization: How to say what to say
- Referring expression generation: Decide how to refer to it
- Linguistic realization: How to say all together
Content determination:
- Task: decide what information should be communicated in a text
- Process: Retrieve and filter information from some knowledge base
- Result: Entities, attributes, values and relations
Discourse planning:
- Task: Organize the whole text in a coherent way
- Process: Order and structure information using discourse knowledge
- Result: A sequence or tree structure of discourse relations
Sentence Aggregation
- Task: Organize individual information in a fluent and readable way
- Process: Aggregate the information to be communicated into sentences
- Result: a structured representation of each sentence
Lexicalization:
- Task: Encode the information to be conveyed in natural language
- Process: Select words and phrases to express the information
- Result: A first representation in natural language
Referring expression generation
- Task: Replace identifiers of information in a natural, yet clear way
- Process: Select adequate coreferences where connections are clear
- Result: A refined natural language representation
Linguistic realization
- Task: Generate a morphologically and syntactically correct text
- Process: Fill templates and adjust text according to rules of grammar
- Result: the final output text
Benefits and Limitations of Template-based generation
Benefits
- Very sophisticated language patterns can be specified
- As for the techniques above, the behavior is well-controllable
- Templates enable near-perfect effectiveness in focused tasks
Limitations
- They are usually domain-specific and presuppose what can be said
- They allow for low linguistic variation only, limiting applicability
- They require much manual labor, limiting scalability
Template
- Templates define constraints and points of variation for any text instance to be generated
- Most common types: sentence templates and discourse templates
Sentence template
- Representation of a sentence as boilerplate text and parameters
- Parameters: to be filled by instance-specific concepts and values
- Boilerplate text: more or less unchanged in any text
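For the hotel case study above, a sentence template and its filling can be sketched as follows (the template text and attribute names are invented for illustration):

```python
# Boilerplate text with parameters in braces (hypothetical template)
TEMPLATE = "The {name} is a {rating}-star hotel located in {city}."

def realize(entity):
    """Fill the template's parameters with instance-specific values;
    the boilerplate text remains unchanged in every generated sentence."""
    return TEMPLATE.format(**entity)
```

For example, realize({"name": "Grand Hotel", "rating": 4, "city": "Berlin"}) yields "The Grand Hotel is a 4-star hotel located in Berlin."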
Discourse template
- Hierarchical or sequential representation of the organization of a text
- Based on discourse relations, series of sentence template, or similar
For what to employ templates?
- Recurring texts with conventional form and structure
- Situations where natural language is preferred over structured data
- Precise requirements on how texts should look
- Writing assistance for humans in recurring tasks
Applications in practice
- Answer questions of predefined types, such as those in Jeopardy
- Formulate learned rules, such as those of decision trees
- Explain medical information, such as patient diagnoses
- Produce texts of predefined forms, such as job offers
- Report on recurring events, such as soccer games
- Describe products and services, such as hotels