Rules Flashcards
NLP using rules:
- Hand-crafted decision trees: Series of decision rules that classify or segment input text spans
- Finite-State transducers: Series of rules that rewrite matching input spans into output spans
- Template-based generation: predefined string templates filled with information to create new text
Rule-based vs statistical methods
- For most analysis and synthesis tasks, the best results are nowadays achieved with statistical/neural techniques
- Particularly in industry, rule-based techniques remain in use because they are often well-controllable and explainable
- All rule-based methods have a statistical counterpart in some way
Hand crafted Decision trees
Why “Hand-crafted”?
- The decision trees considered here are created solely based on human expert knowledge
- In machine learning, decision trees are created automatically based on statistics derived from data
Hand-crafted vs statistical decision trees
- Hand-crafted: the decision criteria and their ordering (i.e., the resulting decision rules) are defined manually
- Statistical: the best decision criteria and the best ordering (according to the data) are determined automatically
When to use?
- Decision tree structures get complicated fast
- The number of decision criteria to consider should be small
- The decision criteria should not be too interdependent
- Rule of thumb: few criteria with clear connections to outcomes
For which tasks to use?
- Theoretically, there is no real restriction; practically, they are mostly used for shallow lexical or syntactic analyses
- Rule of thumb: the surface form of a text is enough for the decisions
Tokenization and sentence splitting
Tokenization:
- The text analysis that segments a span of text into its single tokens
- Input: Usually plain text, possibly segmented into sentences
- Output: A list of tokens, not including whitespace between tokens
Sentence splitting:
- The text analysis that segments a text into its single sentences
- Input: Usually plain text, possibly segmented into tokens
- Output: A list of sentences, not including space between sentences
What first?
- Knowing token boundaries helps identify sentence boundaries
- Knowing sentence boundaries helps to identify token boundaries
- The default is to tokenize first, but both orderings exist in practice
Sentence splitting with a decision tree:
- process an input text character by character
- decide for each character whether it is the last character in a sentence
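The character-by-character decision process can be sketched as a small set of hand-crafted rules. This is a minimal illustrative sketch, not a production splitter; the specific rules (sentence-end punctuation, the following-whitespace check, the single-capital-initial heuristic) are assumptions chosen for the example.

```python
def is_sentence_end(text, i):
    """Hand-crafted decision rules: is text[i] the last character of a sentence?"""
    ch = text[i]
    if ch not in ".!?":                # rule 1: only these characters can end a sentence
        return False
    nxt = text[i + 1] if i + 1 < len(text) else " "
    if not nxt.isspace():              # rule 2: "3.14" or "e.g" mid-token do not end a sentence
        return False
    # rule 3: a period after a lone capital letter is likely an initial ("J. Smith")
    if ch == "." and i >= 1 and text[i - 1].isupper() and (i < 2 or not text[i - 2].isalpha()):
        return False
    return True

def split_sentences(text):
    """Process the text character by character, cutting where a rule fires."""
    sentences, start = [], 0
    for i in range(len(text)):
        if is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```

Note how the rules form an implicit decision tree: each `if` is one decision node, and adding a new criterion (e.g., for abbreviations) would require re-checking its interaction with every existing rule.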
Potential decision criteria for tokenization and sentence splitting
- End of sentence
- Whitespace
- Comma
- Hyphen
- Period
- Letters and digits
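For tokenization, several of these criteria can be encoded as ordered pattern rules. The sketch below is one possible encoding (the pattern set and its ordering are assumptions, not a complete tokenizer): numbers keep an internal period, words keep internal hyphens, and remaining punctuation becomes its own token.

```python
import re

# One illustrative rule per criterion from the list above; order matters,
# since earlier alternatives win the longest match at each position.
TOKEN_PATTERN = re.compile(r"""
    \d+(?:\.\d+)?        # digits, keeping a decimal period inside the token
  | \w+(?:-\w+)*         # letters/digits, keeping internal hyphens ("rule-based")
  | [.,!?;:]             # period, comma, and other punctuation as own tokens
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens, whitespace excluded."""
    return TOKEN_PATTERN.findall(text)
```

Because the number pattern precedes the word pattern, "3.14" stays one token instead of being split at the period, which is exactly the kind of criterion interaction a hand-crafted rule set has to get right.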
Issues with decision criteria
- The connection of criteria to outcomes is often not straightforward
- For numeric decision criteria, thresholds may be needed
- Often, a weighting of different decision criteria is important
- It is unclear how to find all relevant criteria
Issues with decision trees
- Decision trees get complex fast, even for a few decision criteria
- The mutual effects of decision rules are hard to foresee
- Adding new decision criteria may change a tree drastically
Benefits and limitations of decision trees
Benefits
- Precise rules can be specified with human expert knowledge
- The behavior of hand-crafted decision trees is well-controllable
- Decision trees are considered to be easily interpretable
Limitations
- The bigger the trees get, the harder it is to adjust them
- Setting them up manually is practically infeasible for complex tasks
- Including weightings is all but straightforward for decision trees
Finite-State Transducers
FSA: a state machine that reads a string from a regular language; it represents the set of all strings belonging to the language
FST: extends an FSA in that it reads one string and writes another; it represents the set of all relations between two sets of strings
Ways of employing an FST:
- Translator/Rewriter: Read a string i and output another string o
- Recognizer: Take a pair of strings i:o as input and output “accept” or “reject”
- Generator: Output pairs of strings i:o over the alphabet
- Set relator: Compute relations between sets of strings I and O, such that i belongs to I and o belongs to O
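The translator and recognizer modes can be shown with a deliberately tiny FST. This toy machine (the transition table and the a/b-swapping relation are invented for the example) has a single state, but the same table shape extends to any number of states.

```python
# A minimal FST as a transition table: (state, input symbol) -> (next state, output).
# This toy example relates each string over {a, b} to the string with a and b swapped.
START_STATE = 0
FINAL_STATES = {0}
TRANSITIONS = {
    (0, "a"): (0, "b"),
    (0, "b"): (0, "a"),
}

def transduce(s):
    """Translator mode: read input string s, write the output string (None = reject)."""
    state, out = START_STATE, []
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return None                      # no transition: the input is not accepted
        state, o = TRANSITIONS[(state, ch)]
        out.append(o)
    return "".join(out) if state in FINAL_STATES else None

def recognize(i, o):
    """Recognizer mode: accept the pair i:o iff the FST relates i to o."""
    return transduce(i) == o
```

Generator and set-relator modes would enumerate pairs from the same table rather than following one input string through it.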
Morphological analysis as rewriting
- Input: the fully inflected surface form of a word
- Output: the stem + the part-of-speech + the number
- This can be done with an FST that reads the word and writes the output
Knowledge needed:
- Lexicon: Stems with affixes, together with morphological information
- Morphotactics: a model that explains which morpheme classes can follow others inside a word
- Orthographic rules: a model of the changes that may occur in a word, particularly when two morphemes combine
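The three knowledge sources can be combined in a small rewriting sketch. Everything here is illustrative: the lexicon entries are invented, the morphotactics cover only English noun plurals, and real analyzers compile this knowledge into an FST rather than Python control flow.

```python
# Lexicon: stems with morphological information (illustrative entries only).
LEXICON = {"cat": "N", "fox": "N", "walk": "V"}

def analyze(word):
    """Rewrite an inflected surface form into 'stem +POS +Number' (toy noun analyzer)."""
    # Morphotactics: a noun stem may be followed by the plural suffix -es or -s;
    # trying -es before -s mirrors an orthographic rule (e-insertion: fox+s -> foxes).
    for suffix, number in (("es", "Pl"), ("s", "Pl"), ("", "Sg")):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] if suffix else word
            if LEXICON.get(stem) == "N":
                return f"{stem} +N +{number}"
    return None                    # no analysis found in this tiny lexicon
```

A full analyzer would also handle verbs, irregular forms, and further orthographic changes, but the division of labour (lexicon lookup, ordering of morphemes, spelling adjustments) is the same.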
Word Normalization
- The conversion of all words in a text into some defined canonical form
- Used in NLP to identify different forms of the same word
Common character-level word normalizations:
- Case folding: Converting all letters to lower-case
- Removal of special characters: keep only letters and digits
- Removal of diacritical marks: Keep only plain letters without diacritics
Morphological normalization
- Identification of a single canonical representative for morphologically related wordforms
- Reduces inflections (and partly also derivations) to a common base
- Two alternative techniques: stemming and lemmatization
Stemming with FST
with affix elimination:
- Stem a word with rule-based elimination of prefixes and suffixes
- connects, connecting, connection → connect
- embodied, body, bodies → bod
- The elimination may be based on prefix and suffix forms only
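A lexicon-free suffix eliminator can be sketched in a few lines. The suffix list and the minimum-stem-length guard are assumptions for illustration; note that working on suffix forms alone produces exactly the overstemming from the example above ("bodies" becomes "bod").

```python
# Rule-based suffix elimination: longest suffix first, no lexicon lookup.
SUFFIXES = ("ies", "ing", "ion", "ed", "es", "s")

def strip_suffix(word):
    """Remove the first (longest-first) matching suffix, keeping a minimal stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```

Prefix elimination would work the same way with `startswith`; combining both is what maps "embodied" toward "bod" in the example.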
Porter stemmer
- Based on a series of cascaded rewrite rules
- Can be implemented as a lexicon-free FST
Steps:
1. Rewrite the longest possible match of a given token with a set of defined character-sequence patterns
2. Repeat Step 1 until no pattern matches the token anymore
Signature
- Input: A string s (representing a word)
- Output: The identified stem of s
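The cascaded-rewrite idea behind the Porter stemmer can be sketched as ordered rule groups. This is a small illustrative subset, not the full Porter stemmer: only a handful of its rules are included, and each group fires at most once, mimicking Porter's within-step longest-match behaviour.

```python
import re

# Each step is a group of rewrite rules; within a step, only the first
# (longest) matching rule fires, and the steps cascade in order.
STEPS = [
    [  # step 1: plural endings (order encodes longest-match: sses before s)
        (re.compile(r"sses$"), "ss"),          # caresses -> caress
        (re.compile(r"ies$"), "i"),            # ponies   -> poni
        (re.compile(r"(?<=[a-z]{2})s$"), ""),  # cats     -> cat
    ],
    [  # step 2: derivational endings
        (re.compile(r"ational$"), "ate"),      # relational  -> relate
        (re.compile(r"tional$"), "tion"),      # conditional -> condition
    ],
]

def porter_like_stem(word):
    """Signature from above: read a string s, output its identified stem."""
    for rules in STEPS:
        for pattern, repl in rules:
            new = pattern.sub(repl, word)
            if new != word:
                word = new
                break  # only the first matching rule in a step fires
    return word
```

Since each rule only inspects character sequences, the whole cascade can be compiled into a lexicon-free FST, as the notes state.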