Lexicon Flashcards
Lexicon
A repository of terms that represents a language, a vocabulary or similar
Why ordering:
- For humans: to enable comfortable searching and browsing
- For computers: to enable efficient search
Representation of lexicons:
- As an ordered list: for binary search over the ordering
- As hashsets or hashmaps for direct access to entries
- As regular expressions: for use as part of string patterns
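A minimal Python sketch of the three representations. The toy term list, the example text, and the helper name contains_sorted are illustrative assumptions, not part of the original notes.

```python
import bisect
import re

# Hypothetical toy lexicon, for illustration only.
terms = ["battery", "camera", "display", "screen", "speaker"]

# 1) Ordered list: binary search over the sorted terms.
sorted_terms = sorted(terms)

def contains_sorted(term):
    """Return True if term is in sorted_terms, using binary search."""
    i = bisect.bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term

# 2) Hash set: direct access to entries.
term_set = set(terms)

# 3) Regular expression: the lexicon as an alternation inside a pattern.
# Longer terms are listed first; re.escape guards special characters.
alternation = "|".join(
    re.escape(t) for t in sorted(terms, key=len, reverse=True)
)
pattern = re.compile(r"\b(?:" + alternation + r")\b")

text = "The camera is great, but the battery drains fast."
matches = pattern.findall(text)
```

The ordered list suits range queries and browsing, the set suits plain membership checks, and the pattern can be embedded in larger string patterns.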
Types of lexicons
Terms only
Terms with definitions
Terms with information
Terms only
- Term list: a simple list of terms, used e.g. to cover all possible instances of a specific concept
- Language lexicon: words along with their stems, affixes, and inflections, used e.g. for morphological analysis
- Vocabulary: a list of terms that are known or used in a particular context, used e.g. to cover linguistic style
Terms with definitions:
- Dictionary: a list of terms along with their definitions, grammatical information, and more, could be used to compare term meaning
- Glossary: a vocabulary with term definitions, could be used to compare term meaning
- Thesaurus: a dictionary of synonyms, with information on related terms, used e.g. to find similar terms
Terms with information
- Gazetteers: Location names along with metadata, used e.g. as part of entity recognizers
- Frequency list: Terms along with their absolute or relative frequency in some text collection, used e.g. to decide what terms to use as machine learning features
- Confidence lexicons: Terms along with confidence values (or probabilities) to represent some concept, used e.g. for attribute extraction
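As a small illustration of a frequency list, the following sketch computes absolute and relative frequencies with Python's collections.Counter. The tokenized mini-corpus is a made-up assumption.

```python
from collections import Counter

# Hypothetical tokenized text collection, for illustration only.
corpus = [
    ["the", "battery", "is", "great"],
    ["the", "screen", "is", "dim"],
]

# Absolute frequency: raw count of each term across all texts.
absolute = Counter(token for text in corpus for token in text)

# Relative frequency: count divided by the total number of tokens.
total = sum(absolute.values())
relative = {term: count / total for term, count in absolute.items()}
```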
Selected analysis tasks
- Disambiguation of punctuation, as in abbreviations
- Morphological analysis of words
- Attribute extraction (e.g. product aspects)
- Entity recognition (e.g. time information)
- Style analysis of text (e.g. formal vs. informal language)
- Sentiment analysis of text (e.g. positive vs. negative words)
- Social bias detection based on social group terms and bias terms
Selected generation tasks:
- Template-based generation of texts
- Spelling correction of words
- Language modeling to predict next words
Lexicon Acquisition
Getting seed terms
Expanding the lexicon (possibly incrementally)
Finalizing the lexicon
Getting seed terms (Lexicon Acquisition)
- The first step is often to come up with a set of initial terms
- These terms usually closely relate to the core idea of a given concept
Techniques to get seed terms:
- Experts may handcraft an initial list of seed terms
- Seed terms may be obtained from an annotation study
- Predefined term lists may exist already somewhere
How many seed terms?
- The number depends on the concept of interest and on the feasible amount of manual labor
- In practice, typical numbers range from a handful to a few hundred
Expanding the lexicon (possibly incrementally)
- In many cases, seed terms do not sufficiently cover a given concept
- Lexicons may then be expanded by terms related to the seeds
Techniques to expand a lexicon:
- Find terms co-occurring with the seeds in a given corpus
- Compute similarities between seeds and other terms
- Train a term classifier on texts with the seeds and apply it
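The co-occurrence technique can be sketched as follows. The sketch assumes sentence-level co-occurrence and uses pointwise mutual information (PMI) as the score. The toy corpus, the seed set, and the function name pmi_with_seeds are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each sentence is a set of tokens.
sentences = [
    {"battery", "life", "long"},
    {"battery", "charge", "fast"},
    {"screen", "bright"},
    {"battery", "life", "short"},
    {"screen", "life"},
]
seeds = {"battery"}

def pmi_with_seeds(sentences, seeds):
    """Score each non-seed term by its PMI with 'sentence contains a seed'."""
    n = len(sentences)
    term_count = Counter()   # sentences containing the term
    joint_count = Counter()  # sentences containing the term and a seed
    seed_count = 0           # sentences containing at least one seed
    for sent in sentences:
        has_seed = bool(sent & seeds)
        seed_count += int(has_seed)
        for term in sent - seeds:
            term_count[term] += 1
            if has_seed:
                joint_count[term] += 1
    scores = {}
    for term, joint in joint_count.items():
        # PMI = log2( P(term, seed) / (P(term) * P(seed)) )
        p_joint = joint / n
        p_term = term_count[term] / n
        p_seed = seed_count / n
        scores[term] = math.log2(p_joint / (p_term * p_seed))
    return scores

scores = pmi_with_seeds(sentences, seeds)
# Candidates ranked from most to least associated with the seeds.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms that never co-occur with a seed receive no score here. A term that appears only next to seeds ("long") scores higher than one that also appears elsewhere ("life").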
How to use these for expansion?
- Many techniques create some numeric score for each candidate term
- The terms can thus be ranked by their suitability to be in the lexicon
- A classifier may also just make one binary decision per term
Incremental lexicon expansion
- After adding new terms to a lexicon, the expansion may be repeated
- A stop condition is then needed to terminate the incremental process
- In NLP, this process is called bootstrapping
Bootstrapping process:
- Initialize the lexicon with a set of seed terms
- Use the seed terms to find new terms in some corpus
- Score the new terms, and add the best ones to the lexicon
- Go back to the second step, unless the stop condition is met
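The loop above can be sketched in Python. This is only one possible instantiation: the scoring function (the share of a candidate's sentences that also contain a lexicon term), the parameters min_score, top_k, and max_rounds, and the toy corpus are all assumptions made for illustration.

```python
def bootstrap(sentences, seeds, min_score=0.5, top_k=2, max_rounds=5):
    """Grow a lexicon from seed terms; all parameters are illustrative."""
    lexicon = set(seeds)  # initialize the lexicon with the seed terms
    for _ in range(max_rounds):  # stop condition (a): round limit
        # Find candidates: terms co-occurring with current lexicon terms.
        occurrences, cooccurrences = {}, {}
        for sent in sentences:
            has_lexicon_term = bool(sent & lexicon)
            for term in sent - lexicon:
                occurrences[term] = occurrences.get(term, 0) + 1
                if has_lexicon_term:
                    cooccurrences[term] = cooccurrences.get(term, 0) + 1
        # Score: share of a term's sentences that contain a lexicon term.
        scored = {t: cooccurrences[t] / occurrences[t] for t in cooccurrences}
        best = sorted(
            (t for t in scored if scored[t] >= min_score),
            key=lambda t: (-scored[t], t),
        )[:top_k]
        if not best:  # stop condition (b): no candidate is good enough
            break
        lexicon.update(best)  # add the best terms, then repeat
    return lexicon

# Hypothetical toy corpus: each sentence is a set of tokens.
sentences = [
    {"battery", "charger"},
    {"battery", "charger"},
    {"charger", "cable"},
    {"cable", "adapter"},
    {"weather", "nice"},
]
result = bootstrap(sentences, {"battery"})
```

In this toy run, "charger" is added first, then "cable", then "adapter", each found only through terms added in the previous round. Unrelated terms such as "weather" never co-occur with the lexicon, so the process stops.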
Finalizing the lexicon
- Not all terms found during expansion will reliably represent the concept
- Given some measure, a threshold may be used to prune the lexicon
Techniques to finalize a lexicon
- Either, keep all terms from lexicon expansion (and seeds)
- Or, prune the lexicon based on some threshold t of the confidence values of the terms
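Pruning by a threshold t amounts to a simple filter. In this sketch the confidence values and the function name prune are made up for illustration.

```python
# Hypothetical confidence values from lexicon expansion, for illustration only.
expanded = {"battery": 0.95, "charger": 0.80, "cable": 0.55, "weather": 0.10}

def prune(lexicon, t):
    """Keep only the terms whose confidence value is at least t."""
    return {term: conf for term, conf in lexicon.items() if conf >= t}

pruned = prune(expanded, t=0.5)
```

A higher t yields a smaller but more reliable lexicon. With t = 0, all terms from expansion are kept.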
Confidence values of expanded-lexicon terms
- The scores from lexicon expansion serve as confidence values
- A candidate's value may be aggregated from multiple scores.
- The aggregate score may have to be normalized to a defined range.
Confidence value of seed terms?
- Assume we are given a training set for the seed terms w1, ..., wk in which their mentions of the concept are marked
- Confidence value of wi = (number of marked mentions of wi) / (number of all occurrences of wi)
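One reading of this card is that a seed term's confidence is the fraction of its occurrences in the training set that are marked as mentions of the concept. The sketch below follows that assumption. The annotated tokens and the function name seed_confidence are hypothetical.

```python
from collections import Counter

# Hypothetical annotated training data: (token, is_marked_mention) pairs.
annotated_tokens = [
    ("battery", True), ("battery", True), ("battery", False),
    ("apple", True), ("apple", False), ("apple", False), ("apple", False),
]
seeds = ["battery", "apple"]

def seed_confidence(annotated_tokens, seeds):
    """Marked mentions of each seed divided by all its occurrences."""
    occurrences = Counter()
    marked = Counter()
    for token, is_marked in annotated_tokens:
        if token in seeds:
            occurrences[token] += 1
            marked[token] += int(is_marked)
    # Seeds that never occur in the training set get no value.
    return {w: marked[w] / occurrences[w] for w in seeds if occurrences[w] > 0}

confidence = seed_confidence(annotated_tokens, seeds)
```

Here "battery" is marked in two of its three occurrences, while the ambiguous "apple" is marked in only one of four, so it receives a much lower confidence value.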
Benefits and Limitations (Lexicon Acquisition)
Benefits
- A lexicon is an intuitive representation of simple linguistic knowledge
- Big lexicons can be acquired with largely unsupervised methods
- Well-established techniques exist for acquisition, such as pointwise mutual information (PMI)
Limitations
- Coming up with adequate seed terms may not be straightforward
- Increasing the size of a lexicon usually leads to a decrease in quality
- Lexicons are inherently limited to the terms actually used
Lexicon Matching
- The identification of concepts in natural language texts, each being represented by a lexicon
- This requires deciding when a matching term refers to a concept
- Main goals include extracting concept instances or assessing texts
When to use lexicon matching?
- A given lexicon can be used to find all term occurrences in a text
- The existence of a given term in a lexicon can be checked
- The density or distribution of vocabularies in a text can be measured
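As an example of the last use, the density of a vocabulary in a text can be measured as the share of tokens that belong to the lexicon. The informal-language lexicon, the example sentence, and the function name lexicon_density are assumptions for illustration.

```python
def lexicon_density(tokens, lexicon):
    """Share of tokens that are lexicon terms (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in lexicon)
    return hits / len(tokens)

# Hypothetical lexicon of informal terms, for illustration only.
informal = {"gonna", "wanna", "lol", "btw"}
tokens = "btw I am gonna be late lol".split()

# Check whether a single term exists in the lexicon.
is_informal = "gonna" in informal

# Measure the density of the vocabulary in the text.
density = lexicon_density(tokens, informal)
```

Three of the seven tokens are in the lexicon, which could serve as a simple signal in style analysis.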
Attribute extraction:
- The text analysis that extracts attributes of some entity from text
- Input: a text, at least split into tokens
- Output: the list of all extracted attributes
Role in NLP
- Used for tasks such as aspect-based sentiment analysis or the extraction of complex events
Attribute extraction with lexicon matching
Why is lexicon matching not trivial?
- Some terms may represent an attribute, but not always
- Some terms are nested in other terms
Approach in a nutshell
- Acquire a confidence lexicon based on a collection of reviews
- Choose a threshold t ∈ [0,1]
- Extract each lexicon term from a text that has a confidence value ≥ t
- Prefer longer terms over shorter terms (and ignore capitalization)
Confidence lexicon:
- A lexicon of attributes where each term is assigned a value in [0,1]
- The value represents the confidence that a term really is an attribute
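The approach can be sketched as a greedy longest-match over the token list. This is a minimal sketch rather than a reference implementation: the function name extract_attributes, the confidence values, and the review sentence are hypothetical, and multi-word terms are assumed to be stored as space-separated strings.

```python
def extract_attributes(tokens, confidence_lexicon, t):
    """Extract lexicon terms with confidence >= t, preferring longer terms."""
    # Keep only terms that pass the threshold, stored as lowercase token tuples.
    accepted = {
        tuple(term.lower().split())
        for term, conf in confidence_lexicon.items()
        if conf >= t
    }
    max_len = max((len(term) for term in accepted), default=0)
    lowered = [token.lower() for token in tokens]  # ignore capitalization
    attributes = []
    i = 0
    while i < len(lowered):
        # Try the longest span first, so nested shorter terms are skipped.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            candidate = tuple(lowered[i:i + n])
            if candidate in accepted:
                attributes.append(" ".join(candidate))
                i += n
                break
        else:
            i += 1  # no accepted term starts at this token
    return attributes

# Hypothetical confidence lexicon and review text, for illustration only.
lexicon = {
    "battery": 0.9, "battery life": 0.95, "life": 0.3,
    "screen": 0.85, "apple": 0.2,
}
tokens = "The Battery life is great but the screen is dim".split()
attributes = extract_attributes(tokens, lexicon, t=0.5)
```

With t = 0.5, "battery life" is extracted as one attribute instead of the nested "battery", and "life" on its own would be rejected because its confidence is below the threshold.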