Lexicon Flashcards

1
Q

Lexicon

A

A repository of terms that represents a language, a vocabulary or similar

Why order a lexicon?

  • For humans: to enable comfortable searching and browsing
  • For computers: to enable efficient search

Representation of lexicons:

  • As an ordered list: for binary search over the ordering
  • As hash sets or hash maps: for direct access to entries
  • As regular expressions: for use as part of string patterns
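
The three representations above can be sketched as follows. This is a minimal illustration, not a reference implementation; the toy term list, the helper `in_sorted`, and the sample text are assumptions made for the example.

```python
import bisect
import re

terms = ["battery", "display", "price", "screen"]  # hypothetical toy lexicon

# 1. Ordered list: binary search over the sorted terms
sorted_terms = sorted(terms)

def in_sorted(term):
    """Check membership with binary search instead of a linear scan."""
    i = bisect.bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term

# 2. Hash set: direct access to entries
term_set = set(terms)

# 3. Regular expression: the lexicon as part of a string pattern
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, sorted_terms)) + r")\b")

text = "The screen is great, but the price is high."
found = pattern.findall(text)
```

Here `found` holds the lexicon terms that occur in `text`, in order of appearance.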
2
Q

Types of lexicons

A

Terms only
Terms with definitions
Terms with information

3
Q

Terms only

A
  • Term list: a simple list of terms, used e.g. to cover all possible instances of a specific concept
  • Language lexicon: words along with their stems, affixes, and inflections, used e.g. for morphological analysis
  • Vocabulary: a list of terms that are known or used in a particular context, used e.g. to cover linguistic style
4
Q

Terms with definitions

A
  • Dictionary: a list of terms along with their definitions, grammatical information, and more, could be used to compare term meaning
  • Glossary: a vocabulary with term definitions, could be used to compare term meaning
  • Thesaurus: a dictionary of synonyms, with information on related terms, used e.g. to find similar terms
5
Q

Terms with information

A
  • Gazetteers: Location names along with metadata, used e.g. as part of entity recognizers
  • Frequency list: Terms along with their absolute or relative frequency in some text collection, used e.g. to decide what terms to use as machine learning features
  • Confidence lexicons: Terms along with confidence values (or probabilities) of representing some concept, used e.g. for attribute extraction
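
As an illustration of a frequency list, the sketch below maps each term to its absolute and relative frequency in a token sequence. The function name `frequency_list` and the sample tokens are assumptions made for this example.

```python
from collections import Counter

def frequency_list(tokens):
    """Map each term to (absolute frequency, relative frequency)."""
    counts = Counter(tok.lower() for tok in tokens)
    total = sum(counts.values())
    # most_common() orders the terms from most to least frequent
    return {term: (n, n / total) for term, n in counts.most_common()}

tokens = "the screen and the battery".split()  # hypothetical toy text
freqs = frequency_list(tokens)
```

A feature-selection step could then keep, for instance, only terms above some minimum frequency.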
6
Q

Selected analysis tasks

A
  • Disambiguation of punctuation, as in abbreviations
  • Morphological analysis of words
  • Attribute extraction (e.g. product aspects)
  • Entity recognition (e.g. time information)
  • Style analysis of text (e.g. formal vs. informal language)
  • Sentiment analysis of text (e.g. positive vs. negative words)
  • Social bias detection based on social group terms and bias terms
7
Q

Selected generation tasks:

A
  • Template-based generation of texts
  • Spelling correction of words
  • Language modeling to predict next words
8
Q

Lexicon Acquisition

A

Getting seed terms
Expanding the lexicon (possibly incrementally)
Finalizing the lexicon

9
Q

Getting seed terms (Lexicon Acquisition)

A
  • The first step is often to come up with a set of initial terms
  • These terms usually closely relate to the core idea of a given concept

Techniques to get seed terms:

  • Experts may handcraft an initial list of seed terms
  • Seed terms may be obtained from an annotation study
  • Predefined term lists may exist already somewhere

How many seed terms?

  • The number depends on the concept of interest and on the feasible amount of manual labor
  • In practice, typical numbers range from a handful to a few hundred
10
Q

Expanding the lexicon (possibly incrementally)

A
  • In many cases, seed terms do not sufficiently cover a given concept
  • Lexicons may then be expanded by terms related to the seeds

Techniques to expand a lexicon:

  • Find terms cooccurring with the seeds in a given corpus
  • Compute similarities between seeds and other terms
  • Train a term classifier on texts with the seeds and apply it

How to use these for expansion?

  • Many techniques create some numeric score for each candidate term
  • The terms can thus be ranked by their suitability to be in the lexicon
  • A classifier may also just do one binary decision per term

Incremental lexicon expansion

  • After adding new terms to a lexicon, the expansion may be repeated
  • A stop condition is then needed to terminate the incremental process
  • In NLP, this process is called bootstrapping
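
To make the scoring and ranking of candidate terms concrete, here is a minimal sketch of the co-occurrence technique. The scoring rule (the fraction of a term's sentences that also contain a seed), the function name, and the toy sentences are assumptions made for illustration; they are not a specific published measure.

```python
from collections import Counter

def cooccurrence_scores(sentences, seeds):
    """Score each non-seed term by the fraction of its sentences that contain a seed."""
    total = Counter()      # number of sentences in which a term occurs
    with_seed = Counter()  # ... of which also contain at least one seed
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        has_seed = bool(tokens & seeds)
        for token in tokens - seeds:
            total[token] += 1
            if has_seed:
                with_seed[token] += 1
    return {term: with_seed[term] / total[term] for term in total}

sentences = [  # hypothetical toy corpus
    "great battery and screen",
    "screen and display look sharp",
    "shipping was slow",
]
seeds = {"screen"}
scores = cooccurrence_scores(sentences, seeds)
# Rank candidates from most to least suitable (ties broken alphabetically)
ranked = sorted(scores, key=lambda term: (-scores[term], term))
```

Note that a raw ratio like this also rewards function words such as "and", so in practice a stopword filter or an association measure such as PMI would likely be preferable.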
11
Q

Bootstrapping process:

A
  1. Initialize the lexicon with a set of seed terms
  2. Use the seed terms to find new terms in some corpus
  3. Score the new terms, and add the best ones to lexicon
  4. Go back to Step 2, unless the stop condition is met
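
The four steps above can be sketched as a loop. This is a simplified illustration: the function names, the `min_score` and `max_rounds` parameters, the two stop conditions, and the toy corpus are all assumptions made for the example.

```python
def bootstrap(seeds, find_candidates, score, min_score=0.5, max_rounds=10):
    lexicon = set(seeds)                                 # Step 1: initialize with seeds
    for _ in range(max_rounds):                          # stop condition: round limit
        candidates = find_candidates(lexicon) - lexicon  # Step 2: find new terms
        best = {t for t in candidates
                if score(t, lexicon) >= min_score}       # Step 3: score and select
        if not best:                                     # stop condition: nothing new
            break
        lexicon |= best                                  # Step 4: repeat with larger lexicon
    return lexicon

# Hypothetical toy corpus: each sentence is a set of tokens
corpus = [{"screen", "display"}, {"display", "panel"}, {"shipping", "slow"}]

def find_candidates(lexicon):
    """All tokens from sentences that mention at least one lexicon term."""
    found = set()
    for sentence in corpus:
        if sentence & lexicon:
            found |= sentence
    return found

def score(term, lexicon):
    """Fraction of the term's sentences that also mention a lexicon term."""
    containing = [s for s in corpus if term in s]
    return sum(1 for s in containing if s & lexicon) / len(containing)

lexicon = bootstrap({"screen"}, find_candidates, score)
```

In this toy run, "display" is added in the first round and "panel" in the second, while "shipping" and "slow" are never reached because they do not co-occur with any lexicon term.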
12
Q

Finalizing the lexicon

A
  • Not all terms found during expansion will reliably represent the concept
  • Given some measure, a threshold may be used to prune the lexicon

Techniques to finalize a lexicon

  • Either keep all terms from lexicon expansion (and the seeds)
  • Or, prune the lexicon based on some threshold t of the confidence values of the terms

Confidence values of expanded-lexicon terms

  • The scores from lexicon expansion serve as confidence values
  • A candidate's value may be aggregated from multiple scores.
  • The aggregate score may have to be normalized to a defined range.

Confidence value of seed terms?

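  • Assume we are given a training set in which the mentions of all seed terms w1, ..., wk are marked. The confidence of a seed term wi can then be taken as the fraction of marked mentions of wi among all occurrences of wi.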
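
A minimal sketch of finalizing a lexicon under these ideas. The min-max normalization, the reading of seed confidence as "marked mentions divided by all occurrences", and all names and numbers below are assumptions made for illustration.

```python
def normalize(scores):
    """Min-max normalize aggregate scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {term: 1.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}

def seed_confidence(marked, occurrences):
    """Assumed reading: fraction of a seed's occurrences that are marked mentions."""
    return marked / occurrences if occurrences else 0.0

def prune(confidences, t):
    """Keep only terms whose confidence value is at least the threshold t."""
    return {term: c for term, c in confidences.items() if c >= t}

# Hypothetical aggregate scores from lexicon expansion
confidences = normalize({"display": 4.0, "panel": 3.0, "thing": 1.0})
# Hypothetical seed term: marked in 9 of its 10 occurrences
confidences["screen"] = seed_confidence(marked=9, occurrences=10)
final = prune(confidences, t=0.5)
```

With t = 0.5, the low-scoring candidate "thing" is dropped while the seed and the two stronger candidates remain.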
13
Q

Benefits and Limitations (Lexicon Acquisition)

A

Benefits

  • A lexicon is an intuitive representation of simple linguistic knowledge
  • Big lexicons can be acquired with largely unsupervised methods
  • Well-established techniques exist for acquisition, such as PMI

Limitations

  • Coming up with adequate seed terms may not be straightforward
  • Increasing the size of a lexicon usually leads to a decrease in quality
  • Lexicons are inherently limited by their focus on the terms used
14
Q

Lexicon Matching

A
  • The identification of concepts in natural language texts, each being represented by a lexicon
  • This requires deciding when a matching term refers to a concept
  • Main goals include extracting concept instances or assessing texts

When to use lexicon matching?

  1. A given lexicon can be used to find all term occurrences in a text
  2. The existence of a given term in a lexicon can be checked
  3. The density or distribution of vocabularies in a text can be measured
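
The third use, measuring the density of a vocabulary in a text, might look like the sketch below. The function name, the informal-language vocabulary, and the sample tokens are assumptions made for this example.

```python
def lexicon_density(tokens, vocabulary):
    """Fraction of tokens that belong to the given vocabulary (case-insensitive)."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in vocabulary)
    return hits / len(tokens)

informal = {"gonna", "wanna", "lol"}   # hypothetical informal-style vocabulary
tokens = "I am gonna go lol".split()
density = lexicon_density(tokens, informal)
```

A higher density of the informal vocabulary could then be taken as one signal for informal style.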

Attribute extraction:

  • The text analysis that extracts attributes of some entity from text
  • Input: a text, at least split into tokens
  • Output: the list of all extracted attributes

Role in NLP

  • Used for tasks such as aspect-based sentiment analysis or the extraction of complex events
15
Q

Attribute extraction with lexicon matching

A

Why is lexicon matching not trivial?

  • Some terms may represent an attribute, but not always
  • Some terms are nested in other terms

Approach in a nutshell

  1. Acquire confidence lexicon based on a collection of reviews
  2. Choose a threshold t ∈ [0,1]
  3. Extract each lexicon term from a text that has a confidence value ≥ t
  4. Prefer longer terms over shorter terms (and ignore capitalization)

Confidence lexicon:

  • A lexicon of attributes where each term is assigned a value ∈ [0,1]
  • The value represents the confidence that a term really is an attribute
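
Steps 2 to 4 of the approach can be sketched as a greedy longest-match over tokens. This is a simplified illustration, not the original method's implementation; the function name, the toy confidence lexicon, and the sample sentence are assumptions.

```python
def extract_attributes(tokens, lexicon, t):
    """Return lexicon terms with confidence >= t found in tokens, preferring longer terms."""
    # Keep only sufficiently confident terms, stored as lowercased token tuples
    terms = {tuple(term.lower().split()) for term, conf in lexicon.items() if conf >= t}
    max_len = max((len(term) for term in terms), default=0)
    lowered = [tok.lower() for tok in tokens]  # ignore capitalization
    extracted, i = [], 0
    while i < len(lowered):
        # Try the longest possible term first, so nested shorter terms are skipped
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            candidate = tuple(lowered[i:i + n])
            if candidate in terms:
                extracted.append(" ".join(candidate))
                i += n
                break
        else:
            i += 1  # no term starts at this position
    return extracted

# Hypothetical confidence lexicon, e.g. acquired from a collection of reviews
lexicon = {"battery": 0.9, "battery life": 0.95, "life": 0.2, "screen": 0.8}
tokens = "The Battery life is fine but the screen flickers".split()
attributes = extract_attributes(tokens, lexicon, t=0.5)
```

Here "battery life" is extracted as one attribute rather than "battery" alone, and "life" on its own would be ignored because its confidence is below the threshold.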
16
Q

Benefits and Limitations (Lexicon Matching)

A

Benefits:

  • Lexicon matching is particularly reliable for unambiguous terms
  • Lexicons with confidence values allow for trading precision for recall
  • The idea of matching a lexicon is well-explainable

Limitations:

  • Information that is not in the employed lexicons can never be found
  • Ambiguous terms require other methods for disambiguation
  • Composition of related information is hard to model with lexicons