Lexicon Flashcards

Question 1

Q

Lexicon

Answer

A

A repository of terms that represents a language, a vocabulary or similar

Why order a lexicon:

For humans: to enable comfortable searching and browsing
For computers: to enable efficient search

Representation of lexicons:

As an ordered list: for binary search over the ordering
As hash sets or hash maps: for direct access to entries
As regular expressions: for use as part of string patterns
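
The three representations above can be sketched in Python as follows. The term list and the helper name in_sorted_lexicon are made up for illustration; this is a minimal sketch, not a reference implementation.

```python
import bisect
import re

# Hypothetical example terms (not from the flashcards).
terms = ["battery", "camera", "display", "screen size", "zoom"]

# 1) Ordered list: binary search over the sorted terms.
sorted_terms = sorted(terms)

def in_sorted_lexicon(term: str) -> bool:
    i = bisect.bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term

# 2) Hash set: direct membership test.
term_set = set(terms)

# 3) Regular expression: the lexicon as an alternation inside a pattern.
# Longer terms are listed first so they win over shorter alternatives.
alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
lexicon_pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

matches = lexicon_pattern.findall("The camera is great, but the Battery dies fast.")
```

A hash map (dict) would replace the set whenever each term carries a definition or other information.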

Question 2

Q

Types of lexicons

Answer

A

Terms only
Terms with definitions
Terms with information

Question 3

Q

Terms only

Answer

A

Term list: a simple list of terms, used e.g. to cover all possible instances of a specific concept
Language lexicon: words along with their stems, affixes, and inflections, used e.g. for morphological analysis
Vocabulary: a list of terms that are known or used in a particular context, used e.g. to cover linguistic style

Question 4

Q

Terms with definitions

Answer

A

Dictionary: a list of terms along with their definitions, grammatical information, and more, could be used to compare term meaning
Glossary: a vocabulary with term definitions, could be used to compare term meaning
Thesaurus: a dictionary of synonyms, with information on related terms, used e.g. to find similar terms

Question 5

Q

Terms with information

Answer

A

Gazetteers: Location names along with metadata, used e.g. as part of entity recognizers
Frequency list: Terms along with their absolute or relative frequency in some text collection, used e.g. to decide what terms to use as machine learning features
Confidence lexicons: Terms along with confidence values (or probabilities) to represent some concept, used e.g. for attribute extraction
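
As a small illustration of a frequency list, the sketch below counts absolute and relative term frequencies with Python's collections.Counter. The three-sentence corpus and the whitespace tokenization are assumptions for the example only.

```python
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace for simplicity.
corpus = ["the camera is great", "the battery is weak", "great camera great zoom"]
tokens = [tok for text in corpus for tok in text.split()]

# Absolute frequency: number of occurrences of each term.
absolute = Counter(tokens)

# Relative frequency: share of each term among all tokens.
total = sum(absolute.values())
relative = {term: count / total for term, count in absolute.items()}

# The most frequent terms could serve as candidate machine learning features.
top_terms = [term for term, _ in absolute.most_common(3)]
```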

Question 6

Q

Selected analysis tasks

Answer

A

Disambiguation of punctuation, as in abbreviations
Morphological analysis of words
Attribute extraction (e.g. product aspects)
Entity recognition (e.g. time information)
Style analysis of text (e.g. formal vs. informal language)
Sentiment analysis of text (e.g. positive vs. negative words)
Social bias detection based on social group terms and bias terms

Question 7

Q

Selected generation tasks:

Answer

A

Template-based generation of texts
Spelling correction of words
Language modeling to predict next words
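
To make one of these tasks concrete, here is a hedged sketch of lexicon-based spelling correction. It uses difflib.get_close_matches from the Python standard library as a stand-in similarity measure; the flashcards do not name a specific method, and the lexicon and the function name correct are made up.

```python
import difflib

# Hypothetical lexicon of correctly spelled terms.
lexicon = ["camera", "battery", "display"]

def correct(word: str) -> str:
    """Return the most similar lexicon term, or the word itself if none is close."""
    candidates = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=0.6)
    return candidates[0] if candidates else word

suggestion = correct("camrea")
unchanged = correct("xyz")
```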

Question 8

Q

Lexicon Acquisition

Answer

A

Getting seed terms
Expanding the lexicon (possibly incrementally)
Finalizing the lexicon

Question 9

Q

Getting seed terms (Lexicon Acquisition)

Answer

A

The first step is often to come up with a set of initial terms
These terms usually closely relate to the core idea of a given concept

Techniques to get seed terms:

Experts may handcraft an initial list of seed terms
Seed terms may be obtained from an annotation study
Predefined term lists may exist already somewhere

How many seed terms?

The number depends on the concept of interest and on the feasible amount of manual labor
In practice, typical numbers range from a handful to a few hundred

Question 10

Q

Expanding the lexicon (possibly incrementally)

Answer

A

In many cases, seed terms do not sufficiently cover a given concept
Lexicons may then be expanded by terms related to the seeds

Techniques to expand a lexicon:

Find terms cooccurring with the seeds in a given corpus
Compute similarities between seeds and other terms
Train a term classifier on texts with the seeds and apply it

How to use these for expansion?

Many techniques create some numeric score for each candidate term
The terms can thus be ranked by their suitability to be in the lexicon
A classifier may also simply make one binary decision per term
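
As an example of a numeric score per candidate term, the sketch below ranks candidates by pointwise mutual information (PMI) with the seed terms, estimated from document-level cooccurrence. The function name pmi_scores, the toy documents, and taking the maximum over seeds are assumptions for illustration.

```python
import math
from collections import Counter

def pmi_scores(documents, seeds):
    """Score each non-seed term by its maximum document-level PMI with any seed."""
    n_docs = len(documents)
    doc_sets = [set(doc.lower().split()) for doc in documents]
    doc_freq = Counter(term for doc in doc_sets for term in doc)
    scores = {}
    for term in doc_freq:
        if term in seeds:
            continue
        best = None
        for seed in seeds:
            joint = sum(1 for doc in doc_sets if term in doc and seed in doc)
            if joint == 0:
                continue  # PMI is undefined without any cooccurrence
            p_joint = joint / n_docs
            p_term = doc_freq[term] / n_docs
            p_seed = doc_freq[seed] / n_docs
            pmi = math.log2(p_joint / (p_term * p_seed))
            best = pmi if best is None else max(best, pmi)
        if best is not None:
            scores[term] = best
    return scores

# Hypothetical toy corpus and seed set.
documents = ["camera lens sharp", "camera zoom sharp", "sharp knife",
             "battery weak", "price high"]
scores = pmi_scores(documents, {"camera"})

# Rank candidates by score (ties broken alphabetically).
ranked = sorted(scores, key=lambda t: (-scores[t], t))
```

Here "sharp" cooccurs with the seed twice but also appears elsewhere, so it ranks below "lens" and "zoom", which appear only together with the seed.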

Incremental lexicon expansion

After adding new terms to a lexicon, the expansion may be repeated
A stop condition is then needed to terminate the incremental process
In NLP, this process is called bootstrapping

Question 11

Q

Bootstrapping process:

Answer

A

1. Initialize the lexicon with a set of seed terms
2. Use the seed terms to find new terms in some corpus
3. Score the new terms, and add the best ones to the lexicon
4. Go back to Step 2, unless the stop condition is met
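
The four steps above can be sketched as a short loop. The scoring by document cooccurrence counts, the parameters top_k, min_count, and max_rounds, and the function name bootstrap are all assumptions chosen to keep the example small; real systems typically use stronger scores.

```python
from collections import Counter

def bootstrap(seeds, documents, top_k=2, min_count=2, max_rounds=5):
    lexicon = set(seeds)                                  # Step 1: start from the seeds
    doc_sets = [set(doc.lower().split()) for doc in documents]
    for _ in range(max_rounds):                           # stop condition: round limit
        counts = Counter()
        for doc in doc_sets:                              # Step 2: find new terms near lexicon terms
            if doc & lexicon:
                counts.update(doc - lexicon)
        best = [term for term, c in counts.most_common(top_k)
                if c >= min_count]                        # Step 3: score and keep the best
        if not best:                                      # stop condition: nothing new to add
            break
        lexicon.update(best)                              # Step 4: repeat with the larger lexicon
    return lexicon

# Hypothetical toy corpus.
documents = ["camera lens zoom", "lens zoom flash", "camera lens",
             "flash zoom tripod", "battery charger"]
result = bootstrap({"camera"}, documents)
```

In this toy run the lexicon grows by one term per round ("lens", then "zoom", then "flash") and stops once no candidate reaches min_count.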

Question 12

Q

Finalizing the lexicon

Answer

A

Not all terms found during expansion will reliably represent the concept
Given some measure, a threshold may be used to prune the lexicon

Techniques to finalize a lexicon

Either, keep all terms from lexicon expansion (and seeds)
Or, prune the lexicon based on some threshold t of the confidence values of the terms

Confidence values of expanded-lexicon terms

The scores from lexicon expansion serve as confidence values
A candidate's value may be aggregated from multiple scores.
The aggregate score may have to be normalized to a defined range.
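
A minimal sketch of normalizing scores and pruning by a threshold t follows. Min-max normalization to [0, 1] is one possible choice, not necessarily the one intended by the flashcards; the function names and the raw scores are made up.

```python
def normalize(scores):
    """Min-max normalize raw scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {term: 1.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}

def prune(confidences, t):
    """Keep only terms whose confidence value is at least t."""
    return {term: c for term, c in confidences.items() if c >= t}

# Hypothetical aggregated raw scores from lexicon expansion.
raw = {"lens": 4.0, "zoom": 3.0, "thing": 0.0}
confidences = normalize(raw)
final_lexicon = prune(confidences, t=0.5)
```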

Confidence value of seed terms?

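Assume we are given a training set in which all mentions of the seed terms w1, ..., wk are marked. The confidence value of a seed term wi can then be estimated as the fraction of marked mentions of wi among all occurrences of wi.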
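
One way to read the seed-term case is that the confidence of a seed term wi is the share of its occurrences that are marked as mentions of the concept. The sketch below computes that ratio under this assumption; the input format (one pair per occurrence) and the function name seed_confidence are hypothetical.

```python
def seed_confidence(occurrences):
    """occurrences: (term, is_marked) pairs, one per occurrence in the training set."""
    total, marked = {}, {}
    for term, is_marked in occurrences:
        total[term] = total.get(term, 0) + 1
        marked[term] = marked.get(term, 0) + int(is_marked)
    # Confidence = marked mentions / all occurrences of the term.
    return {term: marked[term] / total[term] for term in total}

# Hypothetical annotated occurrences.
occurrences = [("screen", True), ("screen", True), ("screen", False), ("battery", True)]
seed_conf = seed_confidence(occurrences)
```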

Question 13

Q

Benefits and Limitations (Lexicon Acquisition)

Answer

A

Benefits

A lexicon is an intuitive representation of simple linguistic knowledge
Big lexicons can be acquired with largely unsupervised methods
Well-established techniques exist for acquisition, such as pointwise mutual information (PMI)

Limitations

Coming up with adequate seed terms may be non-straightforward
Increasing the size of a lexicon usually leads to a decrease in quality
Lexicons are inherently limited to the terms actually used in a text

Question 14

Q

Lexicon Matching

Answer

A

The identification of concepts in natural language texts, each being represented by a lexicon
This requires deciding when a matching term actually refers to a concept
Main goals include extracting concept instances and assessing texts

When to use lexicon matching?

A given lexicon can be used to find all term occurrences in a text
The existence of a given term in a lexicon can be checked
The density or distribution of vocabularies in a text can be measured
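
These uses can be illustrated with a few lines of Python. The lexicon, the example text, and the simple regular-expression tokenization are assumptions for the sketch.

```python
import re

# Hypothetical lexicon and text.
lexicon = {"great", "awful", "weak"}
text = "Great camera, weak battery, great price."

# Tokenize into lowercase word tokens.
tokens = re.findall(r"\w+", text.lower())

# Find all term occurrences (token position, term).
occurrences = [(i, tok) for i, tok in enumerate(tokens) if tok in lexicon]

# Check the existence of a given term in the lexicon.
has_awful = "awful" in lexicon

# Measure density: share of tokens that match the lexicon.
density = len(occurrences) / len(tokens)
```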

Attribute extraction:

The text analysis that extracts attributes of some entity from text
Input: a text, at least split into tokens
Output: the list of all extracted attributes

Role in NLP

Used for tasks such as aspect-based sentiment analysis or the extraction of complex events

Question 15

Q

Attribute extraction with lexicon matching

Answer

A

Why is lexicon matching not trivial?

Some terms may represent an attribute, but not always
Some terms are nested in other terms

Approach in a nutshell

Acquire a confidence lexicon based on a collection of reviews
Choose a threshold t ∈ [0,1]
Extract each lexicon term from a text that has a confidence value ≥ t
Prefer longer terms over shorter terms (and ignore capitalization)

Confidence lexicon:

A lexicon of attributes where each term is assigned a value ∈ [0,1]
The value represents the confidence that a term really is an attribute
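
The approach can be sketched as a greedy longest-match scan over the tokens. The function name extract_attributes, the example lexicon, and its confidence values are made up for illustration; how the confidence lexicon is acquired is left out.

```python
def extract_attributes(tokens, confidence_lexicon, t):
    """Extract lexicon terms with confidence >= t, preferring longer terms."""
    # Keep only sufficiently confident terms, as lowercase token tuples.
    terms = {tuple(term.lower().split())
             for term, c in confidence_lexicon.items() if c >= t}
    max_len = max((len(term) for term in terms), default=0)
    lowered = [tok.lower() for tok in tokens]  # ignore capitalization
    extracted, i = [], 0
    while i < len(lowered):
        # Try the longest span first so that nested shorter terms are skipped.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + n]) in terms:
                extracted.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return extracted

# Hypothetical confidence lexicon and tokenized review sentence.
confidence_lexicon = {"screen": 0.9, "screen size": 0.8, "size": 0.4, "thing": 0.1}
tokens = "The Screen size is fine but the screen flickers".split()
attributes = extract_attributes(tokens, confidence_lexicon, t=0.5)
```

Raising t keeps only the most reliable terms, while lowering it admits more terms at the cost of more false matches.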

Question 16

Q

Benefits and Limitations (Lexicon Matching)

Answer

A

Benefits:

Lexicon matching is particularly reliable for unambiguous terms
Lexicons with confidence values allow for trading precision for recall
The idea of matching a lexicon is well-explainable

Limitations:

Information that is not in the employed lexicons can never be found
Ambiguous terms require other methods for disambiguation
Composition of related information is hard to model with lexicons

(16 cards)