Practical Issues Flashcards
Main Issues in NLP
- Low effectiveness, due to data or approach limitations
- Low efficiency, due to high run-time or memory consumption
- Low robustness, due to domain-specific development
NLP Processes
- A single NLP algorithm usually realizes a method that infers one type of information from text, or generates one type of text
Why NLP processes?
- Many algorithms require as input the output of other methods, which in turn depend on further methods, and so forth
- Example: An entity recognizer may need part-of-speech tagging, which in turn needs tokenization, and so on
- Even a single type of output may require several methods
- Most real-world NLP tasks aim at combinations of different types, such as those from information extraction.
- Due to these interdependencies, the standard approach to realize a process is in the form of an algorithm pipeline
Algorithm pipeline
- A set of algorithms along with a schedule that defines the order of algorithm application
- Each algorithm takes as input a text and the output of all preceding algorithms, and it produces further output.
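A minimal sketch of this structure in Python (all function names hypothetical): each algorithm receives the text plus the outputs accumulated so far, and adds its own output.

```python
from typing import Callable

# An algorithm reads the text and all outputs produced so far,
# and returns its own output under a new key.
Algorithm = Callable[[str, dict], dict]

def tokenize(text: str, outputs: dict) -> dict:
    return {"tokens": text.split()}

def pos_tag(text: str, outputs: dict) -> dict:
    # Dummy tagger; it relies on the tokenizer's output being present.
    return {"pos": [(tok, "NN") for tok in outputs["tokens"]]}

def run_pipeline(text: str, schedule: list[Algorithm]) -> dict:
    outputs: dict = {}
    for algorithm in schedule:  # apply algorithms in scheduled order
        outputs.update(algorithm(text, outputs))
    return outputs

print(run_pipeline("NLP pipelines chain algorithms", [tokenize, pos_tag]))
```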
Pipeline scheduling
- The input requirements of each algorithm need to be fulfilled
- Some algorithms are independent, i.e., they have no defined ordering
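Assuming each algorithm declares which other algorithms it requires, a valid schedule is any topological order of the dependency graph; a sketch using Python's standard library (the dependencies are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical requirements: algorithm -> algorithms it needs as input.
requires = {
    "tokenizer": set(),
    "pos_tagger": {"tokenizer"},
    "chunker": {"pos_tagger"},
    "entity_recognizer": {"pos_tagger"},  # independent of the chunker
}

# Any topological order fulfills all input requirements; the two
# independent algorithms may appear in either order.
schedule = list(TopologicalSorter(requires).static_order())
print(schedule)
```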
Reasons for limited effectiveness
- Ambiguity of natural language
- Missing context and world knowledge
- Process-related reasons: lack of training data, domain transfer, error accumulation
Perfect effectiveness?
- Noisy texts, errors in test data, subjective tasks, etc.
- Only trivial tasks can generally be solved perfectly
Process-related reasons for limited effectiveness
Lack of training data:
- Training data may often not suffice to make a given approach effective
- If more data cannot be acquired, one may resort to simpler techniques
Domain transfer of an approach
- Approaches may fail on data very different from the training data
- Ways out include heterogeneous training data and domain adaptation
Error accumulation
- Errors propagate through an algorithm pipeline, since the output of one algorithm serves as input to subsequent ones
- In standard pipelines, algorithms cannot fix errors of predecessors
- Even when each algorithm works well, overall effectiveness may be low
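To see the effect: under the simplifying assumption that errors are independent, the per-step accuracies multiply along the pipeline (the numbers below are illustrative only):

```python
# Illustrative accuracies of three pipelined algorithms.
accuracies = [0.97, 0.95, 0.90]

overall = 1.0
for accuracy in accuracies:
    overall *= accuracy  # errors compound, since outputs feed forward

print(f"Overall accuracy ~{overall:.3f}")  # ~0.829, below every single step
```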
Strategies to counter error accumulation
Joint inference algorithms
- Infer multiple information types simultaneously, in order to find the optimal solution over all types
- Knowledge from each task can be exploited for the others
- Named entity recognition: Avoid confusion between different entity types
- Argument mining: Segment and classify argument units in one step (sketched below)
- This reduces run-time efficiency notably and limits reusability
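For the argument mining example, one common realization is to collapse segmentation and classification into a single joint label space, so that one sequence labeler infers both at once; a minimal sketch (the unit types are hypothetical):

```python
# Joint BIO-style labels: the segmentation decision (B/I/O) and the
# unit type (claim vs. premise) are predicted simultaneously.
unit_types = ["Claim", "Premise"]
joint_labels = ["O"] + [f"{bio}-{t}" for t in unit_types for bio in ("B", "I")]

print(joint_labels)
# ['O', 'B-Claim', 'I-Claim', 'B-Premise', 'I-Premise']
# One tagger over these labels segments and classifies in one step.
```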
Pipeline extensions
- Iterative pipelines: Repeat pipeline execution and use the output of later algorithms to improve the output of earlier ones
- Probabilistic pipelines: Optimize a probability model based on different possible outputs and/or confidence values of each algorithm (see the sketch below)
- Both require modifications of algorithms and notably reduce efficiency
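A minimal sketch of the probabilistic idea (candidate outputs and confidence values are hypothetical): instead of committing to each algorithm's single best output, keep several candidates and pick the combination with the highest joint confidence.

```python
# Hypothetical candidates of two pipeline steps with confidence values.
segmentations = [("seg_a", 0.6), ("seg_b", 0.4)]
classifications = {"seg_a": [("claim", 0.5)], "seg_b": [("premise", 0.9)]}

best, best_score = None, 0.0
for seg, seg_conf in segmentations:
    for label, label_conf in classifications[seg]:
        score = seg_conf * label_conf  # joint confidence over both steps
        if score > best_score:
            best, best_score = (seg, label), score

# The greedy first choice seg_a (0.6 * 0.5 = 0.30) loses against the
# jointly better seg_b (0.4 * 0.9 = 0.36).
print(best, round(best_score, 2))
```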
Practical effectiveness tweaks
- Exploiting domain knowledge
- Rule of thumb: The narrower the domain, the higher the effectiveness
- Encoding domain-specific knowledge is important in practice
- In-domain training is often a must for high effectiveness
- Combining statistics and rules
- Real-world NLP applications mostly combine statistical learning with hand-crafted rules (see the sketch after this list)
- Rules are derived from a manual review of uncertain and difficult cases
- Scaling up
- At large scale, precision can be preferred over recall, assuming that the information sought appears multiple times
- A smart use of redundancy increases confidence
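A minimal sketch of the statistics-plus-rules combination mentioned above (the model and the rules are hypothetical): a learned classifier makes the default prediction, and hand-crafted rules override it on uncertain cases.

```python
def classify_statistically(text: str) -> tuple[str, float]:
    # Stand-in for a trained model returning (label, confidence).
    return ("positive", 0.55)

# Hand-crafted rules, derived from a manual review of difficult cases.
RULES = [
    (lambda t: "not recommended" in t.lower(), "negative"),
    (lambda t: "highly recommend" in t.lower(), "positive"),
]

def classify(text: str, threshold: float = 0.7) -> str:
    label, confidence = classify_statistically(text)
    if confidence < threshold:  # uncertain case: let the rules decide
        for matches, rule_label in RULES:
            if matches(text):
                return rule_label
    return label

print(classify("This product is not recommended at all."))  # 'negative'
```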
Reasons for limited efficiency
- NLP pipelines often include several time-intensive algorithms
- Large amounts of data may need to be processed, possibly repeatedly
- Much information may be stored during processing
Ways to improve memory efficiency
- Scaling up is the natural solution to higher memory needs
- Also, debugging (and minimizing) what information is stored may help
Ways to improve run-time efficiency
- Indexing of relevant information
- Resort to simpler NLP algorithms
- Filtering and scheduling in pipelines
- Parallelization of NLP processes
Potential memory efficiency issues
Memory consumption in NLP
- Permanent and temporary storage of input texts and output information
- Storage of algorithms and models during execution
Storage of inputs and outputs
- Single input texts are usually small in NLP
- Output information is negligible compared to input
- The main problem may be the permanent storage of full text corpora
Storage of algorithms
- Memory consumption may add up in longer text analysis pipelines
- Machine learning brings up further challenges due to huge models
- In both cases, powerful machines and/or parallelization are needed
Indexing of relevant information
- In applications such as web search, the same information may have to be obtained multiple times from a text
- By storing and indexing information beforehand, the need for ad-hoc NLP can be avoided
- Naturally, this is restricted to anticipated information needs
- Also, it implies a trade-off between run-time and memory efficiency
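A toy illustration of the idea (an inverted index over tokens; corpus and keys hypothetical): the information is computed and indexed once, so later lookups need no ad-hoc analysis.

```python
from collections import defaultdict

corpus = {
    "doc1": "nlp pipelines chain algorithms",
    "doc2": "indexing avoids repeated nlp runs",
}

# Build the index once, ahead of time: token -> ids of documents
# containing it (the anticipated information need).
index: dict[str, set[str]] = defaultdict(set)
for doc_id, text in corpus.items():
    for token in text.split():
        index[token].add(doc_id)

# Later lookups are cheap, at the cost of the memory the index takes.
print(sorted(index["nlp"]))  # ['doc1', 'doc2']
```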
Simpler algorithms
- A natural way to improve run-time is to use simpler but faster algorithms
- Large efficiency gains possible
- At large scale, high effectiveness is possible via redundancy and precision focus
Filtering relevant portions of text
- Standard pipelines apply each algorithm to the whole input text
- For a given NLP task, not all portions of a text are relevant
- After each step, irrelevant portions can be filtered out
- The size of the portions trades efficiency for effectiveness
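A minimal sketch of filtering on sentence-sized portions (the relevance heuristic and the task are hypothetical): a cheap check runs first, so the expensive algorithm only sees the portions that survive.

```python
sentences = [
    "Berlin is a city.",
    "The weather was fine.",
    "Paris hosted the event in 2024.",
]

def mentions_year(sentence: str) -> bool:
    # Cheap heuristic filter: keep only sentences with a 4-digit token.
    tokens = sentence.rstrip(".").split()
    return any(tok.isdigit() and len(tok) == 4 for tok in tokens)

def expensive_extraction(sentence: str) -> str:
    return f"relations({sentence})"  # stand-in for a costly algorithm

relevant = [s for s in sentences if mentions_year(s)]  # filter step
print([expensive_extraction(s) for s in relevant])  # only one sentence left
```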
Optimal scheduling of pipelines
- With filtering, the schedule of a pipeline's algorithms affects efficiency
- Schedule optimization is a dynamic programming problem based on the run-times and “filter rates” of the algorithms (see the toy computation below)
Intuition
- Filter out many portions of text early
- Schedule expensive algorithms late
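A toy computation of this trade-off (run-times and filter rates are made up): the expected cost of a schedule is the sum of each algorithm's run-time, weighted by the fraction of text portions still remaining when it runs.

```python
# name -> (run-time per portion, filter rate = fraction of portions kept)
algorithms = {"cheap_filter": (1.0, 0.2), "expensive_tagger": (10.0, 0.9)}

def expected_cost(schedule: list[str]) -> float:
    cost, remaining = 0.0, 1.0
    for name in schedule:
        runtime, filter_rate = algorithms[name]
        cost += remaining * runtime  # pay only for portions still left
        remaining *= filter_rate     # later algorithms see fewer portions
    return cost

print(expected_cost(["cheap_filter", "expensive_tagger"]))  # 1 + 0.2*10 = 3.0
print(expected_cost(["expensive_tagger", "cheap_filter"]))  # 10 + 0.9*1 = 10.9
```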
Effects
- Efficiency may be improved by more than an order of magnitude
- If filtering matches the level on which the algorithms operate, effectiveness is maintained
Parallelization
Text analysis entails “natural” parallelization
- Input texts are usually analyzed in isolation, allowing their distribution
Basic parallelization scenario
- One master machine, many slaves
- Master sends input to slaves
- Slaves process input and produce output
- Master aggregates output
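A minimal sketch of this scenario with Python's multiprocessing (the analysis function is a stand-in): the master process distributes the texts, the workers analyze them in isolation, and the master collects the outputs.

```python
from multiprocessing import Pool

def analyze(text: str) -> int:
    # Stand-in for a full text analysis of one input.
    return len(text.split())

if __name__ == "__main__":
    texts = ["first input text", "second one", "a third input text here"]
    with Pool(processes=3) as pool:         # master spawns the workers
        outputs = pool.map(analyze, texts)  # distribute inputs, run analyses
    print(outputs)                          # master aggregates: [3, 2, 5]
```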
Homogeneous parallel system architecture
- All machines comparable in terms of speed etc.
- No specialized hardware
Discussion of the parallelization approaches
Analysis pipelining
- Pro: Low memory consumption, lower execution time
- Con: Not fault-tolerant, high network overhead, machine idle times
Analysis parallelization
- Pro: Low memory consumption, possibly lower execution time
- Con: Not fault-tolerant, network overhead, high machine idle times
Pipeline duplication
- Pro: Very fault-tolerant, no idle times, much lower execution time
- Con: Full memory consumption on every slave
Schedule parallelization
- Pro: Fault-tolerant, few idle times, lower memory consumption, much lower execution time
- Con: Some network overhead, more complex process control