Practical Issues Flashcards

1
Q

Main Issues in NLP

A
  • Low effectiveness, due to data or approach limitations
  • Low efficiency, due to high run-time or memory consumption
  • Low robustness, due to domain-specific development
2
Q

NLP Processes

A
  • A single NLP algorithm usually realizes a method that infers one type of information from text, or generates one type of text
3
Q

Why NLP processes?

A
  • Many algorithms require as input the output of other methods, which in turn depend on further methods, and so forth
    • “An entity recognizer may need part-of-speech tagging, which needs tokenization, …”
  • Even a single type of output may require several methods
  • Most real-world NLP tasks aim at combinations of different types, such as those from information extraction.
  • Due to the interdependencies, the standard approach to realize a process is in the form of an algorithm pipeline
4
Q

Algorithm pipeline:

A
  • A set of algorithms along with a schedule that defines the order of algorithm application
  • Each algorithm takes as input a text and the output of all preceding algorithms, and it produces further output.
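A minimal sketch of this contract in Python (the tokenize/pos_tag steps are toy stand-ins, not real NLP algorithms):

```python
# Sketch of an algorithm pipeline: each step reads the text plus all
# outputs produced so far, and contributes its own output.
# The step functions are hypothetical stand-ins for real NLP algorithms.

def tokenize(text, outputs):
    return {"tokens": text.split()}

def pos_tag(text, outputs):
    # Toy tagger: capitalized tokens become nouns, everything else "X".
    return {"pos": ["NN" if t[0].isupper() else "X" for t in outputs["tokens"]]}

def run_pipeline(text, schedule):
    outputs = {}
    for algorithm in schedule:          # the schedule defines the order
        outputs.update(algorithm(text, outputs))
    return outputs

result = run_pipeline("Alice met Bob", [tokenize, pos_tag])
```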
5
Q

Pipeline scheduling

A
  • The input requirements of each algorithm need to be fulfilled
  • Some algorithms are independent, i.e., they have no defined ordering
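Deriving a schedule that fulfills all input requirements amounts to a topological sort of the dependency graph; a sketch with hypothetical dependency declarations:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each algorithm declares which outputs it requires as input.
# The dependencies below are illustrative.
requires = {
    "tokenization": set(),
    "pos_tagging": {"tokenization"},
    "entity_recognition": {"pos_tagging"},
    "sentence_splitting": set(),  # independent: no defined order vs. tokenization
}

# Any topological order is a valid schedule.
schedule = list(TopologicalSorter(requires).static_order())
```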
6
Q

Reasons for limited effectiveness

A
  • Ambiguity of natural language
  • Missing context and world knowledge
  • Process-related reasons: lack of training data, domain transfer, error accumulation
7
Q

Perfect effectiveness?

A
  • Noisy texts, errors in test data, subjective tasks, etc.
  • Only trivial tasks can generally be solved perfectly
8
Q

Process-related reasons for limited effectiveness

A

Lack of training data:

  • Training data may often not suffice to make a given approach effective
  • If more data cannot be acquired, one may resort to simpler techniques

Domain transfer of an approach

  • Approaches may fail on data very different from the training data
  • Ways out include heterogeneous training data and domain adaptation

Error accumulation

  • Errors propagate through an algorithm pipeline, since the output of one algorithm serves as input to subsequent ones
  • In standard pipelines, algorithms cannot fix errors of predecessors
  • Even when each algorithm works well, overall effectiveness may be low
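A back-of-the-envelope illustration of the last point (the 95% figure and the independence assumption are illustrative only):

```python
# Error accumulation: if each of n pipeline steps is individually 95%
# accurate and errors are independent, the chance that ALL steps are
# correct shrinks multiplicatively with pipeline length.

per_step_accuracy = 0.95
for n in (1, 3, 5):
    overall = per_step_accuracy ** n
    print(f"{n} steps: overall accuracy {overall:.3f}")
```

With five steps the pipeline is right only about 77% of the time, even though every single algorithm "works well".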
9
Q

Strategies to counter error accumulation

A

Joint inference algorithms

  • Infer multiple information types simultaneously, in order to find the optimal solution over all types
  • Knowledge from each task can be exploited for the others
    • Named entity recognition: Avoid confusion between different entity types
    • Argument mining: Segment and classify argument units in one step
  • This reduces run-time efficiency notably and limits reusability
10
Q

Pipeline extensions

A
  • Iterative pipelines: Repeat pipeline execution and use the output of later algorithms to improve the output of earlier ones
  • Probabilistic pipelines: Optimize a probability model based on different possible outputs and/or confidence values of each algorithm
  • Both require modifications of algorithms and notably reduce efficiency
11
Q

Practical effectiveness tweaks

A
  • Exploiting domain knowledge
    • Rule of thumb: The narrower the domain, the higher the effectiveness
    • Encoding domain-specific knowledge is important in practice
    • In-domain training is often a must for high effectiveness
  • Combining statistics and rules
    • Real-world NLP applications mostly combine statistical learning with hand-crafted rules
    • Rules are derived from a manual review of uncertain and difficult cases
  • Scaling up
    • At large scale, precision can be preferred over recall, assuming that the information sought appears multiple times
    • A smart use of redundancy increases confidence
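The redundancy idea can be sketched as counting repeated high-precision extractions (the example values are invented for illustration):

```python
from collections import Counter

# Using redundancy at scale: when a fact appears in many documents,
# keep only high-precision extractions and let repetition supply the
# confidence. The extractions below are illustrative.

extractions = ["Paris", "Paris", "Lyon", "Paris"]  # same slot, many docs
counts = Counter(extractions)
answer, support = counts.most_common(1)[0]
confidence = support / len(extractions)  # fraction of agreeing extractions
```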
12
Q

Reasons for limited efficiency

A
  • NLP pipelines often include several time-intensive algorithms
  • Large amounts of data may need to be processed, possibly repeatedly
  • Much information may be stored during processing
13
Q

Ways to improve memory efficiency

A
  • Scaling up is the natural solution to higher memory needs
  • Also, debugging (and minimizing) what information is stored may help
14
Q

Ways to improve run-time efficiency

A
  • Indexing of relevant information
  • Resort to simpler NLP algorithms
  • Filtering and scheduling in pipelines
  • Parallelization of NLP processes
15
Q

Potential memory efficiency issues

A

Memory consumption in NLP

  • Permanent and temporary storage of input texts and output information
  • Storage of algorithms and models during execution

Storage of inputs and outputs

  • Single input texts are usually small in NLP
  • Output information is negligible compared to input
  • The main problem may be the permanent storage of full text corpora

Storage of algorithms

  • Memory consumption may add up in longer text analysis pipelines
  • Machine learning brings up further challenges due to huge models
  • In both cases, powerful machines and/or parallelization are needed
16
Q

Indexing of relevant information

A
  • In applications such as web search, the same information may have to be obtained multiple times from a text
  • By storing and indexing information beforehand, the need for ad-hoc NLP can be avoided
  • Naturally, this is restricted to anticipated information needs
  • Also, it implies a trade-off between run-time and memory efficiency
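A sketch of the trade-off: run a (hypothetical) extractor once at indexing time, so that queries become lookups instead of ad-hoc NLP:

```python
# Trading memory for run-time: extract entities once, offline, and store
# them in an inverted index. extract_entities is an illustrative stand-in
# (here: capitalized words).

def extract_entities(text):
    return [w for w in text.split() if w[0].isupper()]

def build_index(corpus):
    index = {}                              # entity -> list of doc ids
    for doc_id, text in enumerate(corpus):  # NLP runs once, at indexing time
        for entity in extract_entities(text):
            index.setdefault(entity, []).append(doc_id)
    return index

index = build_index(["Alice met Bob", "Bob left"])
# Query time: a dictionary lookup instead of re-running the recognizer.
docs_with_bob = index.get("Bob", [])
```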
17
Q

Simpler algorithms

A
  • A natural way to improve run-time is to use simpler but faster algorithms
  • Large efficiency gains possible
  • At large scale, high effectiveness is possible via redundancy and precision focus
18
Q

Filtering relevant portions of text

A
  • Standard pipelines apply each algorithm to the whole input text
  • For a given NLP task, not all portions of a text are relevant
  • After each step, irrelevant portions can be filtered out
  • The size of the portions trades efficiency for effectiveness
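A sketch of filtering between steps, using an illustrative keyword test as the cheap relevance filter:

```python
# Filtering between pipeline steps: after a cheap step, drop text
# portions (here: sentences) that cannot be relevant to the task, so the
# expensive step runs on less input. The keyword filter and the example
# task are illustrative.

sentences = [
    "Revenue grew by 10 percent.",
    "The weather was sunny.",
    "Profit fell by 2 percent.",
]

# Step 1: cheap filter keeps only sentences that might mention figures.
candidates = [s for s in sentences if "percent" in s]

# Step 2: the expensive algorithm now processes fewer portions.
results = [s.upper() for s in candidates]  # stand-in for costly analysis
```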
19
Q

Optimal scheduling of pipelines

A
  • With filtering, the schedule of a pipeline's algorithms affects efficiency
  • Schedule optimization is a dynamic programming problem based on the run-times and “filter rates” of the algorithms

Intuition

  • Filter out many portions of text early
  • Schedule expensive algorithms late

Effects

  • Efficiency may be improved by more than an order of magnitude
  • If filtering matches the level on which the algorithms operate, effectiveness is maintained
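The intuition can be made concrete by comparing the expected cost of every schedule (run-times and filter rates below are invented for illustration; real optimizers use dynamic programming rather than brute force):

```python
from itertools import permutations

# Expected-cost sketch for pipeline scheduling: each algorithm has a
# run-time per portion and a "filter rate" (fraction of portions kept).
# Later algorithms only see what earlier ones let through.

algorithms = {
    "cheap_filter": (1.0, 0.2),   # (run-time, keep rate) - illustrative
    "tagger":       (5.0, 0.8),
    "recognizer":  (20.0, 0.5),
}

def expected_cost(schedule):
    remaining, cost = 1.0, 0.0
    for name in schedule:
        run_time, keep_rate = algorithms[name]
        cost += run_time * remaining   # pay only for surviving portions
        remaining *= keep_rate
    return cost

# Brute force over all schedules; the aggressive filter ends up first,
# the expensive algorithm last.
best = min(permutations(algorithms), key=expected_cost)
```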
20
Q

Parallelization

A

Text analysis entails “natural” parallelization

  • Input texts are usually analyzed in isolation, allowing their distribution

Basic parallelization scenario

  • One master machine, many slaves
  • Master sends input to slaves
  • Slaves process input and produce output
  • Master aggregates output

Homogeneous parallel system architecture

  • All machines comparable in terms of speed etc.
  • No specialized hardware
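The basic scenario can be sketched with a worker pool standing in for the slave machines (word_count is an illustrative stand-in for a full analysis pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

# Master/slave sketch: the master distributes input texts, each worker
# analyzes one text in isolation, and the master aggregates the outputs.
# A thread pool stands in for a homogeneous set of slave machines.

def word_count(text):
    return len(text.split())

def master(texts, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:  # the "slaves"
        outputs = list(pool.map(word_count, texts))        # distribute
    return sum(outputs)                                    # aggregate

total = master(["Alice met Bob", "Bob left"])
```

Because texts are analyzed in isolation, the map step needs no coordination between workers; only the final aggregation is serial.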
21
Q

Discussion of the parallelization approaches

A

Analysis pipelining

  • Pro: Low memory consumption, lower execution time
  • Con: Not fault-tolerant, high network overhead, machine idle times

Analysis parallelization

  • Pro: Low memory consumption, possibly lower execution time
  • Con: Not fault-tolerant, network overhead, high machine idle times

Pipeline duplication

  • Pro: Very fault-tolerant, no idle times, much lower execution time
  • Con: Full memory consumption on every slave

Schedule parallelization

  • Pro: Fault-tolerant, few idle times, lower memory consumption, much lower execution time
  • Con: Some network overhead, more complex process control