Practical Issues Flashcards
Main Issues in NLP
- Low effectiveness, due to data or approach limitations
- Low efficiency, due to high run-time or memory consumption
- Low robustness, due to domain-specific development
NLP Processes
- A single NLP algorithm usually realizes a method that infers one type of information from text, or generates one type of text
Why NLP processes?
- Many algorithms require as input the output of other methods, which in turn depend on further methods, and so forth
- Example: An entity recognizer may need part-of-speech tagging, which in turn needs tokenization, and so on
- Even a single type of output may require several methods
- Most real-world NLP tasks aim at combinations of different types, such as those from information extraction.
- Due to these interdependencies, the standard approach to realize a process is in the form of an algorithm pipeline
Algorithm pipeline
- A set of algorithms along with a schedule that defines the order of algorithm application
- Each algorithm takes as input a text and the output of all preceding algorithms, and it produces further output.
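A minimal sketch of this structure in Python (all function names hypothetical): each algorithm receives the text plus the outputs accumulated so far, and adds its own output.

```python
from typing import Callable

# An algorithm reads the text and all outputs produced so far,
# and returns its own output under a new key.
Algorithm = Callable[[str, dict], dict]

def tokenize(text: str, outputs: dict) -> dict:
    return {"tokens": text.split()}

def pos_tag(text: str, outputs: dict) -> dict:
    # Dummy tagger; it relies on the tokenizer's output being present.
    return {"pos": [(tok, "NN") for tok in outputs["tokens"]]}

def run_pipeline(text: str, schedule: list[Algorithm]) -> dict:
    outputs: dict = {}
    for algorithm in schedule:  # apply algorithms in scheduled order
        outputs.update(algorithm(text, outputs))
    return outputs

print(run_pipeline("NLP pipelines chain algorithms", [tokenize, pos_tag]))
```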
Pipeline scheduling
- The input requirements of each algorithm need to be fulfilled
- Some algorithms are independent, i.e., they have no defined ordering
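Assuming each algorithm declares which other algorithms it requires, a valid schedule is any topological order of the dependency graph; a sketch using Python's standard library (the dependencies are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical requirements: algorithm -> algorithms it needs as input.
requires = {
    "tokenizer": set(),
    "pos_tagger": {"tokenizer"},
    "chunker": {"pos_tagger"},
    "entity_recognizer": {"pos_tagger"},  # independent of the chunker
}

# Any topological order fulfills all input requirements; the two
# independent algorithms may appear in either order.
schedule = list(TopologicalSorter(requires).static_order())
print(schedule)
```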
Reasons for limited effectiveness
- Ambiguity of natural language
- Missing context and world knowledge
- Process-related reasons: lack of training data, domain transfer, error accumulation
Perfect effectiveness?
- Noisy texts, errors in test data, subjective tasks, etc.
- Only trivial tasks can generally be solved perfectly
Process-related reasons for limited effectiveness
Lack of training data:
- Training data may often not suffice to make a given approach effective
- If more data cannot be acquired, one may resort to simpler techniques
Domain transfer of an approach
- Approaches may fail on data very different from the training data
- Ways out include heterogeneous training data and domain adaptation
Error accumulation
- Errors propagate through an algorithm pipeline, since the output of one algorithm serves as input to subsequent ones
- In standard pipelines, algorithms cannot fix errors of predecessors
- Even when each algorithm works well, overall effectiveness may be low
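To see the effect: under the simplifying assumption that errors are independent, the per-step accuracies multiply along the pipeline (the numbers below are illustrative only):

```python
# Illustrative accuracies of three pipelined algorithms.
accuracies = [0.97, 0.95, 0.90]

overall = 1.0
for accuracy in accuracies:
    overall *= accuracy  # errors compound, since outputs feed forward

print(f"Overall accuracy ~{overall:.3f}")  # ~0.829, below every single step
```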
Strategies to counter error accumulation
Joint inference algorithms
- Infer multiple information types simultaneously, in order to find the optimal solution over all types
- Knowledge from each task can be exploited for the others
- Named entity recognition: Avoid confusion between different entity types
- Argument mining: Segment and classify argument units in one step (sketched below)
- This reduces run-time efficiency notably and limits reusability
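For the argument mining example, one common realization is to collapse segmentation and classification into a single joint label space, so that one sequence labeler infers both at once; a minimal sketch (the unit types are hypothetical):

```python
# Joint BIO-style labels: the segmentation decision (B/I/O) and the
# unit type (claim vs. premise) are predicted simultaneously.
unit_types = ["Claim", "Premise"]
joint_labels = ["O"] + [f"{bio}-{t}" for t in unit_types for bio in ("B", "I")]

print(joint_labels)
# ['O', 'B-Claim', 'I-Claim', 'B-Premise', 'I-Premise']
# One tagger over these labels segments and classifies in one step.
```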
Pipeline extensions
- Iterative pipelines: Repeat pipeline execution and use the output of later algorithms to improve the output of earlier ones
- Probabilistic pipelines: Optimize a probability model based on different possible outputs and/or confidence values of each algorithm (see the sketch below)
- Both require modifications of algorithms and notably reduce efficiency
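A minimal sketch of the probabilistic idea (candidate outputs and confidence values are hypothetical): instead of committing to each algorithm's single best output, keep several candidates and pick the combination with the highest joint confidence.

```python
# Hypothetical candidates of two pipeline steps with confidence values.
segmentations = [("seg_a", 0.6), ("seg_b", 0.4)]
classifications = {"seg_a": [("claim", 0.5)], "seg_b": [("premise", 0.9)]}

best, best_score = None, 0.0
for seg, seg_conf in segmentations:
    for label, label_conf in classifications[seg]:
        score = seg_conf * label_conf  # joint confidence over both steps
        if score > best_score:
            best, best_score = (seg, label), score

# The greedy first choice seg_a (0.6 * 0.5 = 0.30) loses against the
# jointly better seg_b (0.4 * 0.9 = 0.36).
print(best, round(best_score, 2))
```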
Practical effectiveness tweaks
- Exploiting domain knowledge
- Rule of thumb: The narrower the domain, the higher the effectiveness
- Encoding domain-specific knowledge is important in practice
- In-domain training is often a must for high effectiveness
- Combining statistics and rules
- Real-world NLP applications mostly combine statistical learning with hand-crafted rules (see the sketch after this list)
- Rules are derived from a manual review of uncertain and difficult cases
- Scaling up
- At large scale, precision can be preferred over recall, assuming that the information sought appears multiple times
- A smart use of redundancy increases confidence
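A minimal sketch of the statistics-plus-rules combination mentioned above (the model and the rules are hypothetical): a learned classifier makes the default prediction, and hand-crafted rules override it on uncertain cases.

```python
def classify_statistically(text: str) -> tuple[str, float]:
    # Stand-in for a trained model returning (label, confidence).
    return ("positive", 0.55)

# Hand-crafted rules, derived from a manual review of difficult cases.
RULES = [
    (lambda t: "not recommended" in t.lower(), "negative"),
    (lambda t: "highly recommend" in t.lower(), "positive"),
]

def classify(text: str, threshold: float = 0.7) -> str:
    label, confidence = classify_statistically(text)
    if confidence < threshold:  # uncertain case: let the rules decide
        for matches, rule_label in RULES:
            if matches(text):
                return rule_label
    return label

print(classify("This product is not recommended at all."))  # 'negative'
```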
Reasons for limited efficiency
- NLP pipelines often include several time-intensive algorithms
- Large amounts of data may need to be processed, possibly repeatedly
- Much information may be stored during processing
Ways to improve memory efficiency
- Scaling up is the natural solution to higher memory needs
- Also, debugging (and minimizing) what information is stored may help
Ways to improve run-time efficiency
- Indexing of relevant information
- Resort to simpler NLP algorithms
- Filtering and scheduling in pipelines
- Parallelization of NLP processes
Potential memory efficiency issues
Memory consumption in NLP
- Permanent and temporary storage of input texts and output information
- Storage of algorithms and models during execution
Storage of inputs and outputs
- Single input texts are usually small in NLP
- Output information is negligible compared to input
- The main problem may be the permanent storage of full text corpora
Storage of algorithms
- Memory consumption may add up in longer text analysis pipelines
- Machine learning brings up further challenges due to huge models
- In both cases, powerful machines and/or parallelization are needed
Indexing of relevant information
- In applications such as web search, the same information may have to be obtained multiple times from a text
- By storing and indexing information beforehand, the need for ad-hoc NLP can be avoided
- Naturally, this is restricted to anticipated information needs
- Also, it implies a trade-off between run-time and memory efficiency
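A toy illustration of the idea (an inverted index over tokens; corpus and keys hypothetical): the information is computed and indexed once, so later lookups need no ad-hoc analysis.

```python
from collections import defaultdict

corpus = {
    "doc1": "nlp pipelines chain algorithms",
    "doc2": "indexing avoids repeated nlp runs",
}

# Build the index once, ahead of time: token -> ids of documents
# containing it (the anticipated information need).
index: dict[str, set[str]] = defaultdict(set)
for doc_id, text in corpus.items():
    for token in text.split():
        index[token].add(doc_id)

# Later lookups are cheap, at the cost of the memory the index takes.
print(sorted(index["nlp"]))  # ['doc1', 'doc2']
```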
Simpler algorithms
- A natural way to improve run-time is to use simpler but faster algorithms
- Large efficiency gains possible
- At large scale, high effectiveness is possible via redundancy and precision focus
Filtering relevant portions of text
- Standard pipelines apply each algorithm to the whole input text
- For a given NLP task, not all portions of a text are relevant
- After each step, irrelevant portions can be filtered out
- The size of the portions trades efficiency for effectiveness
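A minimal sketch of filtering on sentence-sized portions (the relevance heuristic and the task are hypothetical): a cheap check runs first, so the expensive algorithm only sees the portions that survive.

```python
sentences = [
    "Berlin is a city.",
    "The weather was fine.",
    "Paris hosted the event in 2024.",
]

def mentions_year(sentence: str) -> bool:
    # Cheap heuristic filter: keep only sentences with a 4-digit token.
    tokens = sentence.rstrip(".").split()
    return any(tok.isdigit() and len(tok) == 4 for tok in tokens)

def expensive_extraction(sentence: str) -> str:
    return f"relations({sentence})"  # stand-in for a costly algorithm

relevant = [s for s in sentences if mentions_year(s)]  # filter step
print([expensive_extraction(s) for s in relevant])  # only one sentence left
```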
Optimal scheduling of pipelines
- With filtering, the schedule of a pipeline's algorithms affects efficiency
- Schedule optimization is a dynamic programming problem based on the run-times and “filter rates” of the algorithms (see the toy computation below)
Intuition
- Filter out many portions of text early
- Schedule expensive algorithms late
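A toy computation of this trade-off (run-times and filter rates are made up): the expected cost of a schedule is the sum of each algorithm's run-time, weighted by the fraction of text portions still remaining when it runs.

```python
# name -> (run-time per portion, filter rate = fraction of portions kept)
algorithms = {"cheap_filter": (1.0, 0.2), "expensive_tagger": (10.0, 0.9)}

def expected_cost(schedule: list[str]) -> float:
    cost, remaining = 0.0, 1.0
    for name in schedule:
        runtime, filter_rate = algorithms[name]
        cost += remaining * runtime  # pay only for portions still left
        remaining *= filter_rate     # later algorithms see fewer portions
    return cost

print(expected_cost(["cheap_filter", "expensive_tagger"]))  # 1 + 0.2*10 = 3.0
print(expected_cost(["expensive_tagger", "cheap_filter"]))  # 10 + 0.9*1 = 10.9
```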
Effects
- Efficiency may be improved by more than an order of magnitude
- If filtering matches the level on which the algorithms operate, effectiveness is maintained
Parallelization
Text analysis entails “natural” parallelization
- Input texts are usually analyzed in isolation, allowing their distribution
Basic parallelization scenario
- One master machine, many slaves
- Master sends input to slaves
- Slaves process input and produce output
- Master aggregates output
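A minimal sketch of this scenario with Python's multiprocessing (the analysis function is a stand-in): the master process distributes the texts, the workers analyze them in isolation, and the master collects the outputs.

```python
from multiprocessing import Pool

def analyze(text: str) -> int:
    # Stand-in for a full text analysis of one input.
    return len(text.split())

if __name__ == "__main__":
    texts = ["first input text", "second one", "a third input text here"]
    with Pool(processes=3) as pool:         # master spawns the workers
        outputs = pool.map(analyze, texts)  # distribute inputs, run analyses
    print(outputs)                          # master aggregates: [3, 2, 5]
```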
Homogeneous parallel system architecture
- All machines comparable in terms of speed etc.
- No specialized hardware
Discussion of the parallelization approaches
Analysis pipelining
- Pro: Low memory consumption, lower execution time
- Con: Not fault-tolerant, high network overhead, machine idle times
Analysis parallelization
- Pro: Low memory consumption, possibly lower execution time
- Con: Not fault-tolerant, network overhead, high machine idle times
Pipeline duplication
- Pro: Very fault-tolerant, no idle times, much lower execution time
- Con: Full memory consumption on every slave
Schedule parallelization
- Pro: Fault-tolerant, few idle times, lower memory consumption, much lower execution time
- Con: Some network overhead, more complex process control