Practical Issues Flashcards
1
Q
Main Issues in NLP
A
- Low effectiveness, due to data or approach limitations
- Low efficiency, due to high run-time or memory consumption
- Low robustness, due to domain-specific development
2
Q
NLP Processes
A
- A single NLP algorithm usually realizes a method that infers one type of information from text, or generates one type of text
3
Q
Why NLP processes?
A
- Many algorithms require as input the output of other methods, which in turn depend on further methods, and so forth
- “An entity recognizer may need part-of-speech tagging, which needs tokenization, …”
- Even a single type of output may require several methods
- Most real-world NLP tasks aim at combinations of different types, such as those from information extraction.
- Due to the interdependencies, the standard approach to realize a process is in the form of an algorithm pipeline
4
Q
Algorithm pipeline:
A
- A set of algorithms along with a schedule that defines the order of algorithm application
- Each algorithm takes as input a text and the output of all preceding algorithms, and it produces further output.
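The pipeline definition above can be sketched minimally in Python. The three toy steps (a whitespace tokenizer, a capitalization-based tagger, and an NNP-based entity recognizer) are illustrative stand-ins, not real NLP algorithms; the point is that each algorithm reads the text plus the output of all predecessors and adds further output.

```python
# Minimal sketch of an algorithm pipeline: each step reads the text and the
# annotations produced by its predecessors, then adds its own output.
# The step logic is a toy stand-in, not a real NLP library.

def tokenize(text, annotations):
    annotations["tokens"] = text.split()

def pos_tag(text, annotations):
    # Toy tagger: capitalized words become NNP, everything else NN.
    annotations["pos"] = ["NNP" if t[0].isupper() else "NN"
                          for t in annotations["tokens"]]

def recognize_entities(text, annotations):
    # Toy recognizer: treat NNP-tagged tokens as entities.
    annotations["entities"] = [t for t, p in zip(annotations["tokens"],
                                                 annotations["pos"])
                               if p == "NNP"]

def run_pipeline(text, schedule):
    annotations = {}
    for algorithm in schedule:       # schedule = order of application
        algorithm(text, annotations)
    return annotations

result = run_pipeline("Alice met Bob in Paris",
                      [tokenize, pos_tag, recognize_entities])
print(result["entities"])  # ['Alice', 'Bob', 'Paris']
```

Note that swapping `pos_tag` before `tokenize` in the schedule would crash, which is exactly the scheduling constraint the next card describes.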
5
Q
Pipeline scheduling
A
- The input requirements of each algorithm need to be fulfilled
- Some algorithms are independent, i.e., they have no defined ordering
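Scheduling can be seen as a topological sort over the algorithms' input requirements. A minimal sketch with Python's standard-library `graphlib` (the algorithm names and dependencies below are hypothetical): any order that respects the dependencies is valid, and the independent algorithms (`pos` and `sentences`) may appear in either order.

```python
# Pipeline scheduling as a topological sort: each algorithm declares which
# predecessors it requires; a valid schedule is any topological order.
from graphlib import TopologicalSorter  # Python 3.9+

requires = {
    "tokenize":  set(),                  # no requirements
    "sentences": {"tokenize"},           # independent of "pos"
    "pos":       {"tokenize"},           # independent of "sentences"
    "entities":  {"tokenize", "pos"},
}

schedule = list(TopologicalSorter(requires).static_order())
print(schedule)  # "tokenize" first, "entities" last
```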
6
Q
Reasons for limited effectiveness
A
- Ambiguity of natural language
- Missing context and world knowledge
- Process-related reasons: lack of training data, domain transfer, error accumulation
7
Q
Perfect effectiveness?
A
- Noisy texts, errors in test data, subjective tasks, etc.
- Only trivial tasks can generally be solved perfectly
8
Q
Process-related reasons for limited effectiveness
A
Lack of training data:
- Training data may often not suffice to make a given approach effective
- If more data cannot be acquired, one may resort to simpler techniques
Domain transfer of an approach:
- Approaches may fail on data very different from the training data
- Ways out include heterogeneous training data and domain adaptation
Error accumulation:
- Errors propagate through an algorithm pipeline, since the output of one algorithm serves as input to subsequent ones
- In standard pipelines, algorithms cannot fix errors of predecessors
- Even when each algorithm works well, overall effectiveness may be low
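The last point is easy to quantify with a back-of-the-envelope calculation: assuming errors are independent, the chance that all stages of a pipeline are correct is the product of the per-stage accuracies, so four "good" algorithms at 95% each already drop below 82% overall.

```python
# Error accumulation: the probability that every stage of a pipeline is
# correct is the product of the per-stage accuracies (assuming independence).
accuracies = [0.95, 0.95, 0.95, 0.95]

overall = 1.0
for a in accuracies:
    overall *= a

print(f"{overall:.3f}")  # 0.815 -- each stage is good, the pipeline much less so
```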
9
Q
Strategies to counter error accumulation
A
Joint inference algorithms
- Infer multiple information types simultaneously, in order to find the optimal solution over all types
- Knowledge from each task can be exploited for the others
- Named entity recognition: Avoid confusion between different entity types
- Argument mining: Segment and classify argument units in one step
- This reduces run-time efficiency notably and limits reusability
10
Q
Pipeline extensions
A
- Iterative pipelines: Repeat pipeline execution and use the output of later algorithms to improve the output of earlier ones
- Probabilistic pipelines: Optimize a probability model based on different possible outputs and/or confidence values of each algorithm
- Both require modifications of algorithms and notably reduce efficiency
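The probabilistic-pipeline idea can be sketched as follows: instead of each algorithm committing to one output, candidates are kept together with confidence values, and the globally most probable compatible combination is chosen. All candidate values, confidences, and the compatibility rule below are made up for illustration.

```python
# Sketch of a probabilistic pipeline: keep several candidate outputs per
# algorithm with confidence values, then pick the best joint combination.
from itertools import product

# Hypothetical candidates with confidences from two pipeline stages.
pos_candidates = [("NN", 0.6), ("VB", 0.4)]
entity_candidates = [("PERSON", 0.7), ("NONE", 0.3)]

def compatible(pos, entity):
    # Made-up joint constraint: a verb tag rules out a PERSON entity.
    return not (pos == "VB" and entity == "PERSON")

best = max(
    ((p, e, cp * ce)
     for (p, cp), (e, ce) in product(pos_candidates, entity_candidates)
     if compatible(p, e)),
    key=lambda triple: triple[2],
)
print(best[0], best[1])  # NN PERSON
```

The search over all combinations also hints at why such extensions notably reduce efficiency.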
11
Q
Practical effectiveness tweaks
A
- Exploiting domain knowledge
- Rule of thumb: The narrower the domain, the higher the effectiveness
- Encoding domain-specific knowledge is important in practice
- In-domain training is often a must for high effectiveness
- Combining statistics and rules
- Real-world NLP applications mostly combine statistical learning with hand-crafted rules
- Rules are derived from a manual review of uncertain and difficult cases
- Scaling up
- At large scale, precision can be preferred over recall, assuming that the information sought appears multiple times
- A smart use of redundancy increases confidence
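A minimal sketch of the redundancy idea, with made-up extraction results: when the same fact is extracted from many documents, agreement among the extractions can serve as a confidence estimate, and outliers can be discarded.

```python
# Using redundancy at scale: repeated agreement among extractions of the
# same fact raises confidence; deviating values are treated as noise.
from collections import Counter

extractions = ["1976", "1976", "1967", "1976"]  # hypothetical redundant results

value, count = Counter(extractions).most_common(1)[0]
confidence = count / len(extractions)
print(value, confidence)  # 1976 0.75
```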
12
Q
Reasons for limited efficiency
A
- NLP pipelines often include several time-intensive algorithms
- Large amounts of data may need to be processed, possibly repeatedly
- Much information may be stored during processing
13
Q
Ways to improve memory efficiency
A
- Scaling up is the natural solution to higher memory needs
- Also, debugging (and minimizing) what information is stored may help
14
Q
Ways to improve run-time efficiency
A
- Indexing of relevant information
- Resort to simpler NLP algorithms
- Filtering and scheduling in pipelines
- Parallelization of NLP processes
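Two of these tweaks can be sketched together: filter texts with a cheap check before running an expensive algorithm, and parallelize the remaining work over texts. The corpus, the keyword filter, and the "expensive" extraction below are toy stand-ins for real components.

```python
# Filtering + parallelization: a cheap keyword check narrows the corpus,
# and the remaining texts are processed in parallel.
from concurrent.futures import ThreadPoolExecutor

corpus = [
    "Apple was founded in 1976.",
    "The weather is nice today.",
    "Google was founded in 1998.",
]

def cheap_filter(text):
    # Only texts mentioning 'founded' can yield a founding-year relation.
    return "founded" in text

def extract_year(text):
    # Stand-in for a time-intensive extraction algorithm.
    return next((tok.strip(".") for tok in text.split()
                 if tok.strip(".").isdigit()), None)

candidates = [t for t in corpus if cheap_filter(t)]   # filtering
with ThreadPoolExecutor() as pool:                    # parallelization
    years = list(pool.map(extract_year, candidates))
print(years)  # ['1976', '1998']
```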
15
Q
Potential memory efficiency issues
A
Memory consumption in NLP
- Permanent and temporary storage of input texts and output information
- Storage of algorithms and models during execution
Storage of inputs and outputs
- Single input texts are usually small in NLP
- Output information is negligible compared to input
- The main problem may be the permanent storage of full text corpora
Storage of algorithms
- Memory consumption may add up in longer text analysis pipelines
- Machine learning brings up further challenges due to huge models
- In both cases, powerful machines and/or parallelization are needed