Empirical Methods Flashcards
Quantitative Methods
Characterized by objective measurements
Qualitative Methods
Emphasizes the understanding of human experience
Descriptive statistic
Methods for summarizing a sample or a distribution of values; used to describe phenomena
Inferential statistic
Methods for drawing conclusions based on observed values; used to generalize inferences beyond a given sample, e.g., “The average number is significantly higher than 5”
Elements of empirical methods in NLP
- Evaluation measures: Quantification of the quality of a method, especially its effectiveness
- Empirical experiments: Evaluation of the quality on text corpora and comparison to alternative methods
- Hypothesis testing: Use of statistical methods to “prove” the quality of a method in comparison to others (see the sketch below)
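A minimal sketch of such a comparison with a paired t-test (illustrative only: the per-fold accuracies, the two hypothetical methods A and B, and the 0.05 threshold are assumptions, not values from the course):

```python
# Hedged sketch: paired t-test on per-fold scores of two hypothetical approaches.
# scores_a and scores_b are assumed accuracies of methods A and B on the same 10 folds.
from scipy.stats import ttest_rel

scores_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.81, 0.80, 0.83]
scores_b = [0.76, 0.75, 0.79, 0.77, 0.78, 0.74, 0.80, 0.76, 0.75, 0.78]

result = ttest_rel(scores_a, scores_b)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("The difference between A and B is statistically significant")
```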
Evaluation measures
Effectiveness:
- The extent to which the output information of an approach is correct
- High effectiveness is the primary goal of any NLP method
- Classification measures: Accuracy, precision, recall, F1-score, …
- Regression measures: mean absolute/squared error,…
Efficiency:
- The costs of a method in terms of the consumption of time or space
- Measures: Run-time, training time, memory consumption, …
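A minimal sketch of measuring run-time as an efficiency measure; the `tag_tokens` function and its input are hypothetical placeholders for any NLP method:

```python
# Hedged sketch: run-time measurement with a simple wall-clock timer.
# `tag_tokens` is a placeholder standing in for an arbitrary NLP method.
import time

def tag_tokens(tokens):
    return [(token, "NOUN") for token in tokens]  # dummy output

tokens = ["Empirical", "methods", "matter"] * 10_000

start = time.perf_counter()
tag_tokens(tokens)
elapsed = time.perf_counter() - start
print(f"Run-time: {elapsed:.4f} s")
```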
Classification Effectiveness
- The instances of each class can be evaluated in a binary manner
- For each instance, check whether its class matches the ground truth
- Positives: the class instances a given approach has inferred
- Negatives: all other possible instances
Instance types in the evaluation:
- True positive (TP): a positive that belongs to the ground truth
- False positive (FP): a positive that does not belong to the ground truth
- False negative (FN): a negative that belongs to the ground truth
- True negative (TN): a negative that does not belong to the ground truth
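A minimal sketch of counting the four instance types and accuracy for a binary task, assuming ground-truth and predicted labels are given as parallel lists (the spam labels and values are illustrative):

```python
# Hedged sketch: counting TP, FP, FN, TN for a binary classification task.
# `gold` and `pred` are assumed parallel lists of ground-truth and predicted labels.
gold = ["spam", "no spam", "no spam", "spam", "no spam"]
pred = ["spam", "no spam", "spam", "no spam", "no spam"]

tp = sum(1 for g, p in zip(gold, pred) if p == "spam" and g == "spam")
fp = sum(1 for g, p in zip(gold, pred) if p == "spam" and g != "spam")
fn = sum(1 for g, p in zip(gold, pred) if p != "spam" and g == "spam")
tn = sum(1 for g, p in zip(gold, pred) if p != "spam" and g != "spam")

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 3 / 5 = 0.6 in this toy example
```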
When to use accuracy?
- Accuracy is adequate when all classes are of similar importance
- Examples: Sentiment analysis, part of speech tagging,…
When not to use accuracy?
- In tasks where one class is rare, high accuracy can be achieved by never predicting that class
- 4% spam → 96% accuracy by always predicting “no spam”
- This includes tasks where the correct output information covers only portions of text, such as in entity recognition
- “Apples rocks” → Negatives: “A”, “Ap”, “App”,…
- Accuracy is inadequate when true negatives are of low importance
Precision
- The precision P is a measure of the exactness of an approach
- P answers: How many of the found instances are correct?
Recall
- The recall R is a measure of the completeness of an approach
- R answers: How many of the correct instances have been found?
F1-score
- The F1-score is the harmonic mean of precision and recall
- F1 favors balanced over imbalanced precision and recall values
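A minimal sketch of the three measures computed from TP/FP/FN counts; the zero-division guards and the example counts are assumptions for illustration:

```python
# Hedged sketch: precision, recall, and F1-score from TP/FP/FN counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)  # 0.8, 0.666..., 0.727...
```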
Boundary errors and Issues
A common error in tasks where text spans need to be annotated is to choose a wrong boundary of the span
Issues
- A boundary error leads to both an FP and an FN
- Paradoxically, identifying nothing as positive (an FN only) would yield a higher F1-score than the near-miss prediction
How to deal with boundary errors
- Different ways of accounting for the issue have been proposed, but the standard F1-score is still used in most evaluations
- A relaxed evaluation considers some character overlap instead of exact boundary matches (see the sketch below)
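One possible sketch of such a relaxed match, assuming spans are given as (start, end) character offsets and that any character overlap counts as a match; actual relaxed schemes differ between evaluations:

```python
# Hedged sketch: relaxed span matching based on character overlap.
# Spans are assumed to be (start, end) character offsets, end exclusive.
def overlaps(span_a, span_b) -> bool:
    return max(span_a[0], span_b[0]) < min(span_a[1], span_b[1])

gold_span = (0, 5)       # e.g., the first five characters of the sentence
predicted_span = (0, 3)  # boundary error: a too-short prediction

exact_match = predicted_span == gold_span            # False -> FP and FN under strict F1
relaxed_match = overlaps(predicted_span, gold_span)  # True under the relaxed evaluation
```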
Evaluation of multi-class tasks
- In general, each class in a multi-class task can be evaluated binarily.
- Accuracy can be computed for any number k of classes
- The other measures must be combined with micro- or macro-averaging
Micro-averaged precision
Micro-averaging takes into account the number of instances per class, so larger classes get more importance
Macro-averaged precision:
Macro-averaging computes the mean result over all classes, so each class gets the same importance
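A minimal sketch contrasting the two averaging schemes; the per-class TP/FP counts are invented to show how a large class dominates the micro average:

```python
# Hedged sketch: micro- vs. macro-averaged precision from per-class counts.
# `per_class` is an assumed mapping from class name to (TP, FP) counts.
per_class = {"positive": (90, 10), "neutral": (5, 5), "negative": (8, 2)}

# Micro: pool all decisions, so large classes dominate the result.
total_tp = sum(tp for tp, _ in per_class.values())
total_fp = sum(fp for _, fp in per_class.values())
micro_precision = total_tp / (total_tp + total_fp)

# Macro: average the per-class precisions, so every class counts equally.
macro_precision = sum(
    tp / (tp + fp) for tp, fp in per_class.values()
) / len(per_class)

print(micro_precision)  # 103 / 120 ≈ 0.858
print(macro_precision)  # (0.9 + 0.5 + 0.8) / 3 ≈ 0.733
```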
Confusion matrix
- Each row refers to the ground-truth instances of one of k classes
- Each column refers to the classified instances of one class
- The cells contain the numbers of correct and incorrect classifications of a given approach
Why confusion matrices
- Used to analyze errors, i.e., to see which classes are confused with one another
- Contains all values for computing micro- and macro-averaged results
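A minimal sketch of building such a matrix from label lists; the three classes and the example labels are assumptions for illustration:

```python
# Hedged sketch: building a k x k confusion matrix
# (rows = ground-truth class, columns = predicted class).
from collections import Counter

classes = ["positive", "neutral", "negative"]
gold = ["positive", "neutral", "negative", "positive", "neutral"]
pred = ["positive", "negative", "negative", "neutral", "neutral"]

counts = Counter(zip(gold, pred))
matrix = [[counts[(g, p)] for p in classes] for g in classes]

for cls, row in zip(classes, matrix):
    print(cls, row)
# positive [1, 1, 0]
# neutral  [0, 1, 1]
# negative [0, 0, 1]
```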
Types of prediction errors
- Mean absolute error (MAE)
- The mean absolute difference between predicted and ground-truth values
- The MAE is robust to outliers, i.e., it does not treat them specially
- Mean squared error (MSE)
- The mean squared difference between predicted and ground-truth values
- The MSE is particularly sensitive to outliers
Sometimes, the root mean squared error (RMSE) is also computed, defined as RMSE = sqrt(MSE)
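A minimal sketch of the three regression measures, assuming predicted and ground-truth values are given as parallel numeric lists (the values are illustrative):

```python
# Hedged sketch: MAE, MSE, and RMSE for a regression-style prediction task.
# `gold` and `pred` are assumed parallel lists of numeric values.
import math

gold = [3.0, 5.0, 2.5, 7.0]
pred = [2.5, 5.0, 4.0, 8.0]

n = len(gold)
mae = sum(abs(p - g) for g, p in zip(gold, pred)) / n
mse = sum((p - g) ** 2 for g, p in zip(gold, pred)) / n
rmse = math.sqrt(mse)

print(mae, mse, rmse)  # 0.75, 0.875, 0.935...
```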
Empirical Experiments:
- An empirical experiment tests a hypothesis based on observations
- The focus here is on effectiveness evaluation in NLP
Intrinsic vs extrinsic effectiveness evaluation:
- Intrinsic: the effectiveness of an approach is directly evaluated on the task it is made for:
- What accuracy does a part-of-speech tagger XY have on the dataset D?
- Extrinsic: the effectiveness of an approach is evaluated by measuring how effective its output is in a downstream task
- Does the output of XY improve sentiment analysis on D?