Empirical Methods Flashcards
Quantitative Methods
Characterized by objective measurements
Qualitative Methods
Emphasizes the understanding of human experience
Descriptive statistic
Methods for summarizing a sample or a distribution of values; used to describe phenomena
Inferential statistic
Methods for drawing conclusions based on values; used to generalize inferences beyond a given sample. Example: "The average number is significantly higher than 5"
Elements of empirical methods in NLP
- Evaluation measures: Quantification of the quality of a method, especially its effectiveness
- Empirical experiments: Evaluation of the quality on text corpora and comparison to alternative methods
- Hypothesis testing: Use of statistical methods to “prove” the quality of a method in comparison to others
Evaluation measures
- Effectiveness: The extent to which the output information of an approach is correct
- High effectiveness is the primary goal of any NLP method
- Classification measures: Accuracy, precision, recall, F1-score, ...
- Regression measures: Mean absolute/squared error, ...
- Efficiency: The costs of a method in terms of the consumption of time or space
- Measures: Run-time, training time, memory consumption, ...
Classification Effectiveness
- The instances of each class can be evaluated in a binary manner
- For each instance, check whether its class matches the ground truth
- Positives: the class instances a given approach has inferred
- Negatives: all other possible instances
Instance types in the evaluation:
- True positive (TP): a positive that belongs to the ground truth
- False positive (FP): a positive that does not belong to the ground truth
- False negative (FN): a negative that belongs to the ground truth
- True negative (TN): a negative that does not belong to the ground truth
When to use accuracy?
- Accuracy is adequate when all classes are of similar importance
- Examples: Sentiment analysis, part-of-speech tagging, ...
When not to use accuracy?
- In tasks where one class is rare, high accuracy can be achieved by never predicting the class
- 4% spam -> 96% accuracy by always predicting “no spam”
- This includes tasks where the correct output information covers only portions of text, such as in entity recognition
- “Apples rocks” → Negatives: “A”, “Ap”, “App”,…
- Accuracy is inadequate when true negatives are of low importance
- The precision is a measure of the exactness of an approach
- P answers: How many of the found instances are correct?
- The recall R is a measure of the completeness of an approach
- R answers: How many of the correct instances have been found?
- The F1-score is the harmonic mean of precision and recall
- F1 favors balanced over imbalanced precision and recall values
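The measures above can be computed directly from the four instance counts: accuracy = (TP + TN) / (TP + FP + FN + TN), P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2 · P · R / (P + R). The following is a minimal sketch, not a reference implementation. The function name and the two filters are hypothetical, and returning 0 for an undefined ratio is a convention assumed here.

```python
def classification_scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from the four instance counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical filter that always predicts "no spam" on 100 mails with 4% spam.
always_no_spam = classification_scores(tp=0, fp=0, fn=4, tn=96)

# Hypothetical filter that finds 3 of the 4 spam mails and raises 1 false alarm.
some_filter = classification_scores(tp=3, fp=1, fn=1, tn=95)

print(always_no_spam)
print(some_filter)
```

The first filter reaches 96% accuracy although its precision, recall, and F1 are all 0, which is why accuracy is inadequate when one class is rare.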
Boundary errors and Issues
A common error in tasks where text spans need to be annotated is to choose a wrong boundary of the span
- A wrong boundary leads to both an FP and an FN
- Identifying nothing as positive instead would increase the F1-score, since it causes only the FN
How to deal with boundary errors
- Different ways to account for the issue have been proposed, but the standard F1 is still used in most evaluations
- A relaxed evaluation is to consider some character overlap instead of exact boundaries
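A small sketch of exact versus relaxed span evaluation may make the effect of boundary errors concrete. It assumes spans are given as hypothetical (start, end) character offsets with an exclusive end, and it treats "some character overlap" as at least one shared character. Other overlap thresholds are equally possible.

```python
def span_f1(predicted, gold, relaxed=False):
    """F1-score over (start, end) character spans; end offsets are exclusive."""
    def matches(a, b):
        if relaxed:
            # Assumed relaxation: the spans share at least one character.
            return a[0] < b[1] and b[0] < a[1]
        return a == b

    if not predicted or not gold:
        return 0.0
    matched_predictions = sum(any(matches(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(matches(p, g) for p in predicted) for g in gold)
    precision = matched_predictions / len(predicted)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 5), (10, 15)]
with_boundary_error = [(0, 5), (10, 14)]  # second span ends one character early
without_second_span = [(0, 5)]            # second span not predicted at all

print(span_f1(with_boundary_error, gold))                # exact matching
print(span_f1(without_second_span, gold))                # exact matching
print(span_f1(with_boundary_error, gold, relaxed=True))  # overlap matching
```

Under exact matching, the near miss scores lower (0.5) than not predicting the span at all (about 0.67), because it counts as both an FP and an FN. Under the relaxed evaluation, it counts as correct.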
Evaluation of multi-class tasks
- In general, each class in a multi-class task can be evaluated binarily
- Accuracy can be computed for any number k of classes
- The other measures must be combined with micro- or macro-averaging
Micro-averaged precision
Micro-averaging takes into account the number of instances per class, so larger classes get more importance
Macro-averaged precision:
Macro-averaging computes the mean result over all classes, so each class gets the same importance
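The difference between the two averages shows up as soon as class sizes differ. The sketch below assumes hypothetical per-class (TP, FP) counts for a three-class task and that every class was predicted at least once, so no precision is undefined.

```python
def micro_macro_precision(counts):
    """counts maps each class to its (TP, FP) pair; returns (micro, macro)."""
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    # Micro: pool the counts of all classes, then compute one precision.
    micro = total_tp / (total_tp + total_fp)
    # Macro: compute the precision per class, then take the unweighted mean.
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Hypothetical counts: one large class that works well, two small ones that do not.
counts = {"positive": (90, 10), "neutral": (5, 5), "negative": (1, 9)}
micro, macro = micro_macro_precision(counts)
print(micro, macro)
```

Here the micro-averaged precision is 0.8, dominated by the large class, whereas the macro-averaged precision is 0.5, since the two weak small classes count as much as the large one.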
Confusion matrix
- Each row refers to the ground-truth instances of one of k classes
- Each column refers to the classified instances of one class
- The cells contain the numbers of correct and incorrect classifications of a given approach
Why confusion matrices
- Used to analyze errors, to see which classes are confused
- Contains all values for computing micro- and macro-averaged results
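As an illustration, a confusion matrix can be built by counting (ground truth, prediction) pairs. This is a minimal sketch with hypothetical labels for a three-class sentiment task. Rows are ground-truth classes and columns are predicted classes, as described above.

```python
from collections import Counter


def confusion_matrix(gold, predicted, classes):
    """Rows: ground-truth classes. Columns: predicted classes."""
    pairs = Counter(zip(gold, predicted))
    return [[pairs[(g, p)] for p in classes] for g in classes]


classes = ["pos", "neu", "neg"]
gold = ["pos", "pos", "pos", "neu", "neu", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neu", "neu", "neg", "neg", "neg", "pos"]

matrix = confusion_matrix(gold, predicted, classes)
for label, row in zip(classes, matrix):
    print(label, row)

# Accuracy: correct classifications lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(gold)

# Counts for one class, here "pos": the column gives TP + FP, the row TP + FN.
pos = classes.index("pos")
tp = matrix[pos][pos]
fp = sum(matrix[row][pos] for row in range(len(classes))) - tp
fn = sum(matrix[pos]) - tp
print(accuracy, tp, fp, fn)
```

Repeating the last step for every class yields all TP, FP, and FN counts needed for micro- and macro-averaged results.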
Types of prediction errors
- Mean absolute error (MAE)
- The mean absolute difference between predicted and ground-truth values
- The MAE is robust to outliers, i.e., it does not treat them specially
- Mean squared error (MSE)
- The mean squared difference between predicted and ground-truth values
- The MSE is specifically sensitive to outliers
Sometimes, the root mean squared error (RMSE) is also computed, defined as RMSE = sqrt(MSE)
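The three regression errors can be sketched in a few lines. The values below are hypothetical. The last prediction is a deliberate outlier to show how differently MAE and MSE react to it.

```python
import math


def mae(predicted, gold):
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


def mse(predicted, gold):
    """Mean squared error."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


def rmse(predicted, gold):
    """Root mean squared error."""
    return math.sqrt(mse(predicted, gold))


gold = [3.0, 5.0, 2.0, 8.0]
predicted = [2.0, 5.0, 4.0, 4.0]  # absolute errors: 1, 0, 2, 4

print(mae(predicted, gold))   # 1.75
print(mse(predicted, gold))   # 5.25
print(rmse(predicted, gold))
```

The single error of 4 contributes 4 of the 7 units of total absolute error, but 16 of the 21 units of total squared error, which is why the MSE is the more outlier-sensitive measure.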
Empirical Experiments:
- An empirical experiment tests a hypothesis based on observations
- The focus here is on effectiveness evaluation in NLP
Intrinsic vs extrinsic effectiveness evaluation:
- Intrinsic: the effectiveness of an approach is directly evaluated on the task it is made for:
- What accuracy does a part-of-speech tagger XY have on the dataset D?
- Extrinsic: the effectiveness of an approach is evaluated by measuring how effective its output is in a downstream task
- Does the output of XY improve sentiment analysis on D?
Text corpora
- A principled collection of natural language texts with known properties, compiled to study a language problem
- The texts in a corpus are often annotated, at least for the problem to be studied
Need for text corpora:
- NLP approaches are developed and evaluated on text corpora
- Without a corpus, it's hard to develop a strong approach and impossible to reliably evaluate it
- An annotation marks a text or a span of text as representing meta-information of a specific type
- It may also be used to specify relations between other annotations
- The types are specified by an annotation scheme
Types of Annotations of corpora
Annotated corpora in NLP:
- Usually, a corpus contains annotations of information types of interest in a task or domain
Manual annotations
- The annotations of a text corpus are usually created manually
- To assess the quality of manual annotations, inter-annotator agreement is computed based on texts annotated multiple times
Ground-truth annotations:
- Manual annotations assumed to be correct are called the ground truth
- NLP usually learns from ground-truth annotations
Automatic annotation:
- Technically, NLP algorithms can be seen as just adding annotations of certain types to a processed text
- The automatic process usually aims to mimic the manual process
Development and Evaluation
Dataset: A sub-corpus of a corpus that is compiled and used for developing and/or evaluating approaches to specific tasks
Development and evaluation based on datasets:
- An approach is developed based on a set of training instances
- The approach is applied to a set of test instances
- The output of the approach is compared to the ground truth of the test instances using evaluation measures
- These three steps may be repeated iteratively to improve the approach
Corpus splitting
- The split of a corpus into datasets should represent the task well
- The way a corpus is split implies how to evaluate
- Main evaluation types: training, validation, and test split vs. cross-validation
Training, Validation and Test
- Training set:
- Known instances used to develop or statistically learn an approach
- The training set may be analyzed manually and automatically
- Validation set:
- Unknown test instances used to iteratively evaluate an approach
- The approach is optimized on (and adapts to) the validation set
- Test set:
- Unknown test instances used for the final evaluation of an approach
- The test set represents unseen data
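A three-way split can be sketched as follows. The 80/10/10 proportions, the fixed seed, and the function name are assumptions made for illustration. Random shuffling is only one option, and a corpus may call for a different split (e.g., by source or by time) to represent the task well.

```python
import random


def split_corpus(instances, train=0.8, validation=0.1, seed=42):
    """Shuffle the instances and cut them into training, validation, and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set


# Hypothetical corpus of 100 instances, represented by their ids.
training_set, validation_set, test_set = split_corpus(range(100))
print(len(training_set), len(validation_set), len(test_set))
```

The approach is then developed on the training set, tuned repeatedly on the validation set, and evaluated once on the test set at the very end.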
Cross Validation
- Stratified n-fold cross-validation
- A corpus is split into n dataset folds of equal size, usually n = 10
- n runs: The evaluation results are averaged over n runs
- i-th run: The i-th fold is used for evaluation (validation). All other folds are used for development (training)
- Pros and cons of cross-validation
- Often preferred when data is small, as more data is given for training
- Cross-validation avoids potential bias in a corpus split
- Random splitting often makes the task easier, due to corpus bias
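The procedure can be sketched as follows. This is a simplified illustration, not a library implementation: the round-robin fold assignment is one possible way to stratify, and the majority-class "approach" is only a hypothetical stand-in for a real method.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n=10, seed=0):
    """Distribute instance indices over n folds with similar class distributions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % n].append(index)  # deal out like playing cards
            position += 1
    return folds


def cross_validate(labels, evaluate, n=10):
    """Average the score of evaluate(training, validation) over n runs."""
    folds = stratified_folds(labels, n)
    scores = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(training, validation))
    return sum(scores) / n


# Hypothetical corpus labels: 20 spam and 80 non-spam instances.
labels = ["spam"] * 20 + ["ham"] * 80


def majority_accuracy(training, validation):
    """Stand-in approach: always predict the majority class of the training folds."""
    majority = Counter(labels[i] for i in training).most_common(1)[0][0]
    return sum(labels[i] == majority for i in validation) / len(validation)


mean_accuracy = cross_validate(labels, majority_accuracy, n=10)
print(mean_accuracy)
```

Because of the stratification, every fold contains 2 spam and 8 non-spam instances here, so each of the ten runs sees the same class distribution in training and validation.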
Variations of Cross validation
- Repeated cross-validation
- Often, cross-validation is repeated multiple times with different folds
- This way, coincidental effects of random splitting are accounted for
- Leave-one-out validation
- Cross-validation where n equals the number of instances
- This way, any potential bias in the splitting is avoided
- But even more data is given for training, which makes a task easier
- Cross-validation + test set
- When doing cross-validation, a held-out test set is still important
- Otherwise, repeated development will overfit to the splitting
Comparison (Upper Bounds and Lower Bounds)
Why compare?
- Approaches may be compared to a gold standard and to baselines to put their measured effectiveness into perspective
- Gold standard (upper bound)
- The gold standard represents the best possible result on a given task
- For many tasks, the effectiveness that humans achieve is seen as best
- If not available, the gold standard is often equated with the ground truth in a corpus
- Baseline (lower bound)
- A baseline is an alternative approach that has been proposed before or that can easily be realized
- A new approach should be better than all baselines
Types of baselines for comparison
- Trivial baselines:
- Methods that can easily be derived from a given task or dataset
- Used to evaluate whether a new approach achieves anything
- Standard baseline:
- Methods that are often used for related tasks
- Used to evaluate how hard a task is
- Sub-approaches
- Sub-approaches of a new approach
- Used to analyze the impact of the different parts of an approach
- State of the art
- The best published methods for the addressed task
- Used to verify whether a new approach is best
- When does comparison work?
- Variations of a task may affect its complexity
- The same task may have different complexity on different datasets
- Two approaches can be compared reasonably only in exactly the same experiment setting
Hypothesis Testing (Statistics)
- Variable: An entity that can take on different quantitative or qualitative values
- Independent: A variable X that is expected to affect another variable
- Dependent: A variable Y that is expected to be affected by others
Scales of variables (they are not like functions)
- Nominal: Values that represent discrete, separate categories (lowest level of measurement, categorical)
- Ordinal: Values that can be ranked by what is better, e.g., on a Likert scale (ordered from lowest to highest: win, place, show)
- Interval: Values whose distance can be measured (distance means something: temperature on a thermometer)
- Ratio: Interval values that have a “true zero”, which indicates the absence of what is represented by a variable (zero dollars means no money)
Inferential Statistics
- Procedures that help study hypotheses based on values
- Used to make inferences about a distribution beyond a given sample
Two competing hypotheses
- Research hypothesis (H): Prediction about how a change in variables will cause changes in other variables
- Null hypothesis (H0): Antithesis to H
- If H0 is true, then any results observed in an experiment that support H are due to chance or sampling error
Two types of hypotheses:
- Directional: Specifies the direction of an expected difference
- Non-directional: Specifies only that any difference is expected
Good hypotheses
- Founded in a problem statement and supported by research
- Testable, i.e., it is possible to collect data to study the hypothesis
- States an expected relationship between variables
- Phrased as simply and concisely as possible
Hypothesis test:
- A statistical procedure that determines how likely it is that the results of an experiment are due to chance or sampling error
- Tests whether a null hypothesis H0 can be rejected (and hence, H can be accepted) at some chosen significance level
Significance level alpha
- The accepted risk that H0 is wrongly rejected (usually, alpha is set to 0.05 or to 0.01)
- A choice of alpha = 0.05 means that there is no more than a 5% chance that a potential rejection of H0 is wrong (in other words, with ≥ 95% confidence a potential rejection is correct)
p-value:
- The probability of obtaining results at least as extreme as the observed ones if H0 is true, i.e., if they are due to chance alone
- If p ≤ alpha, H0 is rejected. The results are seen as statistically significant
- If p > alpha, H0 cannot be rejected
Four steps of hypothesis testing
- Hypothesis: State H and H0
- Significance level: Choose alpha (always before the test)
- Testing: Carry out an appropriate hypothesis test to get the p-value
- Decision: Depending on alpha and p, reject H0 or fail to reject it
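The four steps can be walked through with any test that yields a p-value. The sketch below swaps in an exact paired sign-flip permutation test, which is not covered by these cards and is chosen here only because it needs no statistics library. The two score lists are hypothetical F1-scores of two approaches on the same eight folds.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under H0, each difference is equally likely to be positive or negative,
    # so every combination of sign flips is an equally likely outcome.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Step 1, hypothesis: H = "A and B differ in F1", H0 = "A and B do not differ".
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85]
scores_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.81]

# Step 2, significance level: chosen before the test.
alpha = 0.05

# Step 3, testing: obtain the p-value.
p = paired_permutation_p_value(scores_a, scores_b)

# Step 4, decision: reject H0 or fail to reject it.
reject_h0 = p <= alpha
print(p, reject_h0)
```

Since A is better than B on every fold, only 2 of the 256 sign combinations are as extreme as the observation, giving p = 2/256 ≈ 0.008 and a rejection of H0 at alpha = 0.05.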
Hypothesis tests
- A significance test needs to be chosen that fits the data
- Different tests exist that make different assumptions about the data
Parametric vs. non-parametric tests
- Parametric: More powerful and precise, i.e., it is more likely to detect a significant effect when one truly exists
- Non-parametric: Fewer assumptions and, thus, more often applicable
- Each parametric test has a non-parametric counterpart
Assumption: A thing that is accepted as true or as certain to happen, without proof
Assumptions of all hypothesis tests
- Sampling: The sample is a random sample from the distribution
- Values: The values within each variable are independent
Assumptions of all parametric tests
- Scale: The dependent variable has an interval or ratio scale
- Distribution: The given distributions are normally distributed
- Variance: Distributions that are compared have similar variances
Test-specific assumptions
- In addition, specific tests may have specific assumptions
- Depending on which are met, an appropriate test is chosen
Student's t-test:
- A parametric hypothesis test for small samples (n ≤ 30)
- Computes a t-score from which significance can be derived
- Types: One-sample t-test, dependent t-test, independent t-test
Test-specific assumptions
- The independent variable has a nominal scale
- t-tests are robust to moderate violations of the normality assumption
One-tailed vs two-tailed
- One-tailed: Tests a directional hypothesis
- Two-tailed: Tests a non-directional hypothesis
One sample vs paired samples
- One sample: A sample mean is compared to a known value
- Paired samples: Two sample means are compared to each other
- t-distribution: A variation of the normal distribution for small sample sizes
- It depends on the degrees of freedom (DoF) in an experiment
- Statistics libraries (e.g., in Python) can compute t-distributions
- Otherwise, tables exist that list the significance levels of t-values
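As a sketch of a dependent (paired) t-test, the t-score can be computed as the mean of the paired differences divided by its standard error, with DoF = n - 1. The accuracy values below are hypothetical. Instead of calling a statistics library, the sketch compares the t-score with the two-tailed critical value for 9 DoF at alpha = 0.05, which is 2.262 in standard t-tables.

```python
import math
import statistics


def dependent_t_score(scores_a, scores_b):
    """Return (t-score, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    standard_error = std_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# Hypothetical accuracies (in %) of two approaches on the same 10 test folds.
scores_a = [72, 75, 70, 78, 74, 71, 77, 73, 76, 74]
scores_b = [70, 74, 71, 75, 72, 70, 74, 73, 73, 72]

t, dof = dependent_t_score(scores_a, scores_b)

# Two-tailed critical value for DoF = 9 at alpha = 0.05, taken from a t-table.
CRITICAL_T = 2.262
significant = abs(t) > CRITICAL_T
print(round(t, 3), dof, significant)
```

With a t-score of about 3.75 at 9 DoF, the difference is significant at alpha = 0.05 in a two-tailed test. A one-tailed test or a different DoF would need a different critical value.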
What if the t-test assumptions are not met?
- Test-specific assumptions not met: Find another parametric test that is applicable
- Assumptions of parametric tests not met: Find an applicable non-parametric test
- Assumptions of all significance tests not met: The hypotheses cannot be tested