Empirical Methods Flashcards

1
Q

Quantitative Methods

A

Characterized by objective measurements

2
Q

Qualitative Methods

A

Emphasizes the understanding of human experience

3
Q

Descriptive statistics

A

Methods for summarizing a sample or a distribution of values; used to describe phenomena

4
Q

Inferential statistics

A

Methods for drawing conclusions based on values; used to generalize inferences beyond a given sample, e.g., “The average number is significantly higher than 5”

5
Q

Elements of empirical methods in NLP

A
  • Evaluation measures: Quantification of the quality of a method, especially its effectiveness
  • Empirical experiments: Evaluation of the quality on text corpora and comparison to alternative methods
  • Hypothesis testing: Use of statistical methods to “prove” the quality of a method in comparison to others
6
Q

Evaluation measures

A

Effectiveness:

  • The extent to which the output information of an approach is correct
  • High effectiveness is the primary goal of any NLP method
  • Classification measures: Accuracy, precision, recall, F1-score, …
  • Regression measures: mean absolute/squared error,…

Efficiency:

  • The costs of a method in terms of the consumption of time or space
  • Measures: Run-time, training time, memory consumption, …
7
Q

Classification Effectiveness

A
  • The instances of each class can be evaluated in a binary manner
  • For each instance, check whether its class matches the ground truth
  • Positives: the class instances a given approach has inferred
  • Negatives: all other possible instances

Instance types in the evaluation:

  • True positive (TP): a positive that belongs to the ground truth
  • False positive (FP): a positive that does not belong to the ground truth
  • False negative (FN): a negative that belongs to the ground truth
  • True negative (TN): a negative that does not belong to the ground truth
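
A minimal sketch of how these four counts can be tallied for a single class, assuming predictions and ground-truth labels are given as plain Python lists (all names and data here are illustrative):

    # Tally TP, FP, FN, TN for one class of interest (illustrative sketch).
    def count_outcomes(predicted, ground_truth, positive_class):
        tp = fp = fn = tn = 0
        for pred, truth in zip(predicted, ground_truth):
            if pred == positive_class and truth == positive_class:
                tp += 1  # positive that belongs to the ground truth
            elif pred == positive_class:
                fp += 1  # positive that does not belong to the ground truth
            elif truth == positive_class:
                fn += 1  # negative that belongs to the ground truth
            else:
                tn += 1  # negative that does not belong to the ground truth
        return tp, fp, fn, tn

    print(count_outcomes(["spam", "ham", "ham"], ["spam", "spam", "ham"], "spam"))
    # -> (1, 0, 1, 1)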
8
Q

When to use accuracy?

A
  • Accuracy is adequate when all classes are of similar importance
  • Examples: Sentiment analysis, part-of-speech tagging, …
9
Q

When not to use accuracy?

A
  • In tasks where one class is rare, high accuracy can be achieved by never predicting the class
    • 4% spam → 96% accuracy by always predicting “no spam” (see the sketch below)
  • This includes tasks where the correct output information covers only portions of text, such as in entity recognition
    • “Apples rocks” → Negatives: “A”, “Ap”, “App”,…
  • Accuracy is inadequate when true negatives are of low importance
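
The spam example can be reproduced in a few lines; the 4% spam rate is taken from the bullet above, everything else is illustrative:

    # Majority-class "classifier": always predict "no spam".
    labels = ["spam"] * 4 + ["no spam"] * 96  # 4% spam, as in the example
    predictions = ["no spam"] * len(labels)

    correct = sum(p == t for p, t in zip(predictions, labels))
    print(correct / len(labels))  # 0.96 -- high accuracy, yet no spam is ever detected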
10
Q

Precision

A
  • The precision P is a measure of the exactness of an approach
  • P answers: How many of the found instances are correct?
11
Q

Recall

A
  • The recall R is a measure of the completeness of an approach
  • R answers: How many of the correct instances have been found?
12
Q

F1-score

A
  • The F1-score is the harmonic mean of precision and recall
  • F1 favors balanced over imbalanced precision and recall values
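
A minimal sketch that puts the three measures together, computing precision, recall, and F1-score from the outcome counts defined earlier (the counts are made up):

    # Precision, recall, and F1-score from outcome counts (illustrative sketch).
    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0  # exactness
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0     # completeness
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)           # harmonic mean
        return precision, recall, f1

    print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, ~0.667, ~0.727)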
13
Q

Boundary errors and Issues

A

A common error in tasks where text spans need to be annotated is to choose a wrong boundary of the span

Issues
- Leads to both an FP and an FN
- Predicting nothing at all would thus have led to a higher F1-score than predicting the span with a wrong boundary

14
Q

How to deal with boundary errors

A
  • Different ways of accounting for the issue have been proposed, but the standard F1-score is still used in most evaluations
  • A relaxed evaluation is to consider some character overlap instead of exact boundaries (see the sketch below)
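
A sketch of what such a relaxed evaluation could look like, assuming spans are given as (start, end) character offsets and any overlap counts as a match; this is one possible relaxation, not a fixed standard:

    # Relaxed span matching: a predicted span counts as correct if it
    # overlaps the ground-truth span by at least one character.
    def spans_overlap(pred, truth):
        pred_start, pred_end = pred
        truth_start, truth_end = truth
        return pred_start < truth_end and truth_start < pred_end

    print(spans_overlap((0, 5), (1, 6)))  # True: partial overlap is accepted
    print(spans_overlap((0, 5), (7, 9)))  # False: no overlap at all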
15
Q

Evaluation of multi-class tasks

A
  • In general, each class in a multi-class task can be evaluated in a binary manner
  • Accuracy can be computed for any number k of classes
  • The other measures must be combined with micro- or macro-averaging
16
Q

Micro-averaged precision

A

Micro-averaging takes into account the number of instances per class, so larger classes get more importance

17
Q

Macro-averaged precision

A

Macro-averaging computes the mean result over all classes, so each class gets the same importance
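
Both averaging schemes are implemented in common libraries. A sketch using scikit-learn's precision_score, assuming scikit-learn is installed (the labels are made up):

    from sklearn.metrics import precision_score

    truth = ["a", "a", "a", "a", "b", "c"]
    preds = ["a", "a", "b", "c", "b", "b"]

    # Micro: pools all instances, so the large class "a" dominates.
    print(precision_score(truth, preds, average="micro", zero_division=0))
    # Macro: unweighted mean of the per-class precision values.
    print(precision_score(truth, preds, average="macro", zero_division=0))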

18
Q

Confusion matrix

A
  • Each row refers to the ground-truth instances of one of k classes
  • Each column refers to the classified instances of one class
  • The cells contain the numbers of correct and incorrect classifications of a given approach
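
A sketch of how such a matrix can be computed with scikit-learn (the labels are made up):

    from sklearn.metrics import confusion_matrix

    truth = ["a", "a", "a", "b", "b", "c"]
    preds = ["a", "b", "a", "b", "c", "c"]

    # Rows: ground-truth classes; columns: classified classes (order: a, b, c).
    print(confusion_matrix(truth, preds, labels=["a", "b", "c"]))
    # [[2 1 0]
    #  [0 1 1]
    #  [0 0 1]]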
19
Q

Why confusion matrices?

A
  • Used to analyze errors, i.e., to see which classes are confused with each other
  • Contains all values for computing micro- and macro-averaged results
20
Q

Types of prediction errors

A
  • Mean absolute error (MAE)
    • The mean absolute difference between predicted and ground-truth values
    • The MAE is robust to outliers, i.e., it does not treat them specially
  • Mean squared error (MSE)
    • The mean squared difference between predicted and ground-truth values
    • The MSE is specifically sensitive to outliers

Sometimes, the root mean squared error (RMSE) is also computed, defined as RMSE = sqrt(MSE)
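
A minimal sketch of the three error measures, assuming predictions and ground-truth values are given as plain lists (the numbers are made up):

    import math

    truth = [2.0, 3.0, 5.0, 10.0]
    preds = [2.5, 3.0, 4.0, 16.0]  # the last prediction is an outlier

    n = len(truth)
    mae = sum(abs(p - t) for p, t in zip(preds, truth)) / n    # robust to the outlier
    mse = sum((p - t) ** 2 for p, t in zip(preds, truth)) / n  # penalizes the outlier strongly
    rmse = math.sqrt(mse)

    print(mae, mse, rmse)  # 1.875 9.3125 ~3.052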

21
Q

Empirical Experiments

A
  • An empirical experiment tests a hypothesis based on observations
  • The focus here is on effectiveness evaluation in NLP
22
Q

Intrinsic vs. extrinsic effectiveness evaluation

A
  • Intrinsic: the effectiveness of an approach is directly evaluated on the task it is made for:
    • What accuracy does a part-of-speech tagger XY have on the dataset D?
  • Extrinsic: the effectiveness of an approach is evaluated by measuring how effective its output is in a downstream task
    • Does the output of XY improve sentiment analysis on D?
24
Q

Text corpora

A
  • A principled collection of natural language texts with known properties, compiled to study a language problem
  • The texts in a corpus are often annotated, at least for the problem to be studied

Need for text corpora:

  • NLP approaches are developed and evaluated on text corpora
  • Without a corpus, it's hard to develop a strong approach and impossible to reliably evaluate it

Annotation:

  • An annotation marks a text or a span of text as representing meta-information of a specific type
  • It may also be used to specify relations between other annotations
  • The types are specified by an annotation scheme
25
Q

Types of Annotations of corpora

A

Annotated corpora in NLP:

  • Usually, a corpus contains annotations of the information types of interest in a task or domain

Manual annotations:

  • The annotations of a text corpus are usually created manually
  • To assess the quality of manual annotations, inter-annotator agreement is computed based on texts annotated multiple times

Ground-truth annotations:

  • Manual annotations assumed to be correct are called the ground truth
  • NLP usually learns from ground-truth annotations

Automatic annotation:

  • Technically, NLP algorithms can be seen as just adding annotations of certain types to a processed text
  • The automatic process usually aims to mimic the manual process
26
Q

Development and Evaluation

A

Dataset: A sub-corpus of a corpus that is compiled and used for developing and/or evaluating approaches to specific tasks

Development and evaluation based on datasets:

  1. An approach is developed based on a set of training instances
  2. The approach is applied to a set of test instances
  3. The output of the approach is compared to the ground truth of the test instances using evaluation measures
  4. Steps 1-3 may be iteratively repeated to improve the approach
27
Q

Corpus splitting

A

  • The split of a corpus into datasets should represent the task well
  • The way a corpus is split implies how to evaluate
  • Main evaluation types: (training, validation, and test) vs. cross-validation
28
Q

Training, Validation and Test

A

  • Training set:
    • Known instances used to develop or statistically learn an approach
    • The training set may be analyzed manually and automatically
  • Validation set:
    • Unknown test instances used to iteratively evaluate an approach
    • The approach is optimized on (and adapts to) the validation set
  • Test set:
    • Unknown test instances used for the final evaluation of an approach
    • The test set represents unseen data
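
One common way to realize such a split, sketched with scikit-learn's train_test_split; the 60/20/20 ratio is an assumption for illustration, not a fixed rule:

    from sklearn.model_selection import train_test_split

    X = list(range(100))    # placeholder instances
    y = [i % 2 for i in X]  # placeholder labels

    # First split off the test set, then split the rest into training and validation.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # 60 20 20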
29
Q

Cross Validation

A

  • Stratified n-fold cross-validation:
    • A corpus is split into n dataset folds of equal size, usually n = 10
    • n runs: the evaluation results are averaged over the n runs
    • i-th run: the i-th fold is used for evaluation (validation); all other folds are used for development (training)
  • Pros and cons of cross-validation:
    • Often preferred when data is small, as more data is given for training
    • Cross-validation avoids potential bias in a corpus split
    • Random splitting often makes the task easier, due to corpus bias
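
A sketch of stratified 10-fold cross-validation with scikit-learn's StratifiedKFold; the data and labels are placeholders:

    from sklearn.model_selection import StratifiedKFold

    X = list(range(50))     # placeholder instances
    y = [i % 2 for i in X]  # placeholder binary labels

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for i, (train_idx, eval_idx) in enumerate(skf.split(X, y)):
        # i-th run: train on all other folds, evaluate on the i-th fold.
        print(f"run {i}: {len(train_idx)} training, {len(eval_idx)} evaluation instances")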
30
Q

Variations of Cross validation

A

  • Repeated cross-validation:
    • Often, cross-validation is repeated multiple times with different folds
    • This way, coincidental effects of random splitting are accounted for
  • Leave-one-out validation:
    • Cross-validation where n equals the number of instances
    • This way, any potential bias in the splitting is avoided
    • But even more data is given for training, which makes a task easier
  • Cross-validation + test set:
    • When doing cross-validation, a held-out test set is still important
    • Otherwise, repeated development will overfit to the splitting
31
Q

Comparison (Upper Bounds and Lower Bounds)

A

Why compare?

  • Approaches may be compared to a gold standard and to baselines to put their effectiveness into perspective

Gold standard (upper bound):

  • The gold standard represents the best possible result on a given task
  • For many tasks, the effectiveness that humans achieve is seen as best
  • If not available, the gold standard is often equated with the ground truth in a corpus

Baseline (lower bound):

  • A baseline is an alternative approach that has been proposed before or that can easily be realized
  • A new approach should be better than all baselines
32
Q

Types of Baselines for Comparison

A

  • Trivial baselines:
    • Methods that can easily be derived from a given task or dataset
    • Used to evaluate whether a new approach achieves anything
  • Standard baselines:
    • Methods that are often used for related tasks
    • Used to evaluate how hard a task is
  • Sub-approaches:
    • Sub-approaches of the new approach
    • Used to analyze the impact of the different parts of an approach
  • State of the art:
    • The best published methods for the addressed task
    • Used to verify whether a new approach is best
33
Q

Implications

A

When does comparison work?

  • Variations of a task may affect its complexity
  • The same task may have different complexity on different datasets
  • Only in exactly the same experiment setting can two approaches be compared reasonably
34
Q

Hypothesis Testing (Statistics): Variable

A

  • An entity that can take on different quantitative or qualitative values
  • Independent: A variable X that is expected to affect another variable
  • Dependent: A variable Y that is expected to be affected by others
35
Q

Scales of variables (they are not like functions)

A

  • Nominal: Values that represent discrete, separate categories (lowest level of measurement; categorical)
  • Ordinal: Values that can be ranked from lowest to highest (e.g., a Likert scale, or win/place/show)
  • Interval: Values whose distance can be measured, i.e., distance means something (e.g., temperature on a thermometer)
  • Ratio: Interval values that have a “true zero”, i.e., a zero that indicates the absence of what the variable represents (e.g., zero dollars means no money)
36
Q

Inferential Statistics

A

  • Procedures that help study hypotheses based on values
  • Used to make inferences about a distribution beyond a given sample
37
Q

Two competing hypotheses

A

  • Research hypothesis (H): A prediction about how a change in variables will cause changes in other variables
  • Null hypothesis (H0): The antithesis to H
  • If H0 is true, then any results observed in an experiment that support H are due to chance or sampling error
38
Q

Two types of hypotheses

A

  • Directional: Specifies the direction of an expected difference
  • Non-directional: Specifies only that some difference is expected
39
Q

Good hypotheses

A

  • Founded in a problem statement and supported by research
  • Testable, i.e., it is possible to collect data to study the hypothesis
  • State an expected relationship between variables
  • Phrased as simply and concisely as possible
40
Q

Hypothesis test

A

  • A statistical procedure that determines how likely it is that the results of an experiment are due to chance or sampling error
  • Tests whether a null hypothesis H0 can be rejected (and hence, H can be accepted) at some chosen significance level
41
Q

Significance level alpha

A

  • The accepted risk that H0 is wrongly rejected; usually, alpha is set to 0.05 or to 0.01
  • A choice of alpha = 0.05 means that there is no more than a 5% chance that a potential rejection of H0 is wrong (in other words, with ≥ 95% confidence a potential rejection is correct)
42
Q

p-value

A

  • The likelihood (in terms of a probability) that results are due to chance
  • If p ≤ alpha, H0 is rejected; the results are seen as statistically significant
  • If p > alpha, H0 cannot be rejected
43
Q

Four steps of hypothesis testing

A

  1. Hypothesis: State H and H0
  2. Significance level: Choose alpha (always before the test)
  3. Testing: Carry out an appropriate hypothesis test to get the p-value
  4. Decision: Depending on alpha and p, reject H0 or fail to reject it
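
A sketch of the four steps for a one-sample, directional test with SciPy; the sample values and the reference value 5 are made up, and the alternative parameter assumes a recent SciPy version (1.6+):

    from scipy import stats

    # 1. Hypothesis: H: the mean is greater than 5; H0: the mean is at most 5.
    sample = [5.8, 6.1, 5.4, 6.3, 5.9, 5.2, 6.0, 5.7]

    # 2. Significance level: chosen before the test.
    alpha = 0.05

    # 3. Testing: one-sample t-test against the reference value 5 (directional).
    result = stats.ttest_1samp(sample, popmean=5, alternative="greater")

    # 4. Decision: reject H0 if p <= alpha.
    print("reject H0" if result.pvalue <= alpha else "fail to reject H0")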
44
Q

Hypothesis tests

A

  • A significance test needs to be chosen that fits the data
  • Different tests exist that make different assumptions about the data
45
Q

Parametric vs. non-parametric tests

A

  • Parametric: More powerful and precise, i.e., more likely to detect a significant effect when one truly exists
  • Non-parametric: Fewer assumptions and, thus, more often applicable
  • Each parametric test has a non-parametric correspondent
46
Q

Assumption

A

A thing that is accepted as true or as certain to happen, without proof
47
Q

Assumptions of all hypothesis tests

A

  • Sampling: The sample is a random sample from the distribution
  • Values: The values within each variable are independent
48
Q

Assumptions of all parametric tests

A

  • Scale: The dependent variable has an interval or ratio scale
  • Distribution: The given distributions are normally distributed
  • Variance: Distributions that are compared have similar variances
49
Q

Test-specific assumptions

A

  • In addition, specific tests may have specific assumptions
  • Depending on which are met, an appropriate test is chosen
50
Q

The Student's t-test

A

  • A parametric hypothesis test for small samples (n ≤ 30)
  • Computes a t-score from which significance can be derived
  • Types: one-sample t-test, dependent t-test, independent t-test (see the sketch below)
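
All three types are available in SciPy; a minimal sketch with made-up data:

    from scipy import stats

    a = [2.1, 2.5, 2.3, 2.8, 2.6]
    b = [1.9, 2.4, 2.0, 2.7, 2.2]

    print(stats.ttest_1samp(a, popmean=2.0))  # one-sample: compare the mean of a to 2.0
    print(stats.ttest_rel(a, b))              # dependent (paired) t-test
    print(stats.ttest_ind(a, b))              # independent t-test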
51
Q

Test-specific assumptions of the t-test

A

  • The independent variable has a nominal scale
  • t-tests are robust to moderate violations of the normality assumption
52
Q

One-tailed vs. two-tailed

A

  • One-tailed: Tests a directional hypothesis
  • Two-tailed: Tests a non-directional hypothesis
53
Q

One sample vs. paired samples

A

  • One sample: A sample mean is compared to a known value
  • Paired samples: Two sample means are compared to each other
54
Q

t-distribution

A

  • A variation of the normal distribution for small sample sizes
  • Dependent on the degrees of freedom (DoF) in an experiment
  • Statistics libraries (e.g., in Python) can compute t-distributions (see the sketch below)
  • Otherwise, tables exist with the significance confidences of t-values
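
A sketch of how such values can be computed with SciPy's t-distribution; the degrees of freedom and the t-score are made-up numbers:

    from scipy import stats

    dof = 9        # degrees of freedom, e.g., n - 1 for a one-sample test
    t_score = 2.5  # made-up t-score from some experiment

    # p-value of a two-tailed test: probability of a |t| at least this extreme.
    print(2 * stats.t.sf(abs(t_score), df=dof))  # ~0.034

    # Critical t-value for alpha = 0.05, two-tailed.
    print(stats.t.ppf(1 - 0.025, df=dof))  # ~2.262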
55
Q

Alternatives: What if the t-test assumptions are not met?

A

  • Test-specific assumptions: Find another parametric test that is applicable
  • Assumptions of parametric tests: Find an applicable non-parametric test
  • Assumptions of all significance tests: Hypotheses cannot be tested