Empirical Methods Flashcards

1
Q

Quantitative Methods

A

Characterized by objective measurements

2
Q

Qualitative Methods

A

Emphasizes the understanding of human experience

3
Q

Descriptive statistic

A

Methods for summarizing a sample or a distribution of values; used to describe phenomena

4
Q

Inferential statistic

A

Methods for drawing conclusions based on values; used to generalize inferences beyond a given sample. Example: "The average number is significantly higher than 5"

5
Q

Elements of empirical methods in NLP

A
  • Evaluation measures: Quantification of the quality of a method, especially its effectiveness
  • Empirical experiments: Evaluation of the quality on text corpora and comparison to alternative methods
  • Hypothesis testing: Use of statistical methods to “prove” the quality of a method in comparison to others
6
Q

Evaluation measures

A

Effectiveness:

  • The extent to which the output information of an approach is correct
  • High effectiveness is the primary goal of any NLP method
  • Classification measures: Accuracy, precision, recall, F1-score, ...
  • Regression measures: Mean absolute error, mean squared error, ...

Efficiency:

  • The costs of a method in terms of the consumption of time or space
  • Measures: Run-time, training time, memory consumption, ...
7
Q

Classification Effectiveness

A
  • The instances of each class can be evaluated in a binary manner
  • For each instance, check whether its class matches the ground truth
  • Positives: the class instances a given approach has inferred
  • Negatives: all other possible instances

Instance types in the evaluation:

  • True positive (TP): a positive that belongs to the ground truth
  • False positive (FP): a positive that does not belong to the ground truth
  • False negative (FN): a negative that belongs to the ground truth
  • True negative (TN): a negative that does not belong to the ground truth
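
The four instance types can be counted directly from gold labels and predictions. The following is a minimal sketch, not a reference implementation; the function name `count_outcomes` and the toy spam/ham labels are assumptions made for illustration.

```python
def count_outcomes(gold, predicted, positive):
    """Count TP, FP, FN, TN for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # inferred as positive, and in the ground truth
        elif p == positive:
            fp += 1  # inferred as positive, but not in the ground truth
        elif g == positive:
            fn += 1  # not inferred, although in the ground truth
        else:
            tn += 1  # not inferred, and not in the ground truth
    return tp, fp, fn, tn

# Hypothetical toy data (labels invented for this example)
gold      = ["spam", "ham", "spam", "ham", "ham", "spam"]
predicted = ["spam", "spam", "ham", "ham", "ham", "spam"]
counts = count_outcomes(gold, predicted, positive="spam")
print(counts)  # (TP, FP, FN, TN)
```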
8
Q

When to use accuracy?

A
  • Accuracy is adequate when all classes are of similar importance
  • Examples: Sentiment analysis, part-of-speech tagging, ...
9
Q

When not to use accuracy?

A
  • In tasks where one class is rare, high accuracy can be achieved by never predicting the class
    • 4% spam -> 96% accuracy by always predicting “no spam”
  • This includes tasks where the correct output information covers only portions of text, such as in entity recognition
    • “Apples rocks” → Negatives: “A”, “Ap”, “App”,…
  • Accuracy is inadequate when true negatives are of low importance
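
The spam example can be reproduced in a few lines. This sketch assumes a hypothetical dataset of 100 messages with 4 spam messages, and a "classifier" that always answers "no spam".

```python
# Hypothetical dataset: 4% spam, 96% no spam
gold = ["spam"] * 4 + ["no spam"] * 96
# Trivial approach that never predicts the rare class
predicted = ["no spam"] * 100

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
spam_found = sum(g == p == "spam" for g, p in zip(gold, predicted))

print(accuracy)    # high accuracy ...
print(spam_found)  # ... although not a single spam message was found
```

The accuracy is dominated by the true negatives, which is why it says little about how well the rare class is handled.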
10
Q

Precision

A
  • The precision P is a measure of the exactness of an approach
  • P answers: How many of the found instances are correct?
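
In terms of the instance types used for classification effectiveness (TP = true positives, FP = false positives), precision is commonly written as:

```latex
P = \frac{TP}{TP + FP}
```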
11
Q

Recall

A
  • The recall R is a measure of the completeness of an approach
  • R answers: How many of the correct instances have been found?
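
In terms of the instance types used for classification effectiveness (TP = true positives, FN = false negatives), recall is commonly written as:

```latex
R = \frac{TP}{TP + FN}
```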
12
Q

F1-score

A
  • The F1-score is the harmonic mean of precision and recall
  • F1 favors balanced over imbalanced precision and recall values
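
A small sketch of why the harmonic mean favors balanced values. The counts below are invented, and returning 0.0 when a measure is undefined is an assumed convention for this example, not something the cards prescribe.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1

# Hypothetical counts: balanced P and R vs. imbalanced P and R
balanced = precision_recall_f1(tp=6, fp=4, fn=4)    # P = 0.6, R = 0.6
imbalanced = precision_recall_f1(tp=3, fp=0, fn=7)  # P = 1.0, R = 0.3

print(balanced[2], imbalanced[2])
```

The arithmetic mean of P and R would rate the imbalanced case higher (0.65 vs. 0.6), whereas F1 rates it clearly lower.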
13
Q

Boundary errors and Issues

A

A common error in tasks where text spans need to be annotated is to choose a wrong boundary of the span

Issues
- A single boundary error leads to both an FP and an FN
- Not identifying the span at all (only an FN) would yield a higher F1-score than identifying it with a wrong boundary
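
The effect can be checked with invented numbers. Assume a hypothetical text with 10 ground-truth spans, of which 9 are found exactly; the tenth span is either annotated with a wrong boundary or skipped entirely.

```python
def f1(tp, fp, fn):
    """F1-score from TP, FP, and FN counts (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Wrong boundary: the predicted span is an FP, the missed gold span is an FN
f1_wrong_boundary = f1(tp=9, fp=1, fn=1)
# Span skipped: only the missed gold span counts, as an FN
f1_span_skipped = f1(tp=9, fp=0, fn=1)

print(f1_wrong_boundary, f1_span_skipped)
```

Under exact-match evaluation, the nearly correct span is punished more than no span at all.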

14
Q

How to deal with boundary errors

A
  • Different ways to account for the issue have been proposed, but the standard F1 is still used in most evaluations
  • A relaxed evaluation is to consider some character overlap instead of exact boundaries
15
Q

Evaluation of multi-class tasks

A
  • In general, each class in a multi-class task can be evaluated in a binary manner
  • Accuracy can be computed for any number k of classes
  • The other measures must be combined with micro- or macro-averaging
16
Q

Micro-averaged precision

A

Micro-averaging takes into account the number of instances per class, so larger classes get more importance

17
Q

Macro-averaged precision:

A

Macro-averaging computes the mean result over all classes, so each class gets the same importance
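
The difference between micro- and macro-averaging shows most clearly when class sizes differ. The sketch below uses invented per-class counts for one large and one small class; the class names are placeholders.

```python
# Hypothetical per-class counts: class -> (true positives, false positives)
counts = {"large": (90, 10), "small": (1, 4)}

# Micro-averaging: pool the counts of all classes, then compute precision once
total_tp = sum(tp for tp, fp in counts.values())
total_fp = sum(fp for tp, fp in counts.values())
micro_precision = total_tp / (total_tp + total_fp)

# Macro-averaging: compute precision per class, then take the unweighted mean
per_class = [tp / (tp + fp) for tp, fp in counts.values()]
macro_precision = sum(per_class) / len(per_class)

print(micro_precision, macro_precision)
```

The micro-averaged value stays close to the precision of the large class, while the macro-averaged value is pulled down by the poorly handled small class.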

18
Q

Confusion matrix

A
  • Each row refers to the ground-truth instances of one of k classes
  • Each column refers to the classified instances of one class
  • The cells contain the numbers of correct and incorrect classifications of a given approach
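
A minimal sketch of building such a matrix with rows for ground-truth classes and columns for classified classes. The function name and the three sentiment labels are assumptions for illustration only.

```python
def confusion_matrix(gold, predicted, classes):
    """Rows: ground-truth class. Columns: class assigned by the approach."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for g, p in zip(gold, predicted):
        matrix[index[g]][index[p]] += 1
    return matrix

# Hypothetical toy data
classes   = ["pos", "neu", "neg"]
gold      = ["pos", "pos", "neu", "neg", "neg", "neu"]
predicted = ["pos", "neu", "neu", "neg", "pos", "neu"]
matrix = confusion_matrix(gold, predicted, classes)

for name, row in zip(classes, matrix):
    print(name, row)
```

The diagonal holds the correct classifications; every off-diagonal cell shows which ground-truth class was confused with which classified class.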
19
Q

Why confusion matrices

A
  • Used to analyze errors, i.e., to see which classes are confused
  • Contains all values for computing micro- and macro-averaged results
20
Q

Types of prediction errors

A
  • Mean absolute error (MAE)
    • The mean absolute difference of predicted to ground-truth values
    • The MAE is robust to outliers, i.e., it does not treat them specially
  • Mean squared error (MSE)
    • The mean squared difference of predicted to ground-truth values
    • The MSE is specifically sensitive to outliers

Sometimes, the root mean squared error (RMSE) is also computed, defined as RMSE = sqrt(MSE)
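
The three measures can be computed side by side. The values below are invented; the last prediction is deliberately far off to act as an outlier.

```python
import math

# Hypothetical ground-truth and predicted values
gold      = [3.0, 5.0, 2.0, 4.0]
predicted = [2.0, 5.0, 4.0, 8.0]

errors = [p - g for g, p in zip(gold, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print(mae, mse, rmse)
```

The single error of 4 contributes 16 of the 21 squared-error units but only 4 of the 7 absolute-error units, which illustrates the sensitivity of the MSE to outliers.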

21
Q

Empirical Experiments:

A
  • An empirical experiment tests a hypothesis based on observations
  • The focus here is on effectiveness evaluation in NLP
22
Q

Intrinsic vs extrinsic effectiveness evaluation:

A
  • Intrinsic: the effectiveness of an approach is directly evaluated on the task it is made for:
    • What accuracy does a part-of-speech tagger XY have on the dataset D?
  • Extrinsic: the effectiveness of an approach is evaluated by measuring how effective its output is in a downstream task
    • Does the output of XY improve sentiment analysis on D?
24
Q

Text corpora

A
  • A principled collection of natural language texts with known properties, compiled to study a language problem
  • The texts in a corpus are often annotated, at least for the problem to be studied

Need for text corpora:

  • NLP approaches are developed and evaluated on text corpora
  • Without a corpus, it's hard to develop a strong approach and impossible to reliably evaluate it

Annotation:

  • An annotation marks a text or a span of text as representing meta-information of a specific type
  • It may also be used to specify relations between other annotations
  • The types are specified by an annotation scheme
25
Q

Types of Annotations of corpora

A

Annotated corpora in NLP:

  • Usually, a corpus contains annotations of information types of interest in a task or domain

Manual annotations

  • The annotations of a text corpus are usually created manually
  • To assess the quality of manual annotations, inter-annotator agreement is computed based on texts annotated multiple times

Ground-truth annotations:

  • Manual annotations assumed to be correct are called the ground truth
  • NLP usually learns from ground-truth annotations

Automatic annotation:

  • Technically, NLP algorithms can be seen as just adding annotations of certain types to a processed text
  • The automatic process usually aims to mimic the manual process
26
Q

Development and Evaluation

A

Dataset: A sub-corpus of a corpus that is compiled and used for developing and/or evaluating approaches to specific tasks

Development and evaluation based on datasets:

  1. An approach is developed based on a set of training instances
  2. The approach is applied to a set of test instances
  3. The output of the approach is compared to the ground truth of the test instances using evaluation measures
  4. Steps 1-3 may be iteratively repeated to improve the approach
27
Q

Corpus splitting

A
  • The split of a corpus into datasets should represent the task well
  • The way a corpus is split implies how to evaluate
  • Main evaluation types: training, validation, and test vs. cross-validation
28
Q

Training, Validation and Test

A
  • Training set:
    • Known instances used to develop or statistically learn an approach
    • The training set may be analyzed manually and automatically
  • Validation set:
    • Unknown test instances used to iteratively evaluate an approach
    • The approach is optimized on (and adapts to) the validation set
  • Test set:
    • Unknown test instances used for the final evaluation of an approach
    • The test set represents unseen data
29
Q

Cross Validation

A
  • Stratified n-fold cross-validation
    • A corpus is split into n dataset folds of equal size, usually n = 10
    • n runs: the evaluation results are averaged over n runs
    • i-th run: the i-th fold is used for evaluation (validation). All other folds are used for development (training)
  • Pros and cons of cross-validation
    • Often preferred when data is small, as more data is given for training
    • Cross-validation avoids potential bias in a corpus split
    • Random splitting often makes the task easier, due to corpus bias
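
The fold mechanics can be sketched in plain Python. This is a simplified illustration, not a library implementation: the function names are invented, instances are dealt to folds in round-robin order per class to keep class proportions similar, and no shuffling is done (a real setup would usually shuffle first).

```python
from collections import defaultdict

def stratified_folds(labels, n):
    """Assign instance indices to n folds, class by class, in round-robin order."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n)]
    for indices in by_class.values():
        for position, i in enumerate(indices):
            folds[position % n].append(i)
    return folds

def cross_validation_runs(labels, n):
    """Yield (training indices, validation indices) for each of the n runs."""
    folds = stratified_folds(labels, n)
    for i in range(n):
        # i-th run: fold i is held out for validation, all others are for training
        training = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield training, folds[i]

# Hypothetical corpus of 9 instances with a 2:1 class ratio
labels = ["pos"] * 6 + ["neg"] * 3
runs = list(cross_validation_runs(labels, n=3))

for training, validation in runs:
    print(sorted(training), validation)
```

An evaluation would train and validate once per run and then average the n results.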
30
Q

Variations of Cross validation

A
  • Repeated cross-validation
    • Often, cross-validation is repeated multiple times with different folds
    • This way, coincidental effects of random splitting are accounted for
  • Leave-one-out validation
    • Cross-validation where n equals the number of instances
    • This way, any potential bias in the splitting is avoided
    • But even more data is given for training, which makes a task easier
  • Cross-validation + test set
    • When doing cross-validation, a held-out test set is still important
    • Otherwise, repeated development will overfit to the splitting
31
Q

Comparison (Upper Bounds and Lower Bounds)

A

Why compare?

  • Approaches may be compared to a gold standard and to baselines in order to put their measured effectiveness into perspective
  • Gold standard (upper bound)
    • The gold standard represents the best possible result on a given task
    • For many tasks, the effectiveness that humans achieve is seen as best
    • If not available, the gold standard is often equated with the ground truth in a corpus
  • Baseline (lower bound)
    • A baseline is an alternative approach that has been proposed before or that can easily be realized
    • A new approach should be better than all baselines
32
Q

Types of baselines for comparison

A
  • Trivial baselines:
    • Methods that can easily be derived from a given task or dataset
    • Used to evaluate whether a new approach achieves anything
  • Standard baseline:
    • Methods that are often used for related tasks
    • Used to evaluate how hard a task is
  • Sub-approaches
    • Sub-approaches of a new approach
    • Used to analyze the impact of the different parts of an approach
  • State of the art
    • The best published methods for the addressed task
    • Used to verify whether a new approach is best
33
Q

Implications

A
  • When does comparison work?
    • Variations of a task may affect its complexity
    • The same task may have different complexity on different datasets
    • Two approaches can be compared reasonably only in exactly the same experiment setting
34
Q

Hypothesis Testing (Statistics)
Variable

A
  • An entity that can take on different quantitative or qualitative values
    • Independent: A variable X that is expected to affect another variable
    • Dependent: A variable Y that is expected to be affected by others
35
Q

Scales of variables (they are not like functions)

A
  • Nominal: Values that represent discrete, separate categories (lowest level of measurement; categorical)
  • Ordinal: Values that can be ranked by what is better, i.e., ordered from lowest to highest (e.g., a Likert scale, or win, place, show)
  • Interval: Values whose distance can be measured, i.e., the distance means something (e.g., temperature on a thermometer)
  • Ratio: Interval values that have a “true zero”, which indicates the absence of what the variable represents (e.g., zero dollars means no money)
36
Q

Inferential Statistics

A
  • Procedures that help study hypotheses based on values
  • Used to make inferences about a distribution beyond a given sample
37
Q

Two competing hypotheses

A
  • Research Hypothesis (H): prediction about how a change in variables will cause changes in other variables
  • Null hypothesis (H0): Antithesis to H
  • If H0 is true, then any results observed in an experiment that support H are due to chance or sampling error
38
Q

Two types of hypotheses:

A
  • Directional: Specifies the direction of an expected difference
  • Non-directional: Specifies only that some difference is expected, not its direction
39
Q

Good hypotheses

A
  • Founded in a problem statement and supported by research
  • Testable, i.e., it is possible to collect data to study the hypothesis
  • States an expected relationship between variables
  • Phrased as simply and concisely as possible
40
Q

Hypothesis test:

A
  • A statistical procedure that determines how likely it is that the results of an experiment are due to chance or sampling error
  • Tests whether a null hypothesis H0 can be rejected (and hence, H can be accepted) at some chosen significance level
41
Q

Significance level alpha

A
  • The accepted risk that H0 is wrongly rejected (usually, alpha is set to 0.05 or to 0.01)
  • A choice of alpha = 0.05 means that there is no more than a 5% chance that a potential rejection of H0 is wrong (in other words, with ≥ 95% confidence a potential rejection is correct)
42
Q

p-value:

A
  • The probability of obtaining results at least as extreme as the observed ones if H0 is true; informally, the likelihood that the results are due to chance
  • If p ≤ alpha, H0 is rejected. The results are seen as statistically significant
  • If p > alpha, H0 cannot be rejected
43
Q

Four steps of hypothesis testing

A
  1. Hypothesis: State H and H0
  2. Significance level: Choose alpha (always before the test)
  3. Testing: Carry out an appropriate hypothesis test to get the p-value
  4. Decision: Depending on alpha and p, reject H0 or fail to reject it
44
Q

Hypothesis tests

A
  • A significance test needs to be chosen that fits the data
  • Different tests exist that make different assumptions about the data
45
Q

Parametric vs. non-parametric tests

A
  • Parametric: More powerful and precise, i.e., it is more likely to detect a significant effect when one truly exists
  • Non-parametric: Fewer assumptions and, thus, more often applicable
  • Each parametric test has a non-parametric counterpart
46
Q

Assumptions

A

A thing that is accepted as true or as certain to happen, without proof

47
Q

Assumptions of all hypothesis tests

A
  • Sampling: The sample is a random sample from the distribution
  • Values: The values within each variable are independent
48
Q

Assumption of all parametric tests

A
  • Scale: The dependent variable has an interval or ratio scale
  • Distribution: The given distributions are normally distributed
  • Variance: Distributions that are compared have similar variances
49
Q

Test-specific assumptions

A
  • In addition, specific tests may have specific assumptions
  • Depending on which are met, an appropriate test is chosen
50
Q

The Student's t-Test:

A
  • A parametric hypothesis test for small samples (n ≤ 30)
  • Computes a t-score from which significance can be derived
  • Types: One-sample t-test, dependent t-test, independent t-test
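
A sketch of a one-sample t-test for a directional hypothesis such as "the average number is significantly higher than 5". The sample values are invented, the function name is an assumption, and the critical value 1.833 (one-tailed, alpha = 0.05, 9 degrees of freedom) is taken from a standard t-table rather than computed.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t-score for comparing the sample mean to the known value mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))

# Hypothetical sample of n = 10 values; H: mean > 5, H0: mean <= 5
sample = [6, 7, 5, 8, 6, 7, 6, 5, 7, 8]
t_score = one_sample_t(sample, mu0=5)

# Critical t-value from a t-table: one-tailed, alpha = 0.05, DoF = n - 1 = 9
critical = 1.833
significant = t_score > critical

print(round(t_score, 3), significant)
```

Comparing the t-score to a tabulated critical value is equivalent to checking p ≤ alpha; a statistics library would report the p-value directly.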
51
Q

Test-specific assumptions

A
  • The independent variable has a nominal scale
  • t-tests are robust to moderate violations of the normality assumption
52
Q

One-tailed vs two-tailed

A
  • One-tailed: Tests a directional hypothesis
  • Two tailed: Tests a non-directional hypothesis
53
Q

One sample vs paired samples

A
  • One sample: a sample mean is compared to a known value
  • Paired samples: two sample means are compared to each other
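
For paired samples, the dependent t-test reduces to a one-sample test on the pairwise differences against 0. The sketch below assumes invented accuracy values (in percent) of a new approach and a baseline on the same five datasets; the critical value 2.776 (two-tailed, alpha = 0.05, 4 degrees of freedom) is taken from a standard t-table.

```python
import math
import statistics

def paired_t(first, second):
    """t-score of the dependent t-test, computed on the pairwise differences."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    mean = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation (divides by n - 1)
    return mean / (sd / math.sqrt(n))

# Hypothetical accuracies (in percent) on the same five datasets
new_approach = [82, 79, 85, 80, 84]
baseline     = [78, 77, 80, 79, 80]
t_score = paired_t(new_approach, baseline)

# Critical t-value from a t-table: two-tailed, alpha = 0.05, DoF = n - 1 = 4
critical = 2.776
significant = abs(t_score) > critical

print(round(t_score, 3), significant)
```

The two-tailed comparison uses the absolute t-score, since a non-directional hypothesis accepts a difference in either direction.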
54
Q

t-distribution

A
  • Variation of the normal distribution for small sample sizes
  • Dependent on the degrees of freedom(DoF) in an experiment
  • Statistics libraries (e.g., in Python) can compute t-distributions
  • Otherwise, tables exist that list the significance thresholds of t-values
55
Q

Alternatives:
What if the t-test assumptions are not met?

A
  • Test-specific assumption: find another parametric test that is applicable
  • Assumptions of parametric tests: find an applicable non-parametric test
  • Assumptions of all significance tests: Hypotheses cannot be tested