Empirical Methods Flashcards

1
Q

Quantitative Methods

A

Characterized by objective measurements

2
Q

Qualitative Methods

A

Emphasizes the understanding of human experience

3
Q

Descriptive statistics

A

Methods for summarizing a sample or a distribution of values; used to describe phenomena

4
Q

Inferential statistics

A

Methods for drawing conclusions based on values; used to generalize inferences beyond a given sample, e.g., “The average number is significantly higher than 5”

5
Q

Elements of empirical methods in NLP

A
  • Evaluation measures: Quantification of the quality of a method, especially its effectiveness
  • Empirical experiments: Evaluation of the quality on text corpora and comparison to alternative methods
  • Hypothesis testing: Use of statistical methods to “prove” the quality of a method in comparison to others
6
Q

Evaluation measures

A

Effectiveness:

  • The extent to which the output information of an approach is correct
  • High effectiveness is the primary goal of any NLP method
  • Classification measures: Accuracy, precision, recall, F1-score, …
  • Regression measures: mean absolute/squared error,…

Efficiency:

  • The costs of a method in terms of the consumption of time or space
  • Measures: Run-time, training time, memory consumption, …
7
Q

Classification Effectiveness

A
  • The instances of each class can be evaluated in a binary manner
  • For each instance, check whether its class matches the ground truth
  • Positives: the class instances a given approach has inferred
  • Negatives: all other possible instances

Instance types in the evaluation:

  • True positive (TP): a positive that belongs to the ground truth
  • False positive (FP): a positive that does not belong to the ground truth
  • False negative (FN): a negative that belongs to the ground truth
  • True negative (TN): a negative that does not belong to the ground truth
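
A minimal sketch of how these four counts can be tallied for a single class, assuming predictions and ground-truth labels are given as plain Python lists (all names and data here are illustrative):

    # Tally TP, FP, FN, TN for one class of interest (illustrative sketch).
    def count_outcomes(predicted, ground_truth, positive_class):
        tp = fp = fn = tn = 0
        for pred, truth in zip(predicted, ground_truth):
            if pred == positive_class and truth == positive_class:
                tp += 1  # positive that belongs to the ground truth
            elif pred == positive_class:
                fp += 1  # positive that does not belong to the ground truth
            elif truth == positive_class:
                fn += 1  # negative that belongs to the ground truth
            else:
                tn += 1  # negative that does not belong to the ground truth
        return tp, fp, fn, tn

    print(count_outcomes(["spam", "ham", "ham"], ["spam", "spam", "ham"], "spam"))
    # -> (1, 0, 1, 1)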
8
Q

When to use accuracy?

A
  • Accuracy is adequate when all classes are of similar importance
  • Examples: Sentiment analysis, part-of-speech tagging, …
9
Q

When not to use accuracy?

A
  • In tasks where one class is rare, high accuracy can be achieved by never predicting the class
    • 4% spam → 96% accuracy by always predicting “no spam” (see the sketch below)
  • This includes tasks where the correct output information covers only portions of text, such as in entity recognition
    • “Apples rocks” → Negatives: “A”, “Ap”, “App”,…
  • Accuracy is inadequate when true negatives are of low importance
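
The spam example can be reproduced in a few lines; the 4% spam rate is taken from the bullet above, everything else is illustrative:

    # Majority-class "classifier": always predict "no spam".
    labels = ["spam"] * 4 + ["no spam"] * 96  # 4% spam, as in the example
    predictions = ["no spam"] * len(labels)

    correct = sum(p == t for p, t in zip(predictions, labels))
    print(correct / len(labels))  # 0.96 -- high accuracy, yet no spam is ever detected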
10
Q

Precision

A
  • The precision P is a measure of the exactness of an approach
  • P answers: How many of the found instances are correct?
11
Q

Recall

A
  • The recall R is a measure of the completeness of an approach
  • R answers: How many of the correct instances have been found?
12
Q

F1-score

A
  • The F1-score is the harmonic mean of precision and recall
  • F1 favors balanced over imbalanced precision and recall values
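
A minimal sketch that puts the three measures together, computing precision, recall, and F1-score from the outcome counts defined earlier (the counts are made up):

    # Precision, recall, and F1-score from outcome counts (illustrative sketch).
    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0  # exactness
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0     # completeness
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)           # harmonic mean
        return precision, recall, f1

    print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, ~0.667, ~0.727)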
13
Q

Boundary errors and Issues

A

A common error in tasks where text spans need to be annotated is to choose a wrong boundary of the span

Issues
- Leads to both an FP and an FN
- Predicting nothing at all would thus have led to a higher F1-score than predicting the span with a wrong boundary

14
Q

How to deal with boundary errors

A
  • Different ways of accounting for the issue have been proposed, but the standard F1-score is still used in most evaluations
  • A relaxed evaluation is to consider some character overlap instead of exact boundaries (see the sketch below)
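
A sketch of what such a relaxed evaluation could look like, assuming spans are given as (start, end) character offsets and any overlap counts as a match; this is one possible relaxation, not a fixed standard:

    # Relaxed span matching: a predicted span counts as correct if it
    # overlaps the ground-truth span by at least one character.
    def spans_overlap(pred, truth):
        pred_start, pred_end = pred
        truth_start, truth_end = truth
        return pred_start < truth_end and truth_start < pred_end

    print(spans_overlap((0, 5), (1, 6)))  # True: partial overlap is accepted
    print(spans_overlap((0, 5), (7, 9)))  # False: no overlap at all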
15
Q

Evaluation of multi-class tasks

A
  • In general, each class in a multi-class task can be evaluated in a binary manner
  • Accuracy can be computed for any number k of classes
  • The other measures must be combined with micro- or macro-averaging
16
Q

Micro-averaged precision

A

Micro-averaging takes into account the number of instances per class, so larger classes get more importance

17
Q

Macro-averaged precision

A

Macro-averaging computes the mean result over all classes, so each class gets the same importance
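
Both averaging schemes are implemented in common libraries. A sketch using scikit-learn's precision_score, assuming scikit-learn is installed (the labels are made up):

    from sklearn.metrics import precision_score

    truth = ["a", "a", "a", "a", "b", "c"]
    preds = ["a", "a", "b", "c", "b", "b"]

    # Micro: pools all instances, so the large class "a" dominates.
    print(precision_score(truth, preds, average="micro", zero_division=0))
    # Macro: unweighted mean of the per-class precision values.
    print(precision_score(truth, preds, average="macro", zero_division=0))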

18
Q

Confusion matrix

A
  • Each row refers to the ground-truth instances of one of k classes
  • Each column refers to the classified instances of one class
  • The cells contain the numbers of correct and incorrect classifications of a given approach
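
A sketch of how such a matrix can be computed with scikit-learn (the labels are made up):

    from sklearn.metrics import confusion_matrix

    truth = ["a", "a", "a", "b", "b", "c"]
    preds = ["a", "b", "a", "b", "c", "c"]

    # Rows: ground-truth classes; columns: classified classes (order: a, b, c).
    print(confusion_matrix(truth, preds, labels=["a", "b", "c"]))
    # [[2 1 0]
    #  [0 1 1]
    #  [0 0 1]]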
19
Q

Why confusion matrices?

A
  • Used to analyze errors, i.e., to see which classes are confused with each other
  • Contains all values for computing micro- and macro-averaged results
20
Q

Types of prediction errors

A
  • Mean absolute error (MAE)
    • The mean absolute difference between predicted and ground-truth values
    • The MAE is robust to outliers, i.e., it does not treat them specially
  • Mean squared error (MSE)
    • The mean squared difference between predicted and ground-truth values
    • The MSE is specifically sensitive to outliers

Sometimes, the root mean squared error (RMSE) is also computed, defined as RMSE = sqrt(MSE)
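
A minimal sketch of the three error measures, assuming predictions and ground-truth values are given as plain lists (the numbers are made up):

    import math

    truth = [2.0, 3.0, 5.0, 10.0]
    preds = [2.5, 3.0, 4.0, 16.0]  # the last prediction is an outlier

    n = len(truth)
    mae = sum(abs(p - t) for p, t in zip(preds, truth)) / n    # robust to the outlier
    mse = sum((p - t) ** 2 for p, t in zip(preds, truth)) / n  # penalizes the outlier strongly
    rmse = math.sqrt(mse)

    print(mae, mse, rmse)  # 1.875 9.3125 ~3.052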

21
Q

Empirical Experiments

A
  • An empirical experiment tests a hypothesis based on observations
  • The focus here is on effectiveness evaluation in NLP
22
Q

Intrinsic vs. extrinsic effectiveness evaluation

A
  • Intrinsic: the effectiveness of an approach is directly evaluated on the task it is made for:
    • What accuracy does a part-of-speech tagger XY have on the dataset D?
  • Extrinsic: the effectiveness of an approach is evaluated by measuring how effective its output is in a downstream task
    • Does the output of XY improve sentiment analysis on D?
24
Q

Text corpora

A
  • A principled collection of natural language texts with known properties, compiled to study a language problem
  • The texts in a corpus are often annotated, at least for the problem to be studied

Need for text corpora:

  • NLP approaches are developed and evaluated on text corpora
  • Without a corpus, it's hard to develop a strong approach and impossible to reliably evaluate it

Annotation:

  • An annotation marks a text or a span of text as representing meta-information of a specific type
  • It may also be used to specify relations between other annotations
  • The types are specified by an annotation scheme
25
Q

Types of Annotations of corpora

A

Annotated corpora in NLP:

  • Usually, a corpus contains annotations of the information types of interest in a task or domain

Manual annotations:

  • The annotations of a text corpus are usually created manually
  • To assess the quality of manual annotations, inter-annotator agreement is computed based on texts annotated multiple times

Ground-truth annotations:

  • Manual annotations assumed to be correct are called the ground truth
  • NLP usually learns from ground-truth annotations

Automatic annotation:

  • Technically, NLP algorithms can be seen as just adding annotations of certain types to a processed text
  • The automatic process usually aims to mimic the manual process
26
Q

Development and Evaluation

A

Dataset: A sub-corpus of a corpus that is compiled and used for developing and/or evaluating approaches to specific tasks

Development and evaluation based on datasets:

  1. An approach is developed based on a set of training instances
  2. The approach is applied to a set of test instances
  3. The output of the approach is compared to the ground truth of the test instances using evaluation measures
  4. Steps 1-3 may be iteratively repeated to improve the approach
27
Q

Corpus splitting

A

  • The split of a corpus into datasets should represent the task well
  • The way a corpus is split implies how to evaluate
  • Main evaluation types: (training, validation, and test) vs. cross-validation
28
Q

Training, Validation and Test

A

  • Training set:
    • Known instances used to develop or statistically learn an approach
    • The training set may be analyzed manually and automatically
  • Validation set:
    • Unknown test instances used to iteratively evaluate an approach
    • The approach is optimized on (and adapts to) the validation set
  • Test set:
    • Unknown test instances used for the final evaluation of an approach
    • The test set represents unseen data
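
One common way to realize such a split, sketched with scikit-learn's train_test_split; the 60/20/20 ratio is an assumption for illustration, not a fixed rule:

    from sklearn.model_selection import train_test_split

    X = list(range(100))    # placeholder instances
    y = [i % 2 for i in X]  # placeholder labels

    # First split off the test set, then split the rest into training and validation.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # 60 20 20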
29
Q

Cross Validation

A

  • Stratified n-fold cross-validation:
    • A corpus is split into n dataset folds of equal size, usually n = 10
    • n runs: the evaluation results are averaged over the n runs
    • i-th run: the i-th fold is used for evaluation (validation); all other folds are used for development (training)
  • Pros and cons of cross-validation:
    • Often preferred when data is small, as more data is given for training
    • Cross-validation avoids potential bias in a corpus split
    • Random splitting often makes the task easier, due to corpus bias
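
A sketch of stratified 10-fold cross-validation with scikit-learn's StratifiedKFold; the data and labels are placeholders:

    from sklearn.model_selection import StratifiedKFold

    X = list(range(50))     # placeholder instances
    y = [i % 2 for i in X]  # placeholder binary labels

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for i, (train_idx, eval_idx) in enumerate(skf.split(X, y)):
        # i-th run: train on all other folds, evaluate on the i-th fold.
        print(f"run {i}: {len(train_idx)} training, {len(eval_idx)} evaluation instances")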
30
Q

Variations of Cross validation

A

  • Repeated cross-validation:
    • Often, cross-validation is repeated multiple times with different folds
    • This way, coincidental effects of random splitting are accounted for
  • Leave-one-out validation:
    • Cross-validation where n equals the number of instances
    • This way, any potential bias in the splitting is avoided
    • But even more data is given for training, which makes a task easier
  • Cross-validation + test set:
    • When doing cross-validation, a held-out test set is still important
    • Otherwise, repeated development will overfit to the splitting
31
Q

Comparison (Upper Bounds and Lower Bounds)

A

Why compare?

  • Approaches may be compared to a gold standard and to baselines to put their effectiveness into perspective

Gold standard (upper bound):

  • The gold standard represents the best possible result on a given task
  • For many tasks, the effectiveness that humans achieve is seen as best
  • If not available, the gold standard is often equated with the ground truth in a corpus

Baseline (lower bound):

  • A baseline is an alternative approach that has been proposed before or that can easily be realized
  • A new approach should be better than all baselines
32
Q

Types of Baselines for Comparison

A

  • Trivial baselines:
    • Methods that can easily be derived from a given task or dataset
    • Used to evaluate whether a new approach achieves anything
  • Standard baselines:
    • Methods that are often used for related tasks
    • Used to evaluate how hard a task is
  • Sub-approaches:
    • Sub-approaches of the new approach
    • Used to analyze the impact of the different parts of an approach
  • State of the art:
    • The best published methods for the addressed task
    • Used to verify whether a new approach is best
33
Q

Implications

A

When does comparison work?

  • Variations of a task may affect its complexity
  • The same task may have different complexity on different datasets
  • Only in exactly the same experiment setting can two approaches be compared reasonably
34
Q

Hypothesis Testing (Statistics): Variable

A

  • An entity that can take on different quantitative or qualitative values
  • Independent: A variable X that is expected to affect another variable
  • Dependent: A variable Y that is expected to be affected by others
35
Q

Scales of variables (they are not like functions)

A

  • Nominal: Values that represent discrete, separate categories (lowest level of measurement; categorical)
  • Ordinal: Values that can be ranked from lowest to highest (e.g., a Likert scale, or win/place/show)
  • Interval: Values whose distance can be measured, i.e., distance means something (e.g., temperature on a thermometer)
  • Ratio: Interval values that have a “true zero”, i.e., a zero that indicates the absence of what the variable represents (e.g., zero dollars means no money)
36
Q

Inferential Statistics

A

  • Procedures that help study hypotheses based on values
  • Used to make inferences about a distribution beyond a given sample
37
Q

Two competing hypotheses

A

  • Research hypothesis (H): A prediction about how a change in variables will cause changes in other variables
  • Null hypothesis (H0): The antithesis to H
  • If H0 is true, then any results observed in an experiment that support H are due to chance or sampling error
38
Q

Two types of hypotheses

A

  • Directional: Specifies the direction of an expected difference
  • Non-directional: Specifies only that some difference is expected
39
Q

Good hypotheses

A

  • Founded in a problem statement and supported by research
  • Testable, i.e., it is possible to collect data to study the hypothesis
  • State an expected relationship between variables
  • Phrased as simply and concisely as possible
40
Q

Hypothesis test

A

  • A statistical procedure that determines how likely it is that the results of an experiment are due to chance or sampling error
  • Tests whether a null hypothesis H0 can be rejected (and hence, H can be accepted) at some chosen significance level
41
Q

Significance level alpha

A

  • The accepted risk that H0 is wrongly rejected; usually, alpha is set to 0.05 or to 0.01
  • A choice of alpha = 0.05 means that there is no more than a 5% chance that a potential rejection of H0 is wrong (in other words, with ≥ 95% confidence a potential rejection is correct)
42
Q

p-value

A

  • The likelihood (in terms of a probability) that results are due to chance
  • If p ≤ alpha, H0 is rejected; the results are seen as statistically significant
  • If p > alpha, H0 cannot be rejected
43
Q

Four steps of hypothesis testing

A

  1. Hypothesis: State H and H0
  2. Significance level: Choose alpha (always before the test)
  3. Testing: Carry out an appropriate hypothesis test to get the p-value
  4. Decision: Depending on alpha and p, reject H0 or fail to reject it
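
A sketch of the four steps for a one-sample, directional test with SciPy; the sample values and the reference value 5 are made up, and the alternative parameter assumes a recent SciPy version (1.6+):

    from scipy import stats

    # 1. Hypothesis: H: the mean is greater than 5; H0: the mean is at most 5.
    sample = [5.8, 6.1, 5.4, 6.3, 5.9, 5.2, 6.0, 5.7]

    # 2. Significance level: chosen before the test.
    alpha = 0.05

    # 3. Testing: one-sample t-test against the reference value 5 (directional).
    result = stats.ttest_1samp(sample, popmean=5, alternative="greater")

    # 4. Decision: reject H0 if p <= alpha.
    print("reject H0" if result.pvalue <= alpha else "fail to reject H0")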
44
Q

Hypothesis tests

A

  • A significance test needs to be chosen that fits the data
  • Different tests exist that make different assumptions about the data
45
Q

Parametric vs. non-parametric tests

A

  • Parametric: More powerful and precise, i.e., more likely to detect a significant effect when one truly exists
  • Non-parametric: Fewer assumptions and, thus, more often applicable
  • Each parametric test has a non-parametric correspondent
46
Q

Assumption

A

A thing that is accepted as true or as certain to happen, without proof
47
Q

Assumptions of all hypothesis tests

A

  • Sampling: The sample is a random sample from the distribution
  • Values: The values within each variable are independent
48
Q

Assumptions of all parametric tests

A

  • Scale: The dependent variable has an interval or ratio scale
  • Distribution: The given distributions are normally distributed
  • Variance: Distributions that are compared have similar variances
49
Q

Test-specific assumptions

A

  • In addition, specific tests may have specific assumptions
  • Depending on which are met, an appropriate test is chosen
50
Q

The Student's t-test

A

  • A parametric hypothesis test for small samples (n ≤ 30)
  • Computes a t-score from which significance can be derived
  • Types: one-sample t-test, dependent t-test, independent t-test (see the sketch below)
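
All three types are available in SciPy; a minimal sketch with made-up data:

    from scipy import stats

    a = [2.1, 2.5, 2.3, 2.8, 2.6]
    b = [1.9, 2.4, 2.0, 2.7, 2.2]

    print(stats.ttest_1samp(a, popmean=2.0))  # one-sample: compare the mean of a to 2.0
    print(stats.ttest_rel(a, b))              # dependent (paired) t-test
    print(stats.ttest_ind(a, b))              # independent t-test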
51
Q

Test-specific assumptions of the t-test

A

  • The independent variable has a nominal scale
  • t-tests are robust to moderate violations of the normality assumption
52
Q

One-tailed vs. two-tailed

A

  • One-tailed: Tests a directional hypothesis
  • Two-tailed: Tests a non-directional hypothesis
53
Q

One sample vs. paired samples

A

  • One sample: A sample mean is compared to a known value
  • Paired samples: Two sample means are compared to each other
54
Q

t-distribution

A

  • A variation of the normal distribution for small sample sizes
  • Dependent on the degrees of freedom (DoF) in an experiment
  • Statistics libraries (e.g., in Python) can compute t-distributions (see the sketch below)
  • Otherwise, tables exist with the significance confidences of t-values
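
A sketch of how such values can be computed with SciPy's t-distribution; the degrees of freedom and the t-score are made-up numbers:

    from scipy import stats

    dof = 9        # degrees of freedom, e.g., n - 1 for a one-sample test
    t_score = 2.5  # made-up t-score from some experiment

    # p-value of a two-tailed test: probability of a |t| at least this extreme.
    print(2 * stats.t.sf(abs(t_score), df=dof))  # ~0.034

    # Critical t-value for alpha = 0.05, two-tailed.
    print(stats.t.ppf(1 - 0.025, df=dof))  # ~2.262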
55
Q

Alternatives: What if the t-test assumptions are not met?

A

  • Test-specific assumptions: Find another parametric test that is applicable
  • Assumptions of parametric tests: Find an applicable non-parametric test
  • Assumptions of all significance tests: Hypotheses cannot be tested