Week 6 - The AI Pipeline, Data Annotation, Presenting a Proposal Flashcards

1
Q

What are datasheets for datasets

A

Datasheets for datasets are structured documents that provide detailed information about the creation, composition, and use of a dataset. They include metadata such as the dataset’s purpose, sources, collection methods, preprocessing steps, and ethical considerations. These datasheets aim to enhance transparency, reproducibility, and responsible usage of data.
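As a rough sketch (not from the lecture), the kind of metadata a datasheet records could be laid out as below; every field name and value here is hypothetical:

```python
# Purely illustrative datasheet fields for a hypothetical sentiment dataset;
# none of these names or values come from the lecture.
datasheet = {
    "motivation": "study sentiment in product reviews",
    "composition": {"size": 10_000, "splits": {"train": 0.8, "dev": 0.1, "test": 0.1}},
    "collection": {"time_period": "2022-2023", "sampling": "random sample of public reviews"},
    "legal_and_ethical": {"consent": "annotators consented via the platform terms"},
}
print(datasheet["motivation"])
```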

2
Q

What are the 4 points in the datasheets for datasets

A
  1. Motivation for dataset construction
  2. Dataset composition (recommended splits, evaluation metrics)
  3. Data collection process (time period, sampling strategy)
  4. Legal and Ethical considerations (consent)
3
Q

What is the difference between annotation and extraction

A

We can make a distinction between unlabelled data that was merely extracted, and labelled data that was annotated (by humans)

4
Q

What is crowdsourcing

A

Crowdsourcing is the practice of gathering (and rewarding) human participants to perform tasks such as annotation; it is often done through dedicated platforms

5
Q

What are the review checks for crowdsourcing

A
  1. Instructions: does the study include instructions for participants
  2. Recruitment: is it clear how participants were recruited
  3. Consent: is it clear how consent was obtained
  4. Demographics: is it clear what the participant demographic is
6
Q

What could go wrong with crowdsourcing

A

Certain words or labels could be over-represented because of the demographic make-up of the crowdworkers, biasing the resulting dataset

7
Q

What is annotator aggregation

A

Annotator aggregation is a process used in research, particularly in data labeling and machine learning, where multiple annotations or labels provided by different annotators are combined to produce a single, more accurate and reliable label for each data point. This helps to reduce individual biases and errors, improving the overall quality of the dataset.
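A minimal sketch of one common aggregation strategy, majority voting; the function name and toy labels are made up for illustration:

```python
from collections import Counter

def aggregate_majority(labels_per_item):
    """Combine several annotators' labels per data point into one label
    by majority vote (ties are broken arbitrarily by Counter order)."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]

# Toy example: three annotators labelled three items
annotations = [["pos", "pos", "neg"], ["neg", "neg", "neg"], ["pos", "neg", "pos"]]
print(aggregate_majority(annotations))   # ['pos', 'neg', 'pos']
```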

8
Q

What is interannotator agreement?

A

Interannotator agreement is a measure used in research to assess the level of consistency and reliability between different annotators who label or categorize the same set of data. High interannotator agreement indicates that the annotators are consistent in their evaluations, suggesting that the labeling process is reliable and the data annotations are dependable.

9
Q

Which interannotator agreement measure is best used for graded annotation

A

The mean: the mean of the annotators' ratings per item is taken as the gold standard
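A minimal sketch of that idea with made-up ratings (rows are items, columns are annotators):

```python
import numpy as np

ratings = np.array([[4, 5, 4],    # hypothetical graded ratings, e.g. on a 1-5 scale
                    [1, 2, 2],
                    [3, 3, 5]])

gold = ratings.mean(axis=1)       # per-item mean taken as the gold standard
print(gold)                       # [4.33 1.67 3.67] (rounded)
```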

10
Q

Which interannotator agreement measure should be used for classification

A

Fleiss’ Kappa

11
Q

What is Fleiss’ Kappa? Give the formula

A

Fleiss’ Kappa measures observed above-chance agreement relative to the maximum possible above-chance agreement:

κ = (P̄ − P̄ₑ) / (1 − P̄ₑ)

where P̄ is the mean observed agreement across items and P̄ₑ is the agreement expected by chance.
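A sketch of how Fleiss’ Kappa can be computed from an items-by-categories count matrix; the function and the toy ratings are illustrative, not from the lecture:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for a matrix of shape (n_items, n_categories), where
    counts[i, j] is how many annotators assigned item i to category j.
    Assumes every item was rated by the same number of annotators."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Mean observed agreement across items
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Agreement expected by chance, from overall category proportions
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 items, 3 annotators, 2 categories
print(round(fleiss_kappa([[3, 0], [2, 1], [1, 2], [0, 3]]), 3))   # 0.333
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance.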

12
Q

What is annotator disagreement

A

Annotator disagreement occurs when multiple annotators provide differing labels or classifications for the same data point. This disagreement can highlight areas of ambiguity or subjectivity in the data and can be used to identify problematic or unclear items that may require further clarification or consensus-building among annotators.

13
Q

What is KL divergence

A

KL (Kullback-Leibler) divergence is a statistical measure used to quantify the difference between two probability distributions. It indicates how much information is lost when one probability distribution is used to approximate another.
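A minimal sketch of the definition D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ / qᵢ), applied to made-up distributions (for example, a human label distribution vs. a model’s predicted distribution):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i); requires q_i > 0
    wherever p_i > 0. Note it is not symmetric in P and Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical human label distribution vs. model prediction distribution
print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))   # ~0.085
```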

14
Q

What is an interim summary

A

An interim summary is a preliminary report that provides an update on the progress, findings, and current status of a research project or study before it is completed. It typically includes key data, observations, and insights gathered up to that point, helping researchers and stakeholders assess the direction and effectiveness of the ongoing research.

15
Q

What are the 7 points in the interim summary

A
  1. Datasets
  2. Peer review
  3. Annotator agreement
  4. Human disagreement
  5. Modelling disagreement
  6. Re-calibration
  7. Read more
16
Q

What are hyperparameters

A

Settings chosen before training rather than learned from the data, e.g. learning rate, dropout rate

17
Q

What is a training regime

A

Training for multiple epochs and testing the model that achieved the highest development accuracy.
| Training set | Development set | Test set |
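A minimal sketch of the selection step, assuming the development accuracy was logged after each epoch and every epoch’s checkpoint was kept (the numbers are made up):

```python
dev_accuracies = [0.71, 0.78, 0.83, 0.81, 0.80]   # hypothetical per-epoch dev accuracy

best_epoch = max(range(len(dev_accuracies)), key=dev_accuracies.__getitem__)
print(f"evaluate the checkpoint from epoch {best_epoch} on the test set")   # epoch 2
```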

18
Q

What is a parameter sweep

A

Setting a range and step size for each hyperparameter, and training models over all combinations
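A minimal sketch of a grid sweep; the hyperparameter names, ranges, and the commented-out training call are assumptions for illustration:

```python
from itertools import product

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],   # hypothetical range
    "dropout": [0.1, 0.3, 0.5],            # hypothetical range
}

for lr, dropout in product(grid["learning_rate"], grid["dropout"]):
    # dev_acc = train_and_evaluate(lr=lr, dropout=dropout)   # assumed training helper
    print(f"training with learning_rate={lr}, dropout={dropout}")
```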

19
Q

What is seed-averaging

A

To avoid “lucky” seeds, taking the average over models trained with different random seeds (but the same parameter settings)
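A minimal sketch with made-up per-seed results:

```python
import statistics

# Hypothetical test accuracies of the same configuration trained with different seeds
accuracy_by_seed = {13: 0.842, 42: 0.855, 2024: 0.848}

mean_acc = statistics.mean(accuracy_by_seed.values())
std_acc = statistics.stdev(accuracy_by_seed.values())
print(f"{mean_acc:.3f} +/- {std_acc:.3f} over {len(accuracy_by_seed)} seeds")
```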

20
Q

What are ablation studies

A

Ablation studies are experiments where specific components or features of a model are systematically removed or altered to assess their impact on the model’s performance. This technique helps researchers understand the contribution and importance of individual parts of the model, guiding optimization and improving overall model design.

21
Q

What is the goal of an ablation study

A

The goal is to determine whether certain parameters or external factors, rather than the hypothesized ones, may have influenced the achieved performance.
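One way such a study is often organised, sketched with hypothetical component names and an assumed (commented-out) training helper:

```python
# Retrain with one component removed at a time and compare against the full model.
configurations = {
    "full model": {"use_pretrained_embeddings": True, "use_attention": True},
    "- pretrained embeddings": {"use_pretrained_embeddings": False, "use_attention": True},
    "- attention": {"use_pretrained_embeddings": True, "use_attention": False},
}

for name, config in configurations.items():
    # score = train_and_evaluate(**config)   # assumed training helper
    print(f"{name}: {config}")
```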

22
Q

What is the structure of a research proposal?

Name 6 parts.

A
  1. Background/context
  2. Research question
  3. Contributions
  4. Methodology
  5. Planning
  6. Resources
23
Q

What is the majority class label in interannotator agreement

A

The majority class label is the label or category that appears most frequently in a dataset. In the context of classification tasks, it represents the most common class among the data points. Identifying the majority class label is useful for baseline comparisons and understanding class distribution in imbalanced datasets.
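A minimal sketch of finding the majority class and the baseline accuracy it implies, using made-up labels:

```python
from collections import Counter

labels = ["ham", "spam", "ham", "ham", "spam", "ham"]   # toy label set

majority_label, count = Counter(labels).most_common(1)[0]
print(majority_label, count / len(labels))   # 'ham' and ~0.67 = majority-class baseline accuracy
```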

24
Q

Why is it important to clarify how human participants were rewarded for their annotation?

  1. Because the quality of annotation increases as the reward increases
  2. Because the reward influences the target demographic
  3. Because crowdworkers are at risk of getting exploited
  4. Because their data will influence large language models
A
  3. Because crowdworkers are at risk of getting exploited
25
Q

What are interval tasks

A

Interval tasks are tasks that involve measuring or recording data at specified intervals of time. These tasks are designed to collect information periodically, allowing researchers to analyze changes and trends over time. Interval tasks are common in longitudinal studies, time-series analysis, and various experimental designs where tracking the progression of variables is crucial.

26
Q

What is a correlation coefficient

A

A correlation coefficient is a statistical measure that quantifies the degree and direction of the linear relationship between two variables. Ranging from -1 to +1, a value of +1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation. Common types include Pearson’s correlation coefficient and Spearman’s rank correlation coefficient.
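A minimal sketch computing both coefficients with SciPy on made-up human and model scores:

```python
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]   # e.g. mean annotator ratings (hypothetical)
model_scores = [4.2, 2.5, 3.0, 1.5, 4.8]   # e.g. model predictions (hypothetical)

print(pearsonr(human_scores, model_scores))    # linear (Pearson) correlation
print(spearmanr(human_scores, model_scores))   # rank (Spearman) correlation
```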