Week 6 - The AI Pipeline, Data Annotation, Presenting a Proposal Flashcards

1
Q

What are datasheets for datasets

A

Datasheets for datasets are structured documents that provide detailed information about the creation, composition, and use of a dataset. They include metadata such as the dataset’s purpose, sources, collection methods, preprocessing steps, and ethical considerations. These datasheets aim to enhance transparency, reproducibility, and responsible usage of data.
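As a rough sketch (not from the lecture), the kind of metadata a datasheet records could be laid out as below; every field name and value here is hypothetical:

```python
# Purely illustrative datasheet fields for a hypothetical sentiment dataset;
# none of these names or values come from the lecture.
datasheet = {
    "motivation": "study sentiment in product reviews",
    "composition": {"size": 10_000, "splits": {"train": 0.8, "dev": 0.1, "test": 0.1}},
    "collection": {"time_period": "2022-2023", "sampling": "random sample of public reviews"},
    "legal_and_ethical": {"consent": "annotators consented via the platform terms"},
}
print(datasheet["motivation"])
```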

2
Q

What are the 4 points in the datasheets for datasets

A
  1. Motivation for dataset construction
  2. Dataset composition (recommended splits, evaluation metrics)
  3. Data collection process (time period, sampling strategy)
  4. Legal and Ethical considerations (consent)
3
Q

What is the difference between annotation and extraction

A

We can make a distinction between unlabelled data that was merely extracted, and labelled data that was annotated (by humans)

4
Q

What is crowdsourcing

A

Crowdsourcing is the practice of gathering (and rewarding) human participants to perform tasks such as annotation; it is often done through dedicated platforms

5
Q

What are the review checks for crowdsourcing

A
  1. Instructions: does the study include instructions for participants
  2. Recruitment: is it clear how participants were recruited
  3. Consent: is it clear how consent was obtained
  4. Demographics: is it clear what the participant demographic is
6
Q

What could go wrong with crowdsourcing

A

Certain words or labels could be over-represented because of the demographic make-up of the crowdworkers, biasing the resulting dataset

7
Q

What is annotator aggregation

A

Annotator aggregation is a process used in research, particularly in data labeling and machine learning, where multiple annotations or labels provided by different annotators are combined to produce a single, more accurate and reliable label for each data point. This helps to reduce individual biases and errors, improving the overall quality of the dataset.
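A minimal sketch of one common aggregation strategy, majority voting; the function name and toy labels are made up for illustration:

```python
from collections import Counter

def aggregate_majority(labels_per_item):
    """Combine several annotators' labels per data point into one label
    by majority vote (ties are broken arbitrarily by Counter order)."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]

# Toy example: three annotators labelled three items
annotations = [["pos", "pos", "neg"], ["neg", "neg", "neg"], ["pos", "neg", "pos"]]
print(aggregate_majority(annotations))   # ['pos', 'neg', 'pos']
```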

8
Q

What is interannotator agreement?

A

Interannotator agreement is a measure used in research to assess the level of consistency and reliability between different annotators who label or categorize the same set of data. High interannotator agreement indicates that the annotators are consistent in their evaluations, suggesting that the labeling process is reliable and the data annotations are dependable.

9
Q

Which interannotator agreement measure is best used for graded annotation

A

The mean: the mean of the annotators' ratings per item is taken as the gold standard
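A minimal sketch of that idea with made-up ratings (rows are items, columns are annotators):

```python
import numpy as np

ratings = np.array([[4, 5, 4],    # hypothetical graded ratings, e.g. on a 1-5 scale
                    [1, 2, 2],
                    [3, 3, 5]])

gold = ratings.mean(axis=1)       # per-item mean taken as the gold standard
print(gold)                       # [4.33 1.67 3.67] (rounded)
```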

10
Q

Which interannotator agreement measure should be used for classification

A

Fleiss’ Kappa

11
Q

What is Fleiss’ Kappa? Give the formula

A

Fleiss’ Kappa measures observed above-chance agreement relative to the maximum possible above-chance agreement:

κ = (P̄ − P̄ₑ) / (1 − P̄ₑ)

where P̄ is the mean observed agreement across items and P̄ₑ is the agreement expected by chance.
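A sketch of how Fleiss’ Kappa can be computed from an items-by-categories count matrix; the function and the toy ratings are illustrative, not from the lecture:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for a matrix of shape (n_items, n_categories), where
    counts[i, j] is how many annotators assigned item i to category j.
    Assumes every item was rated by the same number of annotators."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Mean observed agreement across items
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Agreement expected by chance, from overall category proportions
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 items, 3 annotators, 2 categories
print(round(fleiss_kappa([[3, 0], [2, 1], [1, 2], [0, 3]]), 3))   # 0.333
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance.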

12
Q

What is annotator disagreement

A

Annotator disagreement occurs when multiple annotators provide differing labels or classifications for the same data point. This disagreement can highlight areas of ambiguity or subjectivity in the data and can be used to identify problematic or unclear items that may require further clarification or consensus-building among annotators.

13
Q

What is KL divergence

A

KL (Kullback-Leibler) divergence is a statistical measure used to quantify the difference between two probability distributions. It indicates how much information is lost when one probability distribution is used to approximate another.
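A minimal sketch of the definition D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ / qᵢ), applied to made-up distributions (for example, a human label distribution vs. a model’s predicted distribution):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i); requires q_i > 0
    wherever p_i > 0. Note it is not symmetric in P and Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical human label distribution vs. model prediction distribution
print(kl_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))   # ~0.085
```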

14
Q

What is an interim summary

A

An interim summary is a preliminary report that provides an update on the progress, findings, and current status of a research project or study before it is completed. It typically includes key data, observations, and insights gathered up to that point, helping researchers and stakeholders assess the direction and effectiveness of the ongoing research.

15
Q

What are the 7 points in the interim summary

A
  1. Datasets
  2. Peer review
  3. Annotator agreement
  4. Human disagreement
  5. Modelling disagreement
  6. Re-calibration
  7. Read more
16
Q

What are hyperparameters

A

Settings chosen before training rather than learned from the data, e.g. learning rate, dropout rate

17
Q

What is a training regime

A

Training for multiple epochs and testing the model that achieved the highest development accuracy.
| Training set | Development set | Test set |
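A minimal sketch of the selection step, assuming the development accuracy was logged after each epoch and every epoch’s checkpoint was kept (the numbers are made up):

```python
dev_accuracies = [0.71, 0.78, 0.83, 0.81, 0.80]   # hypothetical per-epoch dev accuracy

best_epoch = max(range(len(dev_accuracies)), key=dev_accuracies.__getitem__)
print(f"evaluate the checkpoint from epoch {best_epoch} on the test set")   # epoch 2
```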

18
Q

What is a parameter sweep

A

Setting a range and step size for each hyperparameter, and training models over all combinations
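A minimal sketch of a grid sweep; the hyperparameter names, ranges, and the commented-out training call are assumptions for illustration:

```python
from itertools import product

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],   # hypothetical range
    "dropout": [0.1, 0.3, 0.5],            # hypothetical range
}

for lr, dropout in product(grid["learning_rate"], grid["dropout"]):
    # dev_acc = train_and_evaluate(lr=lr, dropout=dropout)   # assumed training helper
    print(f"training with learning_rate={lr}, dropout={dropout}")
```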

19
Q

What is seed-averaging

A

To avoid “lucky” seeds, taking the average over models trained with different random seeds (but the same parameter settings)
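A minimal sketch with made-up per-seed results:

```python
import statistics

# Hypothetical test accuracies of the same configuration trained with different seeds
accuracy_by_seed = {13: 0.842, 42: 0.855, 2024: 0.848}

mean_acc = statistics.mean(accuracy_by_seed.values())
std_acc = statistics.stdev(accuracy_by_seed.values())
print(f"{mean_acc:.3f} +/- {std_acc:.3f} over {len(accuracy_by_seed)} seeds")
```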

20
Q

What are ablation studies

A

Ablation studies are experiments where specific components or features of a model are systematically removed or altered to assess their impact on the model’s performance. This technique helps researchers understand the contribution and importance of individual parts of the model, guiding optimization and improving overall model design.

21
Q

What is the goal of an ablation study

A

The goal is to determine whether certain parameters or external factors, rather than the hypothesized ones, may have influenced the achieved performance.
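One way such a study is often organised, sketched with hypothetical component names and an assumed (commented-out) training helper:

```python
# Retrain with one component removed at a time and compare against the full model.
configurations = {
    "full model": {"use_pretrained_embeddings": True, "use_attention": True},
    "- pretrained embeddings": {"use_pretrained_embeddings": False, "use_attention": True},
    "- attention": {"use_pretrained_embeddings": True, "use_attention": False},
}

for name, config in configurations.items():
    # score = train_and_evaluate(**config)   # assumed training helper
    print(f"{name}: {config}")
```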

22
Q

What is the structure of a research proposal?

Name 6 parts.

A
  1. Background/context
  2. Research question
  3. Contributions
  4. Methodology
  5. Planning
  6. Resources
23
Q

What is the majority class label in interannotator agreement

A

The majority class label is the label or category that appears most frequently in a dataset. In the context of classification tasks, it represents the most common class among the data points. Identifying the majority class label is useful for baseline comparisons and understanding class distribution in imbalanced datasets.
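A minimal sketch of finding the majority class and the baseline accuracy it implies, using made-up labels:

```python
from collections import Counter

labels = ["ham", "spam", "ham", "ham", "spam", "ham"]   # toy label set

majority_label, count = Counter(labels).most_common(1)[0]
print(majority_label, count / len(labels))   # 'ham' and ~0.67 = majority-class baseline accuracy
```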

24
Q

Why is it important to clarify how human participants were rewarded for their annotation?

  1. Because the quality of annotation increases as the reward increases
  2. Because the reward influences the target demographic
  3. Because crowdworkers are at risk of getting exploited
  4. Because their data will influence large language models
A
  3. Because crowdworkers are at risk of getting exploited
25
Q

What are interval tasks

A

Interval tasks are tasks that involve measuring or recording data at specified intervals of time. These tasks are designed to collect information periodically, allowing researchers to analyze changes and trends over time. Interval tasks are common in longitudinal studies, time-series analysis, and various experimental designs where tracking the progression of variables is crucial.

26
Q

What is a correlation coefficient

A

A correlation coefficient is a statistical measure that quantifies the degree and direction of the linear relationship between two variables. Ranging from -1 to +1, a value of +1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation. Common types include Pearson’s correlation coefficient and Spearman’s rank correlation coefficient.
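A minimal sketch computing both coefficients with SciPy on made-up human and model scores:

```python
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]   # e.g. mean annotator ratings (hypothetical)
model_scores = [4.2, 2.5, 3.0, 1.5, 4.8]   # e.g. model predictions (hypothetical)

print(pearsonr(human_scores, model_scores))    # linear (Pearson) correlation
print(spearmanr(human_scores, model_scores))   # rank (Spearman) correlation
```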