Week 6 - The AI Pipeline, Data Annotation, Presenting a Proposal Flashcards
What are datasheets for datasets?
Datasheets for datasets are structured documents that provide detailed information about the creation, composition, and use of a dataset. They include metadata such as the dataset’s purpose, sources, collection methods, preprocessing steps, and ethical considerations. These datasheets aim to enhance transparency, reproducibility, and responsible usage of data.
What are the 4 points in a datasheet for datasets?
- Motivation for dataset construction
- Dataset composition (recommended splits, evaluation metrics)
- Data collection process (time period, sampling strategy)
- Legal and Ethical considerations (consent)
What is the difference between annotation and extraction
We can make a distinction between unlabelled data that was merely extracted, and labelled data that was annotated (by humans)
What is crowdsourcing
Gathering and rewarding human participants is often done through dedicated crowdsourcing platforms.
What are the review checks for crowdsourcing
- Instructions: does the study include instructions?
- Recruitment: is it clear how participants were recruited?
- Consent: is it clear how consent was obtained?
- Demographics: is it clear what the participant demographics are?
What could go wrong with crowdsourcing
Certain words or labels could be over-represented because the demographics of the crowd workers are skewed, introducing bias into the annotated data.
What is annotator aggregation
Annotator aggregation is a process used in research, particularly in data labeling and machine learning, where multiple annotations or labels provided by different annotators are combined to produce a single, more accurate and reliable label for each data point. This helps to reduce individual biases and errors, improving the overall quality of the dataset.
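A minimal sketch of one common aggregation strategy, majority voting; the function name and example labels below are illustrative, not from the course material:

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one item's annotations by taking the most frequent label.
    Ties are broken by whichever label Counter counts first."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# Three annotators labelled the same item
print(majority_vote(["positive", "negative", "positive"]))  # -> "positive"
```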
What is interannotator agreement?
Interannotator agreement is a measure used in research to assess the level of consistency and reliability between different annotators who label or categorize the same set of data. High interannotator agreement indicates that the annotators are consistent in their evaluations, suggesting that the labeling process is reliable and the data annotations are dependable.
Which interannotator agreement measure is best used for graded annotation?
The mean: the mean of the annotators' ratings is taken as the gold standard.
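A minimal sketch of this aggregation, assuming each annotator gives a numeric (graded) rating per item; the array shape and values are illustrative:

```python
import numpy as np

# ratings[i, j] = rating that annotator j gave to item i (e.g. on a 1-5 scale)
ratings = np.array([
    [4, 5, 4],
    [2, 1, 2],
    [3, 3, 5],
])

# The per-item mean across annotators serves as the gold-standard label
gold = ratings.mean(axis=1)
print(gold)  # [4.33 1.67 3.67] (approximately)
```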
Which interannotator agreement measure should be used for classification?
Fleiss’ Kappa
What is Fleiss’ Kappa? Give the formula.
Fleiss’ Kappa measures chance-corrected agreement among multiple annotators on categorical labels: κ = (P̄ − P̄_e) / (1 − P̄_e), i.e. observed above-chance agreement divided by the maximum possible above-chance agreement, where P̄ is the mean observed per-item agreement and P̄_e is the agreement expected by chance.
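A minimal from-scratch sketch of this formula, assuming a count matrix where counts[i, j] is the number of annotators who assigned category j to item i; the function name and example numbers are illustrative:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) count matrix,
    where counts[i, j] = number of annotators assigning category j to item i.
    Assumes the same number of annotators rated every item."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Observed per-item agreement P_i, then its mean P_bar
    P_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()
    # Expected (chance) agreement P_e from overall category proportions
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    P_e = np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)

# 4 items, 3 annotators, 2 categories
counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(counts), 3))  # 0.333
```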
What is annotator disagreement
Annotator disagreement occurs when multiple annotators provide differing labels or classifications for the same data point. This disagreement can highlight areas of ambiguity or subjectivity in the data and can be used to identify problematic or unclear items that may require further clarification or consensus-building among annotators.
What is KL divergence
KL (Kullback-Leibler) divergence is a statistical measure used to quantify the difference between two probability distributions. It indicates how much information is lost when one probability distribution is used to approximate another.
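A minimal sketch of the discrete form, D_KL(P ‖ Q) = Σ_x P(x) log(P(x) / Q(x)); the example distributions below are illustrative (e.g. a human label distribution vs. a model's predicted distribution):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats.
    Assumes p and q are discrete distributions over the same support,
    with q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

human = [0.7, 0.2, 0.1]   # e.g. distribution of human labels
model = [0.5, 0.3, 0.2]   # e.g. model's predicted distribution
print(kl_divergence(human, model))  # ~0.085; 0 only if the distributions are identical
```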
What is an interim summary
An interim summary is a preliminary report that provides an update on the progress, findings, and current status of a research project or study before it is completed. It typically includes key data, observations, and insights gathered up to that point, helping researchers and stakeholders assess the direction and effectiveness of the ongoing research.
What are the 7 points in an interim summary?
- Datasets
- Peer review
- Annotator agreement
- Human disagreement
- Modelling disagreement
- Re-calibration
- Read more