Week 4 - Evaluation Flashcards
What is evaluation
Synonymous with testing
assessing the extent to which a system/method produces expected outputs
What are the four dimensions of evaluation
manual vs. automatic
formative vs. summative
intrinsic vs. extrinsic
component vs. end-to-end
What is manual evaluation
involves recruitment of human subjects to assess outputs
What are the limitations of manual evaluation
human inconsistencies
difficult to control external factors
time consuming and laborious
What is automatic evaluation
data driven
requires algorithms mimicking human assessors
e.g. an evaluation script and metrics
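As an illustration of an evaluation script and metric, here is a minimal sketch that scores predictions against gold labels using accuracy. The function name `accuracy` and the toy labels are assumptions made for this example, not part of the course material.

```python
def accuracy(gold, predicted):
    """Fraction of items where the predicted label matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy data (hypothetical): four items, one of which is misclassified.
gold = ["spam", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "ham"]

print(accuracy(gold, predicted))  # 0.75
```

Because the script needs only the data and the system outputs, it can be rerun cheaply whenever the system changes.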
What is formative evaluation
occurs during development of systems
informs designers/developers whether progress has been made
usually lightweight and iterative
tends to be automatic
e.g. an automatic test run every time an improvement is incorporated into the model
What is summative evaluation
conducted after system completion; often involves human judges
assesses whether the system's goals were achieved
What is intrinsic evaluation
assessment in terms of the system's underlying task
e.g. how well does the sequence classification model perform
What is extrinsic evaluation
assessment in terms of the system's impact on an external task
e.g. how much faster can a human carry out the same task with the system (the broader issue at hand)
What is component evaluation
assessing each component of a pipeline separately
allows for isolating errors and identifying problematic components
e.g. separating preprocessing from classification
What is end-to-end evaluation
assessing all components at once
provides an indication of a system's effectiveness under real-world conditions
e.g. measuring classification performance given raw text
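The contrast between component and end-to-end evaluation can be sketched with a toy two-stage pipeline. Everything here is hypothetical and chosen only for illustration: the `preprocess` and `classify` functions, the keyword rule, and the four examples.

```python
def preprocess(text):
    # Hypothetical preprocessing: lowercase and split on whitespace.
    # It does not strip punctuation, which is a deliberate weakness.
    return text.lower().split()


def classify(tokens):
    # Hypothetical classifier: positive only if the token "good" is present.
    return "pos" if "good" in tokens else "neg"


def fraction_correct(outputs, gold):
    """Share of outputs that exactly match their gold counterpart."""
    return sum(o == g for o, g in zip(outputs, gold)) / len(gold)


# Each example: (raw text, gold tokens, gold label)
examples = [
    ("Good film", ["good", "film"], "pos"),
    ("Really good!", ["really", "good"], "pos"),
    ("Dull plot", ["dull", "plot"], "neg"),
    ("Not bad", ["not", "bad"], "pos"),
]
texts = [t for t, _, _ in examples]
gold_tokens = [tok for _, tok, _ in examples]
gold_labels = [lab for _, _, lab in examples]

# Component evaluation: each stage is scored on its own.
preprocessing_score = fraction_correct([preprocess(t) for t in texts], gold_tokens)
# The classifier is given gold tokens, so preprocessing errors cannot affect it.
classifier_score = fraction_correct([classify(tok) for tok in gold_tokens], gold_labels)

# End-to-end evaluation: raw text goes through the whole pipeline.
end_to_end_score = fraction_correct([classify(preprocess(t)) for t in texts], gold_labels)

print(preprocessing_score, classifier_score, end_to_end_score)  # 0.75 0.75 0.5
```

The end-to-end score is lower than either component score because the errors compound. The component scores show where each error comes from: the punctuation handling in preprocessing, and the keyword rule in the classifier.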
What is the issue with annotated data
Annotations are usually produced by humans
who have different perspectives and so may label the same item differently
What is annotator agreement
measured to help us decide whether we can trust the labels
2 types
What is inter-annotator agreement
agreement between human annotators
whether multiple humans consistently annotate the same item even when working independently
What is Cohen’s Kappa
a measure of chance-corrected agreement
What is P(a) (Cohen’s kappa)
the observed agreement
proportion of items on which the annotators agreed
What is P(e) (Cohen’s kappa)
the expected agreement
proportion of items on which the annotators would be expected to agree by chance
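The two quantities combine in the standard formula kappa = (P(a) - P(e)) / (1 - P(e)). This formula is not stated in the cards above; it is the usual definition for two annotators. Below is a minimal sketch of the calculation, where P(e) is estimated from each annotator's own label distribution. The function name `cohens_kappa` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # P(a): observed agreement, the proportion of items with identical labels.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(e): expected chance agreement, summed over labels, using each
    # annotator's own label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in all_labels)

    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_a - p_e) / (1 - p_e)


# Toy data (hypothetical): 10 items, the annotators agree on 8 of them.
annotator_1 = ["pos"] * 5 + ["neg"] * 5
annotator_2 = ["pos"] * 4 + ["neg"] * 5 + ["pos"]

# P(a) = 0.8 and P(e) = 0.5 * 0.5 + 0.5 * 0.5 = 0.5, so kappa is about 0.6.
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.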