Performance Evaluation Flashcards
What types of evaluation are there (3)
- performance
- adequacy
- diagnostic
What is performance evaluation (3)
based on a benchmark
organised around community/shared task
automated means of scoring
key points about gold standard data (3)
- time consuming and costly
- requires annotation guidelines to follow
- annotation done by experts
why do we use multiple annotators
to ensure reliability
what does the kappa coefficient measure here
inter-annotator agreement
how do we calculate kappa coefficient
(p(a) - p(e)) / (1 - p(e))
p(a) = …
observed agreement
p(a1=y, a2=y) + p(a1=n, a2=n)
p(e) = …
expected agreement
p(a1=y)p(a2=y) + p(a1=n)p(a2=n)
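A minimal sketch of the calculation above for two annotators labelling the same items with 'y' or 'n'. This two-annotator form is commonly called Cohen's kappa. The function name `kappa` and the example labels are made up for illustration.

```python
def kappa(a1, a2):
    """Kappa for two annotators who gave binary 'y'/'n' labels to the same items."""
    if len(a1) != len(a2) or not a1:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(a1)
    # p(a): observed agreement = p(a1=y, a2=y) + p(a1=n, a2=n)
    # (with only 'y'/'n' labels this is just the share of matching labels)
    p_a = sum(x == y for x, y in zip(a1, a2)) / n
    # p(e): expected agreement from each annotator's own rate of 'y' and 'n'
    p1_y = a1.count("y") / n
    p2_y = a2.count("y") / n
    p_e = p1_y * p2_y + (1 - p1_y) * (1 - p2_y)
    if p_e == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_a - p_e) / (1 - p_e)


# Hypothetical annotations of ten items
a1 = list("yyyynnnnyn")
a2 = list("yyynnnnnyy")
k = kappa(a1, a2)  # p(a) = 0.8, p(e) = 0.5, so kappa is about 0.6
```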
how do we interpret the kappa coefficient
slight < 0.2 < fair < 0.4 < moderate < 0.6 < substantial < 0.8 < almost perfect
what can we use in the non binary annotation case
Scott's pi, Fleiss' kappa
precision =
TP / (TP + FP)
recall =
TP / (TP + FN)
f1 =
2PR / (P + R)
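A short sketch of the three formulas computed from raw counts. The function name and the counts are made up for illustration. Returning 0.0 when a denominator is empty is a common convention, assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and f1 from true positive, false positive and false negative counts."""
    # Assumed convention: a score with an empty denominator is reported as 0.0
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts: 8 correct positives, 2 false alarms, 8 misses
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
# p = 0.8, r = 0.5, f1 = 8/13 (about 0.615)
```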
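why is f1 score more informative than the arithmetic mean of precision and recall
it is the harmonic mean, so it is pulled towards the lower of the two scores. It will show poor performance, e.g. if the prediction is always no (recall is 0, so f1 is taken to be 0)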
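To make the point concrete, here is a small sketch comparing the scores of two trivial classifiers on a made-up data set of 100 items with 10 positives. The counts and the function name `scores` are assumptions for illustration. Undefined precision or recall is reported as 0.0, which is a convention rather than part of the formulas.

```python
def scores(tp, fp, fn, tn):
    """Accuracy, arithmetic mean of P and R, and f1 from confusion-matrix counts."""
    # Assumed convention: 0.0 when the denominator is empty
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "arithmetic_mean": (p + r) / 2,
        "f1": 2 * p * r / (p + r) if p + r else 0.0,
    }


# Always predicts "no": high accuracy, but f1 is 0
always_no = scores(tp=0, fp=0, fn=10, tn=90)

# Always predicts "yes": P = 0.1, R = 1.0
# the arithmetic mean is 0.55, the harmonic mean (f1) is only about 0.18
always_yes = scores(tp=10, fp=90, fn=0, tn=0)
```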
when is accuracy useful
if all classes are equally important
what kinds of averages can we use when we have multiple categories
macro average, micro average
what is macro average
compute the score (e.g. f1) for each class separately, then take the unweighted average over the classes
what is micro average
pool TPs, FPs and FNs across all classes, then compute the score once. Dominated by the frequent classes, so less affected by poor scores on rare classes
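A sketch of both averages for f1, using made-up per-class counts for one frequent and one rare class. The class names, counts and function name are assumptions for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """f1 written directly in counts: 2TP / (2TP + FP + FN), equivalent to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention: 0.0 if undefined


# Hypothetical (TP, FP, FN) counts per class
counts = {
    "frequent": (90, 10, 10),
    "rare": (1, 4, 4),
}

# Macro average: score each class on its own, then average the scores
per_class = [f1_from_counts(*c) for c in counts.values()]
macro_f1 = sum(per_class) / len(per_class)

# Micro average: pool the counts over all classes, then score once
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(tp, fp, fn)
```

With these counts the macro average is 0.55 because the weak rare class counts as much as the frequent one, while the micro average is about 0.87 because the frequent class contributes most of the pooled counts.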
what is olympic judging
if there is not enough data for a gold standard, a committee of judges determines whether a proposal is relevant and close to the desired result.
not reproducible
what is adequacy evaluation
evaluation as seen by users. The factors are not easily quantifiable and are interdependent. Judging the external quality
what are some of the factors for adequacy evaluation
adaptability, integrity, efficiency, robustness, correctness, reliability, usability, accuracy
what is diagnostic evaluation
concerned with evaluation as seen by developers
what are some of the factors for diagnostic evaluation
profitability, reusability, maintainability, testability, understandability, flexibility, readability
why can't we use an NLP test suite
the range of phenomena is hard to anticipate