Performance Evaluation Flashcards
What types of evaluation are there (3)
- performance
- adequacy
- diagnostic
What is performance evaluation (3)
- based on a benchmark
- organised around a community/shared task
- automated means of scoring
key points about gold standard data (3)
- time consuming and costly
- requires annotation guidelines to follow
- annotation done by experts
why do we use multiple annotators
to check the reliability of the annotation: agreement between independent annotators shows the labels are reproducible
what does the kappa coefficient measure here
inter-annotator agreement, corrected for the agreement expected by chance
how do we calculate the kappa coefficient
kappa = (p(a) - p(e)) / (1 - p(e))
p(a) = …
observed agreement
p(a1=y, a2=y) + p(a1=n, a2=n)
p(e) = …
expected agreement by chance
p(a1=y)p(a2=y) + p(a1=n)p(a2=n)
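A minimal sketch putting these pieces together (the function name `cohen_kappa` and the toy labels are illustrative, not from the cards):

```python
from collections import Counter

def cohen_kappa(a1, a2):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a1)
    # p(a): observed agreement — fraction of items both label identically
    p_a = sum(x == y for x, y in zip(a1, a2)) / n
    # p(e): expected chance agreement from each annotator's label distribution
    c1, c2 = Counter(a1), Counter(a2)
    p_e = sum((c1[l] / n) * (c2[l] / n) for l in set(a1) | set(a2))
    return (p_a - p_e) / (1 - p_e)

# two annotators, 10 items, binary yes/no labels
print(cohen_kappa(list("yyyynnnyyn"), list("yyynnnnyyy")))  # ~0.565
```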
how do we interpret the kappa coefficient
slight < 0.2 < fair < 0.4 < moderate < 0.6 < substantial < 0.8 < almost perfect (Landis & Koch)
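That lookup as a small sketch (the band labels follow Landis & Koch; the function name is illustrative):

```python
def interpret_kappa(k):
    """Map a kappa value to its Landis & Koch descriptive band."""
    for upper, label in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
                         (0.8, "substantial"), (1.0, "almost perfect")]:
        if k <= upper:
            return label

print(interpret_kappa(0.57))  # moderate
```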
what can we use in the non-binary annotation case
Scott's pi, Fleiss' kappa
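For Fleiss' kappa with several annotators, one option is statsmodels (a hedged sketch; it assumes the library is installed and that ratings are coded as integers):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = items, columns = annotators, values = category codes (0/1/2)
ratings = np.array([[0, 0, 1],
                    [1, 1, 1],
                    [2, 2, 1],
                    [0, 0, 0]])
table, _ = aggregate_raters(ratings)  # (items x categories) count table
print(fleiss_kappa(table))
```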
precision =
TP / (TP + FP)
recall =
TP / (TP + FN)
f1 =
2PR / (P + R)
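The three formulas together as a minimal sketch (the counts are made up and there is no zero-division guard):

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)                 # precision
    r = tp / (tp + fn)                 # recall
    return p, r, 2 * p * r / (p + r)   # F1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, ~0.667, ~0.727)
```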
why is the F1 score more informative than the arithmetic mean
it is the harmonic mean, which is dominated by the lower of precision and recall, so a degenerate predictor (e.g. one that always outputs the same label) cannot hide a bad score behind a good one
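A worked comparison (numbers made up): a classifier that labels everything positive on skewed data gets perfect recall but tiny precision, and only the harmonic mean exposes it:

```python
p, r = 0.05, 1.0            # degenerate always-"yes" classifier
print((p + r) / 2)          # arithmetic mean: 0.525, looks healthy
print(2 * p * r / (p + r))  # harmonic mean (F1): ~0.095, does not
```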
when is accuracy useful
when the classes are balanced and all equally important; with skewed classes a trivial majority-class predictor gets a deceptively high accuracy
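A quick illustration of why balance matters (counts made up): on 95 negatives and 5 positives, always predicting "no" already scores 0.95:

```python
tn, fp, fn, tp = 95, 0, 5, 0            # classifier that always predicts "no"
print((tp + tn) / (tp + tn + fp + fn))  # accuracy = 0.95, yet recall = 0
```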