Performance Evaluation Flashcards

1
Q

What types of evaluation are there (3)

A
  • performance
  • adequacy
  • diagnostic
2
Q

What is performance evaluation (3)

A

  • based on a benchmark
  • organised around a community/shared task
  • automated means of scoring

3
Q

key points about gold standard data (3)

A
  • time consuming and costly
  • requires annotation guidelines to follow
  • annotation done by experts
4
Q

why do we use multiple annotators

A

to ensure reliability

5
Q

what does the kappa coefficient measure here

A

inter-annotator agreement

6
Q

how do we calculate kappa coefficient

A

(p(a) - p(e)) / (1 - p(e))

7
Q

p(a) = …

A

observed agreement

p(a1=y, a2=y) + p(a1=n, a2=n)

8
Q

p(e) = …

A

expected agreement

p(a1=y)p(a2=y) + p(a1=n)p(a2=n)

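The two probabilities above plug into the kappa formula. A minimal worked sketch (the function name and label lists are illustrative, not from the source):

```python
# Cohen's kappa for two annotators on binary labels.
def cohens_kappa(a1, a2):
    n = len(a1)
    # p(a): observed agreement, the fraction of items labelled the same
    p_a = sum(x == y for x, y in zip(a1, a2)) / n
    # p(e): expected chance agreement from each annotator's marginal rates
    labels = set(a1) | set(a2)
    p_e = sum((a1.count(l) / n) * (a2.count(l) / n) for l in labels)
    return (p_a - p_e) / (1 - p_e)

# Made-up annotations: agreement on 6 of 8 items
a1 = ["y", "y", "n", "y", "n", "n", "y", "y"]
a2 = ["y", "n", "n", "y", "n", "y", "y", "y"]
print(round(cohens_kappa(a1, a2), 3))  # 0.467
```

Here p(a) = 6/8 = 0.75 and p(e) = (5/8)(5/8) + (3/8)(3/8) ≈ 0.531, so kappa ≈ 0.467, "moderate" on the usual scale.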
9
Q

how do we interpret the kappa coefficient

A

slight < 0.2 < fair < 0.4 < moderate < 0.6 < substantial < 0.8 < almost perfect

10
Q

what can we use in the non binary annotation case

A

Scott's pi, Fleiss' kappa

11
Q

precision =

A

TP / (TP + FP)

12
Q

recall =

A

TP / (TP + FN)

13
Q

f1 =

A

2PR / (P + R)

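The three formulas above can be sketched directly from binary predictions (the function name and the gold/pred lists are made up for illustration):

```python
# Precision, recall, and F1 for one positive class.
def prf1(gold, pred, positive="y"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = ["y", "y", "y", "n", "n", "n", "y", "n"]
pred = ["y", "n", "y", "n", "y", "n", "y", "n"]
p, r, f = prf1(gold, pred)  # TP=3, FP=1, FN=1 -> 0.75, 0.75, 0.75
```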
14
Q

why is the F1 score more informative than the arithmetic mean

A

it is the harmonic mean, which is dominated by the lower of precision and recall, so it exposes poor performance, e.g. a system that always predicts "no"

15
Q

when is accuracy useful

A

when all classes are equally important and reasonably balanced

16
Q

what kinds of averages can we use when we have multiple categories

A

macro average, micro average

17
Q

what is macro average

A

compute the metric for each class separately, then take the unweighted average across classes

18
Q

what is micro average

A

pool the TPs, FPs and FNs across all classes, then compute the metric once. Weights every instance equally, so it is less affected by rare classes
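The contrast between the two averages can be sketched with made-up per-class counts: a frequent class handled well and a rare class handled badly (all numbers here are illustrative):

```python
# Macro vs micro F1 over per-class (TP, FP, FN) counts.
def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

counts = {"frequent": (90, 5, 5), "rare": (1, 4, 4)}

# macro: F1 per class, then the unweighted average
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# micro: pool TPs, FPs and FNs first, then one F1
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)

print(round(macro, 3), round(micro, 2))  # 0.574 0.91
```

The rare class drags the macro average down (it counts as much as the frequent one), while the micro average stays close to the frequent class's score.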

19
Q

what is olympic judging

A

if there is not enough data for a gold standard, a committee of judges determines whether a proposal is relevant and close to the desired result.

not reproducible

20
Q

what is adequacy evaluation

A

evaluation as seen by users; the factors are not directly quantifiable and are interdependent. Judges the external quality of the system

21
Q

what are some of the factors for adequacy evaluation

A

adaptability, integrity, efficiency, robustness, correctness, reliability, usability, accuracy

22
Q

what is diagnostic evaluation

A

concerned with evaluation as seen by developers

23
Q

what are some of the factors for diagnostic evaluation

A

profitability, reusability, maintainability, testability, understandability, flexibility, readability

24
Q

why can't we use an NLP test suite

A

the range of phenomena is hard to anticipate