Week 4 - Evaluation Flashcards

1
Q

What is evaluation

A

Synonymous with testing
assessing the extent to which a system/method produces the expected outputs

2
Q

What are the four dimensions of evaluation

A

manual vs automatic
formative vs summative
intrinsic vs extrinsic
component vs end-to-end

3
Q

What is manual evaluation

A

involves recruitment of human subjects to assess outputs

4
Q

limitations of manual evaluation

A

human inconsistencies
difficult to control external factors
time-consuming and laborious

5
Q

What is automatic evaluation

A

data-driven
requires algorithms that mimic human assessors
e.g. evaluation scripts and metrics

6
Q

What is formative evaluation

A

occurs during development of a system
informs designers/developers whether progress has been made
usually lightweight and iterative
tends to be automatic

e.g. an automatic test run every time an improvement to the model is incorporated

7
Q

What is summative evaluation

A

conducted after system completion; often involves human judges
assesses whether the system's goals were achieved

8
Q

What is intrinsic evaluation

A

assessment in terms of the system's underlying task

e.g. how well does the sequence classification model perform

9
Q

What is extrinsic evaluation

A

assessment in terms of the system's impact on an external task

e.g. how much faster is a human able to carry out the same task - addresses the broader issue at hand

10
Q

What is component evaluation

A

assessing each of the components comprising a pipeline
allows for isolating errors and identifying problematic components

e.g. separating preprocessing from classification

11
Q

What is end-to-end evaluation

A

assessing all components at once
provides an indication of a system's effectiveness under real-world conditions

e.g. measuring classification performance given raw text

12
Q

What is the issue with annotated data

A

annotations usually have to come from humans,
who have different perspectives

13
Q

What is annotator agreement

A

measured to help us decide whether we can trust the labels
there are two types

14
Q

What is inter-annotator agreement

A

agreement between human annotators
whether multiple humans consistently annotate the same item even when working independently

15
Q

What is Cohen’s Kappa

A

A measure of chance-corrected agreement: κ = (P(a) - P(e)) / (1 - P(e))

16
Q

What is P(a) (Cohens kappa)

A

the observed agreement
proportion of times annotators agreed

17
Q

What is P(e) (Cohens kappa)

A

expected agreement
proportion of times the annotators are expected to agree by chance

18
Q

What is P(e) for binary classification

A

P(e) = P(A1=Yes)P(A2=Yes) + P(A1=No)P(A2=No)
assuming annotators A1 and A2 are independent (see sketch below)
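
A minimal sketch of this computation in Python, assuming two annotators' binary labels are given as parallel lists (the labels and variable names here are illustrative, not from the lecture):

```python
# Cohen's kappa for two annotators with binary (Yes/No) labels.
a1 = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]   # annotator 1 (made up)
a2 = ["Yes", "No", "No",  "Yes", "No", "Yes", "Yes", "No"]  # annotator 2 (made up)
n = len(a1)

# P(a): observed agreement = proportion of items the annotators agree on.
p_a = sum(x == y for x, y in zip(a1, a2)) / n

# P(e): expected chance agreement, assuming A1 and A2 are independent:
# P(A1=Yes)P(A2=Yes) + P(A1=No)P(A2=No)
p1_yes = a1.count("Yes") / n
p2_yes = a2.count("Yes") / n
p_e = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)

# Chance-corrected agreement.
kappa = (p_a - p_e) / (1 - p_e)
print(round(p_a, 2), round(p_e, 2), round(kappa, 2))   # 0.75 0.5 0.5 for these lists
```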

19
Q

What does negative kappa mean

A

systematic disagreement (observed agreement below that expected by chance)

20
Q

what does 0 kappa mean

A

agreement no better than chance (observed agreement equals the agreement expected by chance)

21
Q

what does positive kappa mean

A

agreement, but the strength bands are defined differently by different researchers
slight < 0.2 < fair < 0.4 < moderate < 0.6 < substantial < 0.8 < perfect

22
Q

What is F-score

A

used for NER
the annotations from one of the annotators are treated as the gold standard (reference)
the annotations from the other annotator are treated as the response and measured against the reference

23
Q

what is evaluation metric also known as

A

evaluation measure
figure of merit

24
Q

what is baseline

A

the result we want to improve upon

25
Q

what is the gold standard

A

= reference = ground truth
what is considered correct

26
Q

What is system output

A

= response = predictions
what we are evaluating

27
Q

What is item

A

unit of analysis
e.g. a sequence, sequence pair, token, or span

28
Q

what is accuracy (IAA)

A

ratio of matches between response and reference to the total number of items
accuracy = Σ agr_i / n
or: number correct / n
or: (TP + TN) / all items

where agr_i = 1 if response and reference match on item i, and 0 if they do not (see sketch below)
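
A minimal sketch, assuming reference and response labels are given as parallel lists (the labels are made up):

```python
# Accuracy as the proportion of items where response matches reference.
reference = ["pos", "neg", "neg", "pos", "neg"]   # gold labels (made up)
response  = ["pos", "neg", "pos", "pos", "neg"]   # system/annotator labels (made up)

agr = [1 if ref == res else 0 for ref, res in zip(reference, response)]
accuracy = sum(agr) / len(agr)   # = number correct / n
print(accuracy)                  # 0.8 for these lists
```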

29
Q

What is accuracy from confusion matrix

A

(TP + TN) / (TP + FN + FP + TN)

30
Q

What is the limitation of accuracy

A

suitable only for balanced data

e.g. if an annotator (the response) only ever marks No and the data is dominated by negatives, accuracy will still come out high

31
Q

What is precision

A

proportion of labelled (predicted) items that are correct
P = TP / (TP + FP)

32
Q

What is recall

A

proportion of correct (reference) items that are labelled
R = TP / (TP + FN)

33
Q

What is f score

A

weighted harmonic mean of precision and recall
Fβ = (β² + 1)PR / (β²P + R)

typically β = 1, hence
F1 = 2PR / (P + R)   (see sketch below)
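
A short sketch computing precision, recall and F-beta from raw counts (the counts are made up):

```python
# Precision, recall and F-beta from TP/FP/FN counts.
tp, fp, fn = 40, 10, 20          # made-up counts

precision = tp / (tp + fp)       # 0.8
recall = tp / (tp + fn)          # ~0.67
beta = 1.0                       # beta > 1 puts more emphasis on recall
f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
print(round(f_beta, 3))          # ~0.727; equals F1 since beta = 1
```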

34
Q

What do bigger values of β mean

A

more emphasis on recall

35
Q

When is macro evaluation used

A

when we want to measure performance over all categories

36
Q

What is ‘support’

A

number of instances of that class in the reference
e.g. in the gold standard there are 83 entities labelled as Person
Person support = 83

37
Q

What is macro averaging

A

simply calculate the average of P, R and F-score
over categories
(sum the per-category precisions and divide by the number of categories, and likewise for R and F-score)

gives one average value each for P, R and F-score (see sketch below)
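
A minimal sketch, assuming per-category precision values have already been computed (the categories and numbers are made up):

```python
# Macro-averaged precision: unweighted mean over categories.
per_class_precision = {"PERSON": 0.94, "LOC": 0.87, "ORG": 0.70}  # made-up values

macro_p = sum(per_class_precision.values()) / len(per_class_precision)
print(round(macro_p, 3))   # every class counts equally, regardless of support
```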

38
Q

What is weighed macro averaging

A

sum up the per-category products of metric value and weight
where the weight is n(support) / n(total support)

e.g. for precision, instead of just summing and dividing by the number of categories,
the precision value for each category is multiplied by the fraction of the total support it covers:

0.94 × (111/133) + 0.87 × (22/133)
where 0.94 and 0.87 are the per-category precisions
and 111 and 22 are the supports (total 133)

(accounts for class imbalance; see sketch below)
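
A short sketch reusing the numbers from the card above:

```python
# Weighted macro-averaged precision: weight each category by its share of the support.
precisions = [0.94, 0.87]    # per-category precision (from the card above)
supports   = [111, 22]       # per-category support in the reference
total = sum(supports)        # 133

weighted_p = sum(p * (s / total) for p, s in zip(precisions, supports))
print(round(weighted_p, 3))  # 0.94*(111/133) + 0.87*(22/133) ~= 0.928
```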

39
Q

What is micro averaging

A

ignore the per-category precision, recall and F-score values
pool together the TPs, FPs and FNs across all categories
and compute the metrics from the pooled counts

e.g. micro-averaged precision =
total TP (across all categories) / (total TP + total FP)   (see sketch below)
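
A minimal sketch, assuming per-category TP/FP counts are available (the counts are made up):

```python
# Micro-averaged precision: pool the counts across categories first.
per_class_counts = {                      # made-up counts
    "PERSON": {"tp": 100, "fp": 11},
    "LOC":    {"tp": 18,  "fp": 4},
}

total_tp = sum(c["tp"] for c in per_class_counts.values())
total_fp = sum(c["fp"] for c in per_class_counts.values())
micro_p = total_tp / (total_tp + total_fp)
print(round(micro_p, 3))                  # 118 / 133 ~= 0.887
```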

40
Q

Which metric: if all classes are equally important

A

macro averaging

41
Q

Which metric: if majority class is more important

A

weighted macro averaging

42
Q

Which metric: otherwise

A

micro averaging

e.g. the majority class for hate speech detection is Non-hate
but the important class is Hate

i.e. the data is unbalanced and the majority class is not the one we care about

43
Q

What is Exact Match

A

the percentage of predictions exactly matching the ground-truth answers

e.g. response = "cloud", reference = "within a cloud"
EM = 0 for that item (see sketch below)
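
A minimal sketch using the card's example plus two made-up items:

```python
# Exact Match: percentage of predictions identical to the ground-truth answer.
references = ["within a cloud", "1947", "Paris"]   # last two items are made up
responses  = ["cloud", "1947", "Paris"]

em = 100 * sum(res == ref for res, ref in zip(responses, references)) / len(references)
print(round(em, 1))   # 66.7: "cloud" != "within a cloud", so that item scores 0
```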

44
Q

Why is accuracy not used for a lot of tasks

A

usually only used for sequence and pairwise sequence classification
this is because data for other tasks is often imbalanced
use F-score instead