08 Evaluation Flashcards

1
Q

Gold standard/ground truth

A

The judgment of whether a document is relevant or non-relevant to an information need.

2
Q

Development test collection

A

Used to tune the IR system. Many systems contain various weights (parameters) that can be adjusted to tune performance. But one cannot tune these parameters to maximize performance on the same collection used for evaluation. Therefore, one must have a separate development test collection. (As in machine learning, where the test set is different from the training set.)

3
Q

Precision

A

Precision = number of relevant items retrieved / number of retrieved items

4
Q

Recall

A

Recall = number of relevant items retrieved / number of relevant items
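The two definitions above can be sketched as set operations over document IDs. A minimal sketch; the document IDs and relevance judgments are illustrative, not from the source.

```python
# A minimal sketch: precision and recall computed from sets of document IDs.
# The IDs and judgments below are made-up illustrative values.

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
relevant = ["d1", "d3", "d5"]          # the gold-standard judgments

print(precision(retrieved, relevant))  # 2 relevant among 4 retrieved -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant found -> 0.666...
```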

5
Q

Accuracy

A

Accuracy = (tp + tn) / (tp + fp + fn + tn)

Why accuracy is not a good measure for IR problems: normally, non-relevant documents make up 99.9% of the collection, so one can maximize accuracy simply by deeming all documents non-relevant.

It is useful to have two numbers for evaluating an IR system (precision and recall) because one is often more important than the other.
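The degenerate "deem everything non-relevant" system can be sketched numerically; the collection size and counts are illustrative.

```python
# Sketch of why accuracy misleads in IR: with the vast majority of documents
# non-relevant, a system that retrieves nothing still scores near-perfect
# accuracy. The counts below are illustrative.

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Collection of 10,000 docs, only 10 relevant; system retrieves nothing:
tp, fp = 0, 0        # nothing retrieved, so no true or false positives
fn, tn = 10, 9990    # all 10 relevant docs missed, all non-relevant "rejected"
print(accuracy(tp, fp, fn, tn))  # 0.999 -- yet recall is 0
```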

6
Q

F measure

A

Precision and recall are often a tradeoff: recall is a non-decreasing function of the number of documents retrieved, while precision usually decreases as the number of retrieved documents increases. In general, we want some amount of recall while tolerating only a certain percentage of false positives. The F measure is the weighted harmonic mean: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). Values of beta < 1 emphasize precision, while beta > 1 emphasizes recall. One cannot use the arithmetic mean because one can always get 100% recall by returning all the documents, and therefore always get a minimum of 50% arithmetic mean.
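The harmonic-vs-arithmetic-mean argument can be checked directly. A sketch; the precision/recall values are illustrative.

```python
# Sketch of the F measure (weighted harmonic mean of precision and recall).
# The precision/recall values below are illustrative.

def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta < 1 emphasizes precision; beta > 1 emphasizes recall."""
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# A "return everything" system: recall 1.0, precision tiny.
p, r = 0.001, 1.0
print((p + r) / 2)   # arithmetic mean ~0.5 despite a useless ranking
print(f_beta(p, r))  # F1 ~0.002 -- the harmonic mean punishes the imbalance
```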

7
Q

Precision-recall curve

A

For each set of retrieved documents one can plot a precision-recall curve. These curves have a distinct sawtooth shape: if the (k + 1)th document retrieved is non-relevant, recall stays the same but precision drops; if it is relevant, both precision and recall increase, and the curve jags up and to the right.
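The sawtooth can be traced by computing (recall, precision) after each rank position. A sketch; the per-rank relevance judgments are illustrative.

```python
# Sketch: precision and recall after each rank position, showing the sawtooth.
# 1 marks a relevant document at that rank; the judgments are illustrative.
ranked = [1, 0, 1, 1, 0]   # relevance of the docs at ranks 1..5
total_relevant = 3

points = []
tp = 0
for k, rel in enumerate(ranked, start=1):
    tp += rel
    points.append((tp / total_relevant, tp / k))  # (recall, precision)

for r, p in points:
    print(f"recall={r:.2f}  precision={p:.2f}")
# Rank 2 (non-relevant): recall unchanged, precision drops 1.00 -> 0.50.
# Rank 3 (relevant): both rise -- the curve jags up and to the right.
```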

8
Q

Interpolated precision

A

A way to remove the jiggles in a precision-recall curve: the interpolated precision at a recall level r is defined as the highest precision found at any recall level r' >= r.
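The definition can be sketched over a list of (recall, precision) points; the points below are illustrative.

```python
# Sketch of interpolated precision: at recall level r, take the highest
# precision observed at any recall level >= r. The (recall, precision)
# points below are illustrative.
points = [(1/3, 1.0), (1/3, 0.5), (2/3, 2/3), (1.0, 0.75), (1.0, 0.6)]

def interpolated_precision(r, points):
    return max(p for rec, p in points if rec >= r)

print(interpolated_precision(1/3, points))  # 1.0 -- the dip to 0.5 is smoothed away
print(interpolated_precision(0.5, points))  # 0.75
```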

9
Q

Mean average precision

A

For one query: each time a relevant document is retrieved, record the precision over the top k documents at that point; the average of these precision values is the average precision for the query. Over all queries, the mean of the per-query values is the mean average precision. Each query therefore counts equally.
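The two-level averaging can be sketched over a couple of illustrative queries.

```python
# Sketch of mean average precision (MAP) over two illustrative queries.
# Each list holds per-rank relevance (1 = relevant) for one query.

def average_precision(ranked, total_relevant):
    """Average of the precision values at each rank where a relevant doc appears."""
    tp, precisions = 0, []
    for k, rel in enumerate(ranked, start=1):
        if rel:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / total_relevant

queries = [([1, 0, 1], 2), ([0, 1], 1)]  # (ranked judgments, total relevant)
ap = [average_precision(r, n) for r, n in queries]
print(sum(ap) / len(ap))  # mean over queries -- each query counts equally
```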

10
Q

Precision at K

A

Precision at k documents (P@k) is still a useful metric (e.g. P@10 corresponds to the number of relevant results on the first page of search results), but it fails to take into account the positions of the relevant documents among the top k.
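The position-blindness is easy to demonstrate: two rankings with the same relevant documents in the top k get identical P@k. The judgments below are illustrative.

```python
# Sketch of precision at k, and why it ignores positions within the top k.
def precision_at_k(ranked, k):
    return sum(ranked[:k]) / k

# Two rankings with the same two relevant docs somewhere in the top 5:
early = [1, 1, 0, 0, 0]  # relevant docs at ranks 1 and 2
late = [0, 0, 0, 1, 1]   # relevant docs at ranks 4 and 5
print(precision_at_k(early, 5))  # 0.4
print(precision_at_k(late, 5))   # 0.4 -- identical, despite the worse ordering
```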

11
Q

R precision

A

An alternative to precision at k is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.
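A minimal sketch of the definition; the ranking and |Rel| are illustrative.

```python
# Sketch of R-precision: with |Rel| known relevant docs, compute precision
# over the top |Rel| results. The judgments below are illustrative.
def r_precision(ranked, num_relevant):
    return sum(ranked[:num_relevant]) / num_relevant

ranked = [1, 0, 1, 1, 0, 0]    # per-rank relevance
print(r_precision(ranked, 3))  # |Rel| = 3 -> precision over top 3 = 2/3
```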

12
Q

Pooling

A

Given information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human judges. For a large collection, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems.
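The pool is just the union of each system's top k results. A sketch; the system names and rankings are made up.

```python
# Sketch of pooling: the set of docs to be judged is the union of the top-k
# results from several systems. System names and rankings are illustrative.
def pool(runs, k):
    judged = set()
    for ranking in runs.values():
        judged.update(ranking[:k])
    return judged

runs = {
    "bm25": ["d1", "d2", "d3", "d4"],
    "vector": ["d2", "d5", "d1", "d6"],
}
print(sorted(pool(runs, 3)))  # only these docs get human relevance judgments
```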

13
Q

Kappa statistic

A

A measure of how much agreement there is between judges on relevance judgments. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
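A sketch of the chance correction for two judges making binary relevance judgments, with chance agreement estimated from each judge's marginal yes/no rates; the counts in the 2x2 agreement table are illustrative.

```python
# Sketch of the kappa statistic for two judges and binary relevance judgments:
# kappa = (P(agree) - P(chance)) / (1 - P(chance)).
# The 2x2 agreement counts below are illustrative.
def kappa(both_yes, both_no, j1_only, j2_only):
    n = both_yes + both_no + j1_only + j2_only
    p_agree = (both_yes + both_no) / n
    # Chance agreement from each judge's marginal "yes"/"no" rates:
    p1_yes = (both_yes + j1_only) / n
    p2_yes = (both_yes + j2_only) / n
    p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(300, 70, 20, 10))  # well above 0 -> agreement beyond chance
print(kappa(10, 10, 0, 0))     # perfect agreement -> kappa = 1.0
```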
