08 Evaluation Flashcards
Gold standard/ground truth
The decision of whether a document is relevant or non-relevant to a given information need.
Development test collection
A test collection is used to evaluate the IR system. Many systems contain various weights (parameters) that can be tuned to improve performance, but one cannot tune these parameters to maximize performance on the same collection used for evaluation. Therefore, one must have a separate development test collection (as in machine learning, where the test set is kept separate from the training set).
Precision
Precision = number of relevant retrieved items / number of retrieved items
Recall
Recall = number of relevant items retrieved / number of relevant items
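A minimal sketch of both definitions in Python, assuming retrieved and relevant are given as sets of document IDs (the IDs and numbers are illustrative):

    def precision(retrieved, relevant):
        # fraction of retrieved documents that are relevant
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        # fraction of relevant documents that were retrieved
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d2", "d4", "d7"}
    print(precision(retrieved, relevant))  # 2/4 = 0.5
    print(recall(retrieved, relevant))     # 2/3 ~ 0.67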
Accuracy
Accuracy = (tp + tn) / (tp + fp + fn + tn)
Why accuracy is not a good measure for IR problems: normally, non-relevant documents make up 99.9% of the collection, so a system can maximize accuracy simply by deeming all documents non-relevant.
It is useful to have two numbers for evaluating an IR system (precision and recall) because one is often more important than the other, depending on the application.
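A tiny numeric illustration of the accuracy point above (the numbers are made up): with 10 relevant documents in a collection of 10,000, a system that returns nothing still scores 99.9% accuracy.

    tp, fp, fn, tn = 0, 0, 10, 9990          # system deems everything non-relevant
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    print(accuracy)                           # 0.999, yet recall is 0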
F measure
Precision and recall trade off against each other: recall is a nondecreasing function of the number of documents retrieved, while precision usually decreases as the number of retrieved documents increases. In general, we want some amount of recall while tolerating only a certain percentage of false positives. The weighted F measure is F_β = (1 + β²) · P · R / (β² · P + R); values of β < 1 emphasize precision, while β > 1 emphasizes recall, and β = 1 gives the harmonic mean of precision and recall. One cannot use the arithmetic mean because one can always get 100% recall by returning all the documents, and therefore always reach an arithmetic mean of at least 50%.
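A small sketch of the weighted F measure formula above in Python (input values are illustrative):

    def f_measure(p, r, beta=1.0):
        # weighted harmonic mean of precision and recall
        if p == 0 and r == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * p * r / (b2 * p + r)

    print(f_measure(0.5, 0.67))             # F1 ~ 0.57
    print(f_measure(0.5, 0.67, beta=0.5))   # emphasizes precision, so closer to 0.5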
Precision-recall curve
For a ranked list of retrieved documents, one can plot precision against recall. These curves have a distinct sawtooth shape: if the (k + 1)th document retrieved is non-relevant, recall stays the same but precision drops; if it is relevant, both precision and recall increase, and the curve jags up and to the right.
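A sketch that computes the (recall, precision) points behind such a curve from a ranked result list, assuming one binary relevance judgment per rank (the judgments below are made up):

    def pr_points(is_relevant, num_relevant):
        # is_relevant: list of booleans, one per rank in the result list
        points, hits = [], 0
        for k, rel in enumerate(is_relevant, start=1):
            hits += rel
            points.append((hits / num_relevant, hits / k))  # (recall, precision)
        return points

    # relevant docs found at ranks 1, 3 and 6, with 4 relevant docs in total
    print(pr_points([True, False, True, False, False, True], num_relevant=4))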
Interpolated precision
Used to remove the jiggles in a precision-recall curve: the interpolated precision at a recall level r is the highest precision found for any recall level r' ≥ r.
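A sketch of that interpolation rule, reusing the (recall, precision) points from the previous sketch:

    def interpolated_precision(points, r):
        # points: list of (recall, precision) pairs; r: target recall level
        candidates = [p for rec, p in points if rec >= r]
        return max(candidates) if candidates else 0.0

    points = [(0.25, 1.0), (0.25, 0.5), (0.5, 0.667),
              (0.5, 0.5), (0.5, 0.4), (0.75, 0.5)]
    print(interpolated_precision(points, 0.5))  # 0.667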
Mean average precision
For one query: each time a relevant document is retrieved, take the precision of the top k documents at that rank, and average these values (the average precision; relevant documents that are never retrieved contribute a precision of zero). The mean average precision (MAP) is the mean of these per-query values over all queries. Therefore, each query counts equally.
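A sketch of MAP under that convention (rankings and judgments below are illustrative):

    def average_precision(is_relevant, num_relevant):
        # mean of the precision values at each rank where a relevant doc appears
        hits, total = 0, 0.0
        for k, rel in enumerate(is_relevant, start=1):
            if rel:
                hits += 1
                total += hits / k
        return total / num_relevant if num_relevant else 0.0

    def mean_average_precision(runs):
        # runs: list of (is_relevant list, number of relevant docs) per query
        return sum(average_precision(r, n) for r, n in runs) / len(runs)

    runs = [([True, False, True], 2), ([False, True, False], 1)]
    print(mean_average_precision(runs))  # each query counts equally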
Precision at K
Precision at k documents (P@k) is still a useful metric (e.g. P@10 reflects how many relevant results appear on the first search results page), but it fails to take into account the positions of the relevant documents among the top k.
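A minimal sketch of P@k on a ranked list of binary relevance judgments (judgments are made up):

    def precision_at_k(is_relevant, k):
        # fraction of the top k results that are relevant
        return sum(is_relevant[:k]) / k

    print(precision_at_k([True, False, True, False, False, True], k=3))  # 2/3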
R-precision
An alternative to precision at k is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.
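A sketch of R-precision, assuming the ranking is a list of document IDs and rel is the set of known relevant documents (IDs are illustrative):

    def r_precision(ranking, rel):
        # precision over the top |Rel| documents returned
        top_r = ranking[:len(rel)]
        return sum(1 for d in top_r if d in rel) / len(rel)

    print(r_precision(["d2", "d5", "d7", "d1"], rel={"d2", "d7", "d9"}))  # 2/3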
Pooling
Given a set of information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human judges. For a large collection, it is usual for relevance to be assessed only over a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems.
Kappa statistic
A measure of how much agreement there is between judges on relevance judgments. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement: kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of judgments on which the judges agree and P(E) is the proportion of agreement expected by chance.
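A sketch of the kappa statistic for two judges with binary relevance judgments, using pooled marginals for the chance-agreement term P(E) (the judgment lists are made up):

    def kappa(judge_a, judge_b):
        # judge_a, judge_b: parallel lists of binary relevance judgments (True/False)
        n = len(judge_a)
        p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
        # pooled marginals: overall proportion of "relevant" judgments across both judges
        p_rel = (sum(judge_a) + sum(judge_b)) / (2 * n)
        p_chance = p_rel ** 2 + (1 - p_rel) ** 2
        return (p_agree - p_chance) / (1 - p_chance)

    a = [True, True, False, False, True]
    b = [True, False, False, False, True]
    print(kappa(a, b))  # P(A) = 0.8, P(E) = 0.5, kappa = 0.6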