08 Evaluation Flashcards
Gold standard/ground truth
The decision of whether a document is relevant or non-relevant to a given information need.
Development test collection
A test collection is used to evaluate the IR system. Many systems contain various weights (parameters) that can be tuned to improve performance, but one cannot tune these parameters to maximize performance on the same collection used for evaluation. Therefore, one must have a separate development test collection (as in machine learning, where the test set is kept separate from the training set).
Precision
Precision = number of relevant retrieved items / number of retrieved items
Recall
Recall = number of relevant items retrieved / number of relevant items
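A minimal sketch of both definitions in Python, assuming retrieved and relevant are given as sets of document IDs (the IDs and numbers are illustrative):

    def precision(retrieved, relevant):
        # fraction of retrieved documents that are relevant
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        # fraction of relevant documents that were retrieved
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d2", "d4", "d7"}
    print(precision(retrieved, relevant))  # 2/4 = 0.5
    print(recall(retrieved, relevant))     # 2/3 ~ 0.67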
Accuracy
Accuracy = (tp + tn) / (tp + fp + fn + tn)
Why accuracy is not a good measure for IR problems: normally, non-relevant documents make up 99.9% of the collection, so a system can maximize accuracy simply by deeming all documents non-relevant.
It is useful to have two numbers for evaluating an IR system (precision and recall) because one is often more important than the other, depending on the application.
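A tiny numeric illustration of the accuracy point above (the numbers are made up): with 10 relevant documents in a collection of 10,000, a system that returns nothing still scores 99.9% accuracy.

    tp, fp, fn, tn = 0, 0, 10, 9990          # system deems everything non-relevant
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    print(accuracy)                           # 0.999, yet recall is 0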
F measure
Precision and recall trade off against each other: recall is a nondecreasing function of the number of documents retrieved, while precision usually decreases as the number of retrieved documents increases. In general, we want some amount of recall while tolerating only a certain percentage of false positives. The weighted F measure is F_β = (1 + β²) · P · R / (β² · P + R); values of β < 1 emphasize precision, while β > 1 emphasizes recall, and β = 1 gives the harmonic mean of precision and recall. One cannot use the arithmetic mean because one can always get 100% recall by returning all the documents, and therefore always reach an arithmetic mean of at least 50%.
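A small sketch of the weighted F measure formula above in Python (input values are illustrative):

    def f_measure(p, r, beta=1.0):
        # weighted harmonic mean of precision and recall
        if p == 0 and r == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * p * r / (b2 * p + r)

    print(f_measure(0.5, 0.67))             # F1 ~ 0.57
    print(f_measure(0.5, 0.67, beta=0.5))   # emphasizes precision, so closer to 0.5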
Precision-recall curve
For a ranked list of retrieved documents, one can plot precision against recall. These curves have a distinct sawtooth shape: if the (k + 1)th document retrieved is non-relevant, recall stays the same but precision drops; if it is relevant, both precision and recall increase, and the curve jags up and to the right.
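A sketch that computes the (recall, precision) points behind such a curve from a ranked result list, assuming one binary relevance judgment per rank (the judgments below are made up):

    def pr_points(is_relevant, num_relevant):
        # is_relevant: list of booleans, one per rank in the result list
        points, hits = [], 0
        for k, rel in enumerate(is_relevant, start=1):
            hits += rel
            points.append((hits / num_relevant, hits / k))  # (recall, precision)
        return points

    # relevant docs found at ranks 1, 3 and 6, with 4 relevant docs in total
    print(pr_points([True, False, True, False, False, True], num_relevant=4))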
Interpolated precision
Used to remove the jiggles in a precision-recall curve: the interpolated precision at a recall level r is the highest precision found for any recall level r' ≥ r.
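A sketch of that interpolation rule, reusing the (recall, precision) points from the previous sketch:

    def interpolated_precision(points, r):
        # points: list of (recall, precision) pairs; r: target recall level
        candidates = [p for rec, p in points if rec >= r]
        return max(candidates) if candidates else 0.0

    points = [(0.25, 1.0), (0.25, 0.5), (0.5, 0.667),
              (0.5, 0.5), (0.5, 0.4), (0.75, 0.5)]
    print(interpolated_precision(points, 0.5))  # 0.667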
Mean average precision
For one query: each time a relevant document is retrieved, take the precision of the top k documents at that rank, and average these values (the average precision; relevant documents that are never retrieved contribute a precision of zero). The mean average precision (MAP) is the mean of these per-query values over all queries. Therefore, each query counts equally.
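A sketch of MAP under that convention (rankings and judgments below are illustrative):

    def average_precision(is_relevant, num_relevant):
        # mean of the precision values at each rank where a relevant doc appears
        hits, total = 0, 0.0
        for k, rel in enumerate(is_relevant, start=1):
            if rel:
                hits += 1
                total += hits / k
        return total / num_relevant if num_relevant else 0.0

    def mean_average_precision(runs):
        # runs: list of (is_relevant list, number of relevant docs) per query
        return sum(average_precision(r, n) for r, n in runs) / len(runs)

    runs = [([True, False, True], 2), ([False, True, False], 1)]
    print(mean_average_precision(runs))  # each query counts equally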
Precision at K
Precision at k documents (P@k) is still a useful metric (e.g. P@10 reflects how many relevant results appear on the first search results page), but it fails to take into account the positions of the relevant documents among the top k.
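A minimal sketch of P@k on a ranked list of binary relevance judgments (judgments are made up):

    def precision_at_k(is_relevant, k):
        # fraction of the top k results that are relevant
        return sum(is_relevant[:k]) / k

    print(precision_at_k([True, False, True, False, False, True], k=3))  # 2/3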
R-precision
An alternative to precision at k is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.
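A sketch of R-precision, assuming the ranking is a list of document IDs and rel is the set of known relevant documents (IDs are illustrative):

    def r_precision(ranking, rel):
        # precision over the top |Rel| documents returned
        top_r = ranking[:len(rel)]
        return sum(1 for d in top_r if d in rel) / len(rel)

    print(r_precision(["d2", "d5", "d7", "d1"], rel={"d2", "d7", "d9"}))  # 2/3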
Pooling
Given a set of information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human judges. For a large collection, it is usual for relevance to be assessed only over a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems.
Kappa statistic
A measure of how much agreement there is between judges on relevance judgments. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement: kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of judgments on which the judges agree and P(E) is the proportion of agreement expected by chance.
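A sketch of the kappa statistic for two judges with binary relevance judgments, using pooled marginals for the chance-agreement term P(E) (the judgment lists are made up):

    def kappa(judge_a, judge_b):
        # judge_a, judge_b: parallel lists of binary relevance judgments (True/False)
        n = len(judge_a)
        p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
        # pooled marginals: overall proportion of "relevant" judgments across both judges
        p_rel = (sum(judge_a) + sum(judge_b)) / (2 * n)
        p_chance = p_rel ** 2 + (1 - p_rel) ** 2
        return (p_agree - p_chance) / (1 - p_chance)

    a = [True, True, False, False, True]
    b = [True, False, False, False, True]
    print(kappa(a, b))  # P(A) = 0.8, P(E) = 0.5, kappa = 0.6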