C2: evaluation Flashcards
Cranfield evaluation methodology
a strategy for laboratory testing of system components
- build reusable test collections
- define evaluation metrics for these collections
content of test collection:
- collection of documents similar to a real document collection in a search application
- sample set of queries or topics that simulate the user’s information need
- relevance judgments/assessments: which documents should be returned for which queries
2 types of metrics
- ranking metrics: look at the ranks of the retrieved documents; the higher the relevant documents are ranked, the better
- set metrics: cut off the top-k items and look at these as a set
2 evaluation metrics for set metrics
precision@k = # retrieved and relevant documents in top-k / # retrieved documents in top-k
recall@k = # retrieved and relevant documents in top-k / # relevant documents in the collection
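A minimal Python sketch of precision@k and recall@k under the definitions above; the names `ranked` (retrieved document ids, best first) and `relevant` (the set of judged-relevant documents for the query) are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(top_k)        # relevant-and-retrieved / retrieved in top-k

def recall_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(relevant)     # relevant-and-retrieved / all relevant docs

# Example: 2 of the top-3 results are relevant, out of 4 relevant documents.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d9", "d5", "d6"}
print(precision_at_k(ranked, relevant, 3))  # 2/3 ≈ 0.67
print(recall_at_k(ranked, relevant, 3))     # 2/4 = 0.5
```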
limitations of precision and recall for search engine evaluation
- relevance assessments tend to be incomplete => recall unknown (we don’t know all relevant documents)
- ranking is not taken into account: a document retrieved in position 50 is less useful to a user than a document retrieved in position 2
MRR
Mean Reciprocal Rank
for tasks with only one relevant item per query
RR = 1 / rank of the relevant item (or of the highest-ranked relevant item, if there is more than one)
MRR = average over a set of queries
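A minimal sketch of RR and MRR, assuming hypothetical dicts `runs` (query id → ranked list) and `answers` (query id → the single relevant item).

```python
def reciprocal_rank(ranked, relevant_item):
    for position, doc in enumerate(ranked, start=1):
        if doc == relevant_item:
            return 1.0 / position   # 1 / rank of the first relevant item
    return 0.0                      # relevant item not retrieved

def mean_reciprocal_rank(runs, answers):
    return sum(reciprocal_rank(runs[q], answers[q]) for q in runs) / len(runs)

runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
answers = {"q1": "d1", "q2": "d5"}
print(mean_reciprocal_rank(runs, answers))  # (1/2 + 1/1) / 2 = 0.75
```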
Precision-recall curve
drawn as a continuously decreasing function using interpolated precision: at each recall level r, the highest precision found at any recall level ≥ r
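A minimal sketch of this interpolation, assuming a hypothetical list of (recall, precision) points measured at each rank.

```python
def interpolated(points):
    points = sorted(points)                     # sort by recall
    best = 0.0
    result = []
    for recall, precision in reversed(points):  # sweep from high to low recall
        best = max(best, precision)             # highest precision at recall >= r
        result.append((recall, best))
    return list(reversed(result))

print(interpolated([(0.25, 0.5), (0.5, 0.67), (0.75, 0.43), (1.0, 0.5)]))
# [(0.25, 0.67), (0.5, 0.67), (0.75, 0.5), (1.0, 0.5)]
```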
AP
Average Precision
1. calculate precision at the position of each relevant retrieved document (at each point in the ranked list where recall increases)
2. sum over these precision scores
3. divide by the total number of relevant documents for the query in the collection (not only the retrieved ones)
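A minimal sketch of AP following the three steps above; `ranked` and `relevant` are hypothetical names as before.

```python
def average_precision(ranked, relevant):
    hits = 0
    precisions = []
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / position)  # precision at each relevant result
    return sum(precisions) / len(relevant)      # divide by all relevant documents

ranked = ["d2", "d7", "d9", "d4"]
relevant = {"d2", "d9", "d5"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 3 ≈ 0.56
```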
MAP
Mean Average Precision
calculate AP per query and take the mean over all queries
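MAP is then just the mean of the per-query AP values; a minimal sketch, assuming a hypothetical list `ap_scores` with one AP value per query.

```python
def mean_average_precision(ap_scores):
    return sum(ap_scores) / len(ap_scores)  # mean of AP over all queries

print(mean_average_precision([0.56, 1.0, 0.25]))  # ≈ 0.60
```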
cumulative gain
CG(L) = sum(r_i): the sum of the relevance grades r_i of the retrieved documents
when relevance judgements are multi-level
assumption: highly relevant results contribute more than slightly relevant results
discounted cumulative gain
DCG(L) = r_1 + sum_{i≥2} (r_i / log_2(i))
r_i is the relevance grade for result i
the lower in the list, the lower the probability that the user sees it
assumption: the gain of a document should degrade with its rank
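A minimal sketch of DCG with the log_2 discount above, assuming a hypothetical list `grades` holding the relevance grades r_i in rank order (r_1 is not discounted).

```python
import math

def dcg(grades):
    score = grades[0] if grades else 0.0          # r_1 is not discounted
    for i, r in enumerate(grades[1:], start=2):   # from rank 2 on: r_i / log2(i)
        score += r / math.log2(i)
    return score

print(dcg([3, 2, 3, 0, 1]))  # 3 + 2/1 + 3/1.58 + 0 + 1/2.32 ≈ 7.32
```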
normalized discounted cumulative gain
the absolute value of DCG depends on the number of relevant documents for a query, so scores are not directly comparable across queries
nDCG(L) = DCG(L) / iDCG
where iDCG is the DCG for the ideally ranked list (first all highly relevant documents, then relevant, slightly relevant, non-relevant)
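A minimal, self-contained sketch of nDCG; `grades` is again a hypothetical name, and iDCG is approximated here by re-sorting the grades of the retrieved list, whereas with complete judgments one would build the ideal ranking from all judged documents for the query.

```python
import math

def dcg(grades):
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(grades, start=1))

def ndcg(grades):
    ideal = dcg(sorted(grades, reverse=True))  # iDCG: ideally ranked list
    return dcg(grades) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # ≈ 7.32 / 7.76 ≈ 0.94
```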
TREC
Text Retrieval Conference
Goal: let teams of researchers evaluate their method on a standardized test collection for a task
Relevance assessments are collected based on the participants' submissions and can be re-used for years
assumptions about a user who uses precision@k as a metric
- the user will only view the top-k results
- the user does not care about the ranking within the top-k
- the user does not care about recall/if all the relevant results are retrieved
challenges when setting up the evaluation of a retrieval system
queries and documents:
- sufficient number of queries
- queries and documents need to be representative of real users' information needs
- sufficient relevant documents per query
relevance judgements:
- ideal: a judgment for each document in the collection for each query, but this is infeasible with real collections
- alternative: create a pool of documents per query from the top results of multiple (baseline) retrieval systems, and have only this pool judged