C2: evaluation Flashcards
Cranfield evaluation methodology
a strategy for laboratory testing of system components
- build reusable test collections
- define evaluation metrics for these collections
content of test collection:
- collection of documents similar to a real document collection in a search application
- sample set of queries or topics that simulate the user’s information need
- relevance judgments/assessments: which documents should be returned for which queries
2 types of metrics
- ranking metrics: look at the ranks of the retrieved documents; the higher the relevant documents are ranked, the better
- set metrics: cut off the top-k items and look at these as a set
2 evaluation metrics for set metrics
precision@k = # retrieved and relevant documents in top-k / # retrieved documents in top-k
recall@k = # retrieved and relevant documents in top-k / # relevant documents in the collection
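A minimal Python sketch of precision@k and recall@k under the definitions above; the names `ranked` (retrieved document ids, best first) and `relevant` (the set of judged-relevant documents for the query) are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(top_k)        # relevant-and-retrieved / retrieved in top-k

def recall_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(relevant)     # relevant-and-retrieved / all relevant docs

# Example: 2 of the top-3 results are relevant, out of 4 relevant documents.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d9", "d5", "d6"}
print(precision_at_k(ranked, relevant, 3))  # 2/3 ≈ 0.67
print(recall_at_k(ranked, relevant, 3))     # 2/4 = 0.5
```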
limitations of precision and recall for search engine evaluation
- relevance assessments tend to be incomplete => recall unknown (we don’t know all relevant documents)
- ranking is not taken into account: a document retrieved in position 50 is less useful to a user than a document retrieved in position 2
MRR
Mean Reciprocal Rank
for tasks with only one relevant item per query
RR = 1 / rank of the relevant item (or of the highest-ranked relevant item, if there is more than one)
MRR = average over a set of queries
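A minimal sketch of RR and MRR, assuming hypothetical dicts `runs` (query id → ranked list) and `answers` (query id → the single relevant item).

```python
def reciprocal_rank(ranked, relevant_item):
    for position, doc in enumerate(ranked, start=1):
        if doc == relevant_item:
            return 1.0 / position   # 1 / rank of the first relevant item
    return 0.0                      # relevant item not retrieved

def mean_reciprocal_rank(runs, answers):
    return sum(reciprocal_rank(runs[q], answers[q]) for q in runs) / len(runs)

runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
answers = {"q1": "d1", "q2": "d5"}
print(mean_reciprocal_rank(runs, answers))  # (1/2 + 1/1) / 2 = 0.75
```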
Precision-recall curve
drawn as a continuously decreasing function using interpolated precision: at each recall level r, the highest precision found at any recall level ≥ r
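A minimal sketch of this interpolation, assuming a hypothetical list of (recall, precision) points measured at each rank.

```python
def interpolated(points):
    points = sorted(points)                     # sort by recall
    best = 0.0
    result = []
    for recall, precision in reversed(points):  # sweep from high to low recall
        best = max(best, precision)             # highest precision at recall >= r
        result.append((recall, best))
    return list(reversed(result))

print(interpolated([(0.25, 0.5), (0.5, 0.67), (0.75, 0.43), (1.0, 0.5)]))
# [(0.25, 0.67), (0.5, 0.67), (0.75, 0.5), (1.0, 0.5)]
```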
AP
Average Precision
1. calculate precision at the position of each relevant retrieved document (at each point in the ranked list where recall increases)
2. sum over these precision scores
3. divide by the total number of relevant documents for the query in the collection (not only the retrieved ones)
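A minimal sketch of AP following the three steps above; `ranked` and `relevant` are hypothetical names as before.

```python
def average_precision(ranked, relevant):
    hits = 0
    precisions = []
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / position)  # precision at each relevant result
    return sum(precisions) / len(relevant)      # divide by all relevant documents

ranked = ["d2", "d7", "d9", "d4"]
relevant = {"d2", "d9", "d5"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 3 ≈ 0.56
```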
MAP
Mean Average Precision
calculate AP per query and take the mean over all queries
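MAP is then just the mean of the per-query AP values; a minimal sketch, assuming a hypothetical list `ap_scores` with one AP value per query.

```python
def mean_average_precision(ap_scores):
    return sum(ap_scores) / len(ap_scores)  # mean of AP over all queries

print(mean_average_precision([0.56, 1.0, 0.25]))  # ≈ 0.60
```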
cumulative gain
CG(L) = sum(r_i): the sum of the relevance grades r_i of the retrieved documents
when relevance judgements are multi-level
assumption: highly relevant results contribute more than slightly relevant results
discounted cumulative gain
DCG(L) = r_1 + sum_{i≥2} (r_i / log_2(i))
r_i is the relevance grade for result i
the lower in the list, the lower the probability that the user sees it
assumption: the gain of a document should degrade with its rank
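A minimal sketch of DCG with the log_2 discount above, assuming a hypothetical list `grades` holding the relevance grades r_i in rank order (r_1 is not discounted).

```python
import math

def dcg(grades):
    score = grades[0] if grades else 0.0          # r_1 is not discounted
    for i, r in enumerate(grades[1:], start=2):   # from rank 2 on: r_i / log2(i)
        score += r / math.log2(i)
    return score

print(dcg([3, 2, 3, 0, 1]))  # 3 + 2/1 + 3/1.58 + 0 + 1/2.32 ≈ 7.32
```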
normalized discounted cumulative gain
the absolute value of DCG depends on the number of relevant documents for a query, so scores are not directly comparable across queries
nDCG(L) = DCG(L) / iDCG
where iDCG is the DCG for the ideally ranked list (first all highly relevant documents, then relevant, slightly relevant, non-relevant)
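A minimal, self-contained sketch of nDCG; `grades` is again a hypothetical name, and iDCG is approximated here by re-sorting the grades of the retrieved list, whereas with complete judgments one would build the ideal ranking from all judged documents for the query.

```python
import math

def dcg(grades):
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(grades, start=1))

def ndcg(grades):
    ideal = dcg(sorted(grades, reverse=True))  # iDCG: ideally ranked list
    return dcg(grades) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # ≈ 7.32 / 7.76 ≈ 0.94
```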
TREC
Text Retrieval Conference
Goal: let teams of researchers evaluate their method on a standardized test collection for a task
Relevance assessments are collected based on the participants' submissions and can be re-used for years
assumptions about a user who uses precision@k as a metric
- the user will only view the top-k results
- the user does not care about the ranking within the top-k
- the user does not care about recall/if all the relevant results are retrieved
challenges when setting up the evaluation of a retrieval system
queries and documents:
- sufficient number of queries
- queries and documents need to be representative of real users' information needs
- sufficient relevant documents per query
relevance judgements:
- ideal: a judgment for each document in the collection for each query, but this is infeasible with real collections
- alternative: create a pool of documents per query from the top results of multiple (baseline) retrieval systems, and have only this pool judged