C2: evaluation Flashcards

1
Q

Cranfield evaluation methodology

A

a strategy for laboratory testing of system components

  • build reusable test collections
  • define evaluation metrics for these collections

content of test collection:
- collection of documents similar to a real document collection in a search application
- sample set of queries or topics that simulate the user’s information need
- relevance judgments/assessments: which documents should be returned for which queries
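
A minimal sketch of what such a test collection could look like as plain Python data structures; the names docs, queries and qrels are illustrative (qrels is the usual TREC-style name for the relevance judgments), and the contents are made up.

    # A toy Cranfield-style test collection (illustrative data, not a real collection)
    docs = {
        "d1": "information retrieval evaluation with test collections",
        "d2": "deep learning for image classification",
        "d3": "building reusable test collections for search",
    }

    queries = {
        "q1": "how to evaluate a search engine",
    }

    # qrels: relevance judgments per query; binary here (1 = relevant, 0 = not relevant)
    qrels = {
        "q1": {"d1": 1, "d2": 0, "d3": 1},
    }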

2
Q

2 types of metrics

A
  1. ranking metrics: look at the ranks of the retrieved documents; the higher the relevant documents are ranked, the better
  2. set metrics: cut off the top-k items and evaluate these as a set
3
Q

2 examples of set metrics

A

treat the top-k retrieved documents as a set

precision@k = # relevant documents in the top-k / k
(k = the number of retrieved documents in the cut-off)

recall@k = # relevant documents in the top-k / # relevant documents in the collection
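
A minimal sketch of both set metrics in Python, assuming a ranked list of document ids and a set of relevant ids (the example data is made up):

    def precision_at_k(ranking, relevant, k):
        # fraction of the top-k retrieved documents that are relevant
        top_k = ranking[:k]
        return sum(1 for doc in top_k if doc in relevant) / k

    def recall_at_k(ranking, relevant, k):
        # fraction of all relevant documents that appear in the top-k
        top_k = ranking[:k]
        return sum(1 for doc in top_k if doc in relevant) / len(relevant)

    ranking = ["d3", "d2", "d1", "d5", "d4"]   # system output, best first
    relevant = {"d1", "d3", "d6"}              # judged relevant documents for the query
    print(precision_at_k(ranking, relevant, 3))  # 2/3
    print(recall_at_k(ranking, relevant, 3))     # 2/3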

4
Q

limitations of precision and recall for search engine evaluation

A
  • relevance assessments tend to be incomplete => recall unknown (we don’t know all relevant documents)
  • ranking is not taken into account: a document retrieved in position 50 is less useful to a user than a document retrieved in position 2
5
Q

MRR

A

Mean Reciprocal Rank
for tasks with only one relevant item per query

RR = 1 / rank of the relevant item (or of the highest-ranked relevant item, if there is more than one)

MRR = the average of RR over a set of queries
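
A minimal sketch of RR and MRR in Python; the rankings and relevant items are made-up examples:

    def reciprocal_rank(ranking, relevant):
        # 1 / rank of the highest-ranked relevant item (0 if none is retrieved)
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                return 1 / rank
        return 0.0

    def mean_reciprocal_rank(rankings, relevant_per_query):
        # average RR over a set of queries
        return sum(reciprocal_rank(r, rel)
                   for r, rel in zip(rankings, relevant_per_query)) / len(rankings)

    rankings = [["d2", "d1", "d3"], ["d5", "d6", "d4"]]
    relevant = [{"d1"}, {"d4"}]
    print(mean_reciprocal_rank(rankings, relevant))  # (1/2 + 1/3) / 2 ≈ 0.42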

6
Q

Precision-recall curve

A

drawn as a continuously decreasing function, using interpolated precision: at each recall level, the highest precision found at that recall level or any higher recall level
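
A minimal sketch of the interpolation step in Python, under the usual definition that interpolated precision at recall r is the maximum precision at any recall >= r; the (recall, precision) points are made up:

    def interpolate(points):
        # points: (recall, precision) pairs measured along the ranking
        points = sorted(points)
        interpolated = []
        best = 0.0
        # walk from the highest recall level down, carrying the best precision seen so far
        for recall, precision in reversed(points):
            best = max(best, precision)
            interpolated.append((recall, best))
        return list(reversed(interpolated))

    points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.33)]
    print(interpolate(points))  # precision never increases as recall grows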

7
Q

AP

A

Average Precision
1. calculate precision at the position of each relevant retrieved document (at each point in the ranked list where recall increases)
2. sum over these precision scores
3. divide by the total number of relevant documents in the collection
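
A minimal sketch of average precision following the three steps above; the ranking and judgments are made up:

    def average_precision(ranking, relevant):
        # precision at each rank where a relevant document is retrieved,
        # summed and divided by the total number of relevant documents
        hits = 0
        precision_sum = 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant)

    ranking = ["d1", "d4", "d2", "d5", "d3"]
    relevant = {"d1", "d2", "d3"}
    print(average_precision(ranking, relevant))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76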

8
Q

MAP

A

Mean Average Precision
calculate AP per query and take the mean over all queries
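
A minimal sketch of MAP, reusing a compact average precision helper so the example stays self-contained; the per-query data is made up:

    def average_precision(ranking, relevant):
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant)

    def mean_average_precision(rankings, relevant_per_query):
        # AP per query, then the mean over all queries
        return sum(average_precision(r, rel)
                   for r, rel in zip(rankings, relevant_per_query)) / len(rankings)

    rankings = [["d1", "d2", "d3"], ["d4", "d6", "d5"]]
    relevant = [{"d1", "d3"}, {"d5"}]
    print(mean_average_precision(rankings, relevant))  # ((1 + 2/3)/2 + 1/3) / 2 ≈ 0.58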

9
Q

cumulative gain

A

sum of the relevance judgements of the retrieved documents:
CG(L) = sum_i r_i, where r_i is the relevance grade of the result at rank i

used when relevance judgements are multi-level (graded)

assumption: highly relevant results contribute more than slightly relevant results
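
A minimal sketch of cumulative gain over graded judgements; the grades are illustrative (e.g. 0 = non-relevant up to 3 = highly relevant):

    def cumulative_gain(grades):
        # grades: relevance grade r_i of the result at each rank
        return sum(grades)

    print(cumulative_gain([3, 2, 0, 1]))  # 6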

10
Q

discounted cumulative gain

A

DCG(L) = r_1 + sum_{i=2}^{k} r_i / log_2(i)
where r_i is the relevance grade of the result at rank i

the lower a result appears in the list, the lower the probability that the user sees it

assumption: the gain of a document should degrade with its rank
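
A minimal sketch of DCG with the formula above (the first result is undiscounted, later ranks are divided by log2 of the rank); the grades are illustrative:

    import math

    def dcg(grades):
        # grades[0] counts fully; the gain at rank i >= 2 is divided by log2(i)
        if not grades:
            return 0.0
        return grades[0] + sum(g / math.log2(i)
                               for i, g in enumerate(grades[1:], start=2))

    print(dcg([3, 2, 3, 0, 1]))  # 3 + 2/1 + 3/log2(3) + 0 + 1/log2(5) ≈ 7.32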

11
Q

normalized discounted cumulative gain

A

the value of DCG depends on the number of relevant documents for a query, so it is normalized against the best possible ranking

nDCG(L) = DCG(L) / iDCG
where iDCG is the DCG of the ideally ranked list (first all highly relevant documents, then relevant, then slightly relevant, then non-relevant)
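
A minimal sketch of nDCG: the DCG of the system ranking divided by the DCG of the ideally ordered list; the grades are illustrative:

    import math

    def dcg(grades):
        if not grades:
            return 0.0
        return grades[0] + sum(g / math.log2(i)
                               for i, g in enumerate(grades[1:], start=2))

    def ndcg(grades):
        # normalize by the DCG of the ideally ranked list (highest grades first)
        ideal = dcg(sorted(grades, reverse=True))
        return dcg(grades) / ideal if ideal > 0 else 0.0

    print(ndcg([3, 2, 3, 0, 1]))  # dcg([3, 2, 3, 0, 1]) / dcg([3, 3, 2, 1, 0]) ≈ 0.94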

12
Q

TREC

A

Text Retrieval Conference
Goal: let teams of researchers evaluate their methods on a standardized test collection for a task

Relevance assessments are collected based on the participants' submissions and can be re-used for years

13
Q

assumptions about a user who uses precision@k as a metric

A
  • the user will only view the top-k results (e.g. the top 10)
  • the user does not care about the ranking within the top-k
  • the user does not care about recall, i.e. whether all relevant results are retrieved
14
Q

challenges when setting up the evaluation of a retrieval system

A

queries and documents:
- a sufficient number of queries
- queries and documents need to be representative of real users' information needs
- a sufficient number of relevant documents per query

relevance judgements:
- ideal: a judgement for each document in the collection for each query, but this is infeasible with real collections
- alternative: create a pool of documents per query, retrieved by multiple (baseline) retrieval systems, and have only those judged (sketched below)
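
A minimal sketch of pooling: take the union of the top-k results of several (baseline) systems for a query and judge only that pool; the system names, rankings and pool depth are made up:

    def build_pool(runs, k):
        # runs: {system name: ranked list of doc ids} for one query
        # the pool is the union of each system's top-k results
        pool = set()
        for ranking in runs.values():
            pool.update(ranking[:k])
        return pool

    runs = {
        "bm25":   ["d1", "d2", "d3", "d4"],
        "neural": ["d3", "d5", "d1", "d6"],
    }
    print(build_pool(runs, k=3))  # {'d1', 'd2', 'd3', 'd5'} (set order may vary); only these are judged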
