W2 Evaluation Flashcards
1) What are we evaluating for an IR system?
2) What are the evaluation metrics for IR?
1) the ranking and the relevance
2) there are 2 types of metrics:
A. Set metrics: treat the top-k retrieved items as a set:
Precision@k, Recall@k, F1 score
For a given query we have: the set of relevant documents (relevant/non-relevant) and the set of retrieved documents up to a certain rank k
B. Ranked Lists
MRR (one relevant item per query/only the highest ranked relevant result counts)
MAP (Mean Average Precision): summarizes the precision-recall curve
These are used for binary relevance (relevant - non-relevant)
How do you calculate precision?
Precision@k = (relevant docs in the top-k) / (retrieved docs in the top-k, i.e. k)
What’s the equation for the F1-score?
F1 = 2 * (precision * recall) / (precision + recall)
Give the equation for MRR (for 1 AND more relevant items)
RR (Reciprocal Rank) = 1/rank of relevant item or 1/rank of highest ranked relevant item
MRR (Mean Reciprocal Rank) = average over a set of queries
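A minimal sketch of RR/MRR in Python, assuming each ranked list is given as binary relevance labels (1 = relevant, 0 = not relevant); the function names are just for illustration:

```python
def reciprocal_rank(relevance):
    """Return 1/rank of the highest-ranked relevant item, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over a set of queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)

# Example: query 1 has its first relevant result at rank 2, query 2 at rank 1.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (0.5 + 1.0) / 2 = 0.75
```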
What’s the equation for recall?
Recall@k = (relevant docs in the top-k) / (all relevant docs in the collection)
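A small sketch of the set metrics above, assuming binary relevance; the ranked list is a list of document ids and the relevant documents are given as a set (function names are illustrative):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked_ids, relevant_ids, k)
    r = recall_at_k(ranked_ids, relevant_ids, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Example: 2 of the top-3 results are relevant; 3 relevant docs exist in total.
ranked = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}
print(precision_at_k(ranked, relevant, 3))  # 2/3
print(recall_at_k(ranked, relevant, 3))     # 2/3
```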
What are the limitations of precision and recall for search engine evaluation?
- Relevance assessments tend to be incomplete -> recall is unknown
- Ranking is not taken into account: position 2 is more important than position 10
For multi-level relevance, we can use:
1. cumulative gain. What is that?
2. Discounted Cumulative Gain (DCG). How to calculate it?
3. Normalized Discounted Cumulative Gain (nDCG). How to calculate it?
- The sum of the relevance judgments of the retrieved documents:
CG@n = \sum_{i=1}^{n} r_i
where n is the number of results in the ranked list that we consider and r_i is the relevance grade of result i
- The gain of a doc degrades with its rank: the lower it is ranked, the lower the probability that the user sees it
DCG@n = r_1 + \sum_{i=2}^{n} \frac{r_i}{\log_2 i}
- Since different queries have different scales, it is best to normalize to a 0-1 scale so that queries can be compared.
nDCG = DCG/iDCG
iDCG = the DCG for the ideally ranked list (first all highly-relevant, then relevant, then non-relevant)
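A sketch of CG, DCG and nDCG for graded relevance, following the formulas above (DCG = r_1 + \sum_{i \ge 2} r_i / \log_2 i); it assumes the input is a non-empty list of relevance grades in ranked order:

```python
import math

def cg(grades):
    """Cumulative gain: sum of the relevance grades of the retrieved documents."""
    return sum(grades)

def dcg(grades):
    """Discounted cumulative gain with a log2(i) discount from rank 2 onwards."""
    return grades[0] + sum(g / math.log2(i) for i, g in enumerate(grades[1:], start=2))

def ndcg(grades):
    """Normalize by the DCG of the ideally ordered list (grades sorted descending)."""
    ideal = sorted(grades, reverse=True)
    return dcg(grades) / dcg(ideal)

print(ndcg([1, 3, 2, 0]))  # < 1.0, because the highly-relevant doc is not ranked first
```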
What’s a test collection? Name 2 most used one.
A test collection has:
* Collection of docs/items
* Set of information need (queries/topics)
* relevance judgments for the needs (qrels)
- TREC
Goal: let teams of researchers evaluate their method on a standardized test collection for a task
* Neutral evaluation platform
* Relevance assessments are collected from participants, can be re-used for years
* Multiple ‘tracks’ (= tasks)
- MS MARCO
Most used collection for training and evaluating ranking models.
Official Evaluation metric: MRR@10
The information needs are anonymized natural language questions drawn from Bing’s query logs. (often ambiguous, poorly formulated, may have typos and errors)
In the training set, there are a total of 532.8K (query, relevant passage) pairs over 502.9K unique queries.
What is the drawback of MS MARCO?
The relevance assessments are sparse (shallow): many queries, but on average, only one relevant judgment per query.
Consequences:
1. Model training needs both positive and negative examples, but the negative examples in the collection are not necessarily irrelevant; they are just not labelled as relevant;
2. It is difficult to distinguish between models when there is only one explicit relevance label per query
When using Precision@10, what assumptions do we make about the user?
- The user only view top-10 results
- The user doesn’t care about the ranking within the top-10
- The user doesn’t care about recall (i.e., whether all relevant docs are retrieved)
The university asks you what the quality of the search engine is.
1. How would you answer that question?
2. What are the challenges?
- We need queries, documents and a set of relevance assessments.
@ Queries and docs:
* sufficient number of queries
* queries & docs need to be representative of real information needs
* sufficient relevant docs per query
@ Complete relevance judgements:
Ideal: a judgement for each doc in the collection for each query
But for a real collection that is not possible, so we create a pool of documents per query, retrieved by multiple baseline IR systems, and only judge the pooled documents
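A minimal sketch of pooling, assuming each baseline system contributes a ranked list of document ids for the query; the union of the top-k results per system forms the pool that assessors judge (the depth k and the structure of `runs` are illustrative assumptions):

```python
def build_pool(runs, k=100):
    """runs: one ranked list of doc ids per IR system, all for the SAME query.
    Returns the set of documents to be judged for that query."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])
    return pool

# Two baseline systems, pool depth 2: only 3 distinct docs need to be judged.
print(build_pool([["d1", "d2", "d3"], ["d2", "d4", "d1"]], k=2))  # {'d1', 'd2', 'd4'}
```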
What is NDCG? Give the equations
The Normalized Discounted Cumulative Gain (NDCG) formula, with its sub-equations:
NDCG = DCG / IDCG
Where:
DCG (Discounted Cumulative Gain) is calculated as
DCG = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2(i+1)}
(this card uses a \log_2(i+1) discount, a common alternative to the \log_2 i discount used above)
IDCG (Ideal Discounted Cumulative Gain) is calculated by sorting the relevance scores in descending order and applying the DCG formula to the sorted scores.
Sub-equations:
* rel_i: the relevance score of the i-th ranked document
* \log_2(x): the logarithm with base 2
* i: the rank position of the document, starting from 1
* n: the total number of documents in the ranked list
Consider the following table representing the relevance scores of documents in a search result:
Rank | Relevance
1 | 3
2 | 2
3 | 1
4 | 2
Calculate the Discounted Cumulative Gain (DCG) for this ranking.
To calculate the DCG, we apply the DCG formula, which sums up the relevance scores of the ranked documents discounted by the logarithm of their positions. Let’s calculate the DCG for the given table:
DCG = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2(i+1)}
Calculating the DCG:
DCG = 3 + (2 / log2(3)) + (1 / log2(4)) + (2 / log2(5))
    = 3 + (2 / 1.585) + (1 / 2) + (2 / 2.322)
    ≈ 5.62
Therefore, the Discounted Cumulative Gain (DCG) for the given ranking is approximately 5.62.
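A quick check of this arithmetic, as a minimal Python sketch using the \log_2(i+1) discount assumed on this card:

```python
import math

grades = [3, 2, 1, 2]  # relevance grades at ranks 1..4 from the table above
dcg = grades[0] + sum(g / math.log2(i + 1) for i, g in enumerate(grades[1:], start=2))
print(round(dcg, 2))  # 5.62
```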
How do you calculate Average Precision?
1. Identify the positions in the ranked list where relevant documents are retrieved (each point in the list where recall increases).
2. Calculate the precision at each position where a relevant document is retrieved. Precision at a position k is the number of relevant documents retrieved up to position k divided by k.
3. Sum up the precision scores calculated in step 2.
4. Divide the sum from step 3 by the total number of relevant documents in the collection.
Here is the equation for Average Precision (AP):
AP = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\#\text{relevant items in collection}}
Where:
* P(k): the precision at position k
* rel(k): a binary indicator (1 or 0) of whether the document at position k is relevant
* n: the total number of documents in the ranked list
* #relevant items in collection: the total number of relevant documents for the query
Consider the following table representing the relevance of documents in a search result:
Rank | Relevant?
1 | Yes
2 | No
3 | Yes
4 | No
5 | Yes
Calculate the Average Precision (AP) for this retrieval using the equation above.
To calculate the Average Precision (AP), we need to identify the positions where relevant documents are retrieved and calculate the precision at those positions. Let’s calculate the AP for the given table:
Relevant Documents: 3 (documents with relevance “Yes”)
Precision_at_k: Precision at rank position k, which is the number of relevant documents retrieved up to position k divided by k.
Calculating the AP:
Precision_at_1 = 1 / 1 = 1.0
Precision_at_3 = 2 / 3 = 0.67
Precision_at_5 = 3 / 5 = 0.6
AP = (1.0 + 0.67 + 0.6) / 3 ≈ 0.756
Therefore, the Average Precision (AP) for the given retrieval is approximately 0.756.
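A short sketch that reproduces this AP calculation; relevance is a list of 0/1 labels in ranked order, and the total number of relevant documents in the collection (3) is passed in explicitly, since in general not all relevant documents are retrieved:

```python
def average_precision(relevance, num_relevant_in_collection):
    """Sum precision@k over the ranks k where a relevant document appears,
    then divide by the total number of relevant documents in the collection."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant position
    return precision_sum / num_relevant_in_collection

print(round(average_precision([1, 0, 1, 0, 1], 3), 3))  # 0.756
```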