W2 Evaluation Flashcards
1) What are we evaluating for an IR system?
2) What are the evaluation metrics for IR?
1) the ranking and the relevance
2) there are 2 types of metrics:
A. Set metrics: treat the top-k retrieved items as a set:
Precision@k, Recall@k, F1 score
For a given query we have: the set of relevant documents (relevant/non-relevant) and the set of retrieved documents up to a certain rank k
B. Ranked Lists
MRR (one relevant item per query/only the highest ranked relevant result counts)
MAP (Mean Average Precision): summarizes the precision-recall curve
These are used for binary relevance (relevant - non-relevant)
How do you calculate precision?
Precision@k = (relevant docs in the top-k) / (retrieved docs in the top-k, i.e. k)
What’s the equation for the F1-score?
F1 = 2 * (precision * recall) / (precision + recall)
Give the equation for MRR (for 1 AND more relevant items)
RR (Reciprocal Rank) = 1/rank of relevant item or 1/rank of highest ranked relevant item
MRR (Mean Reciprocal Rank) = average over a set of queries
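A minimal sketch of RR/MRR in Python, assuming each ranked list is given as binary relevance labels (1 = relevant, 0 = not relevant); the function names are just for illustration:

```python
def reciprocal_rank(relevance):
    """Return 1/rank of the highest-ranked relevant item, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over a set of queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)

# Example: query 1 has its first relevant result at rank 2, query 2 at rank 1.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (0.5 + 1.0) / 2 = 0.75
```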
What’s the equation for recall?
Recall@k = (relevant docs in the top-k) / (all relevant docs in the collection)
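A small sketch of the set metrics above, assuming binary relevance; the ranked list is a list of document ids and the relevant documents are given as a set (function names are illustrative):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked_ids, relevant_ids, k)
    r = recall_at_k(ranked_ids, relevant_ids, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Example: 2 of the top-3 results are relevant; 3 relevant docs exist in total.
ranked = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}
print(precision_at_k(ranked, relevant, 3))  # 2/3
print(recall_at_k(ranked, relevant, 3))     # 2/3
```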
What are the limitations of precision and recall for search engine evaluation?
- Relevance assessments tend to be incomplete -> recall is unknown
- Ranking is not taken into account: position 2 is more important than position 10
For multi-level relevance, we can use:
1. cumulative gain. What is that?
2. Discounted Cumulative Gain (DCG). How to calculate it?
3. Normalized Discounted Cumulative Gain (nDCG). How to calculate it?
- The sum of the relevance judgments of the retrieved documents:
CG@n = \sum_{i=1}^{n} r_i
where n is the number of results in the ranked list that we consider and r_i is the relevance grade of result i
- The gain of a doc degrades with its rank: the lower it is ranked, the lower the probability that the user sees it
DCG@n = r_1 + \sum_{i=2}^{n} \frac{r_i}{\log_2 i}
- Since different queries have different scales, it is best to normalize to a 0-1 scale so that queries can be compared.
nDCG = DCG/iDCG
iDCG = the DCG for the ideally ranked list (first all highly-relevant, then relevant, then non-relevant)
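A sketch of CG, DCG and nDCG for graded relevance, following the formulas above (DCG = r_1 + \sum_{i \ge 2} r_i / \log_2 i); it assumes the input is a non-empty list of relevance grades in ranked order:

```python
import math

def cg(grades):
    """Cumulative gain: sum of the relevance grades of the retrieved documents."""
    return sum(grades)

def dcg(grades):
    """Discounted cumulative gain with a log2(i) discount from rank 2 onwards."""
    return grades[0] + sum(g / math.log2(i) for i, g in enumerate(grades[1:], start=2))

def ndcg(grades):
    """Normalize by the DCG of the ideally ordered list (grades sorted descending)."""
    ideal = sorted(grades, reverse=True)
    return dcg(grades) / dcg(ideal)

print(ndcg([1, 3, 2, 0]))  # < 1.0, because the highly-relevant doc is not ranked first
```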
What’s a test collection? Name 2 most used one.
A test collection has:
* Collection of docs/items
* Set of information need (queries/topics)
* relevance judgments for the needs (qrels)
- TREC
Goal: let teams of researchers evaluate their method on a standardized test collection for a task
* Neutral evaluation platform
* Relevance assessments are collected from participants, can be re-used for years
* Multiple ‘tracks’ (= tasks)
- MS MARCO
Most used collection for training and evaluating ranking models.
Official Evaluation metric: MRR@10
The information needs are anonymized natural language questions drawn from Bing’s query logs. (often ambiguous, poorly formulated, may have typos and errors)
In the training set, there are a total of 532.8K (query, relevant passage) pairs over 502.9K unique queries.
What is the drawback of MS MARCO?
The relevance assessments are sparse (shallow): many queries, but on average, only one relevant judgment per query.
Consequences:
1. Model training needs both positive and negative examples, but the negative examples in the collection are not necessarily irrelevant; they are just not labelled as relevant;
2. It is difficult to distinguish between models when there is only one explicit relevance label per query
When using Precision@10, what assumptions do we make about the user?
- The user only view top-10 results
- The user doesn’t care about the ranking within the top-10
- The user doesn’t care about recall (i.e., whether all relevant docs are retrieved)
The university asks you what the quality of the search engine is.
1. How would you answer that question?
2. What are the challenges?
- We need queries, documents and a set of relevance assessments.
@ Queries and docs:
* sufficient number of queries
* queries & docs need to be representative of real information needs
* sufficient relevant docs per query
@ Complete relevance judgements:
Ideal: a judgement for each doc in the collection for each query
But for a real collection that is not possible, so we create a pool of documents per query, retrieved by multiple baseline IR systems, and only judge the pooled documents
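A minimal sketch of pooling, assuming each baseline system contributes a ranked list of document ids for the query; the union of the top-k results per system forms the pool that assessors judge (the depth k and the structure of `runs` are illustrative assumptions):

```python
def build_pool(runs, k=100):
    """runs: one ranked list of doc ids per IR system, all for the SAME query.
    Returns the set of documents to be judged for that query."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])
    return pool

# Two baseline systems, pool depth 2: only 3 distinct docs need to be judged.
print(build_pool([["d1", "d2", "d3"], ["d2", "d4", "d1"]], k=2))  # {'d1', 'd2', 'd4'}
```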
What is NDCG? Give the equations
The Normalized Discounted Cumulative Gain (NDCG) formula, with its sub-equations:
NDCG = DCG / IDCG
Where:
DCG (Discounted Cumulative Gain) is calculated as
DCG = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2(i+1)}
(this card uses a \log_2(i+1) discount, a common alternative to the \log_2 i discount used above)
IDCG (Ideal Discounted Cumulative Gain) is calculated by sorting the relevance scores in descending order and applying the DCG formula to the sorted scores.
Sub-equations:
* rel_i: the relevance score of the i-th ranked document
* \log_2(x): the logarithm with base 2
* i: the rank position of the document, starting from 1
* n: the total number of documents in the ranked list
Consider the following table representing the relevance scores of documents in a search result:
Rank | Relevance
1 | 3
2 | 2
3 | 1
4 | 2
Calculate the Discounted Cumulative Gain (DCG) for this ranking.
To calculate the DCG, we apply the DCG formula, which sums up the relevance scores of the ranked documents discounted by the logarithm of their positions. Let’s calculate the DCG for the given table:
DCG = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2(i+1)}
Calculating the DCG:
DCG = 3 + (2 / log2(3)) + (1 / log2(4)) + (2 / log2(5))
    = 3 + (2 / 1.585) + (1 / 2) + (2 / 2.322)
    ≈ 5.62
Therefore, the Discounted Cumulative Gain (DCG) for the given ranking is approximately 5.62.
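A quick check of this arithmetic, as a minimal Python sketch using the \log_2(i+1) discount assumed on this card:

```python
import math

grades = [3, 2, 1, 2]  # relevance grades at ranks 1..4 from the table above
dcg = grades[0] + sum(g / math.log2(i + 1) for i, g in enumerate(grades[1:], start=2))
print(round(dcg, 2))  # 5.62
```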
How do you calculate Average Precision?
1. Identify the positions in the ranked list where relevant documents are retrieved (each point in the list where recall increases).
2. Calculate the precision at each position where a relevant document is retrieved. Precision at a position k is the number of relevant documents retrieved up to position k divided by k.
3. Sum up the precision scores calculated in step 2.
4. Divide the sum from step 3 by the total number of relevant documents in the collection.
Here is the equation for Average Precision (AP):
AP = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\#\text{relevant items in collection}}
Where:
* P(k): the precision at position k
* rel(k): a binary indicator (1 or 0) of whether the document at position k is relevant
* n: the total number of documents in the ranked list
* #relevant items in collection: the total number of relevant documents for the query
Consider the following table representing the relevance of documents in a search result:
Rank | Relevant?
1 | Yes
2 | No
3 | Yes
4 | No
5 | Yes
Calculate the Average Precision (AP) for this retrieval using the equation above.
To calculate the Average Precision (AP), we need to identify the positions where relevant documents are retrieved and calculate the precision at those positions. Let’s calculate the AP for the given table:
Relevant Documents: 3 (documents with relevance “Yes”)
Precision_at_k: Precision at rank position k, which is the number of relevant documents retrieved up to position k divided by k.
Calculating the AP:
Precision_at_1 = 1 / 1 = 1.0
Precision_at_3 = 2 / 3 = 0.67
Precision_at_5 = 3 / 5 = 0.6
AP = (1.0 + 0.67 + 0.6) / 3 ≈ 0.756
Therefore, the Average Precision (AP) for the given retrieval is approximately 0.756.
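A short sketch that reproduces this AP calculation; relevance is a list of 0/1 labels in ranked order, and the total number of relevant documents in the collection (3) is passed in explicitly, since in general not all relevant documents are retrieved:

```python
def average_precision(relevance, num_relevant_in_collection):
    """Sum precision@k over the ranks k where a relevant document appears,
    then divide by the total number of relevant documents in the collection."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant position
    return precision_sum / num_relevant_in_collection

print(round(average_precision([1, 0, 1, 0, 1], 3), 3))  # 0.756
```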