Evaluation of IR Systems Flashcards
How can you tell qualitatively if users are happy with your system?
- Search returns relevant results
- Search results get clicked a lot
- Users buy something after using the search
- You get repeat visitors
How is relevance assessed?
Relative to the user's information need, not the query provided
What are some reasons we evaluate our systems?
- To assess the actual utility of the retrieval system for users
- To compare different systems and methods
What should be measured in an information retrieval system?
- Effectiveness/accuracy: how relevant are the search results
- Efficiency: How quickly can a user get results? How many resources are needed to answer the query?
- Usability: How useful is the system for real user tasks?
What are precision and recall?
Measures for assessing IR performance by looking at accuracy
Precision = TP/ (TP + FP)
Recall = TP / (TP + FN)
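A minimal sketch of these two formulas in Python, treating the retrieved and relevant documents as sets of IDs; the document IDs used here are made-up examples.

```python
# Precision and recall over sets of document IDs.

def precision(retrieved: set, relevant: set) -> float:
    # TP = retrieved docs that are relevant; FP = retrieved docs that are not.
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    # TP = retrieved docs that are relevant; FN = relevant docs not retrieved.
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 ≈ 0.667
```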
What is the precision/recall tradeoff?
High recall tends to come at the cost of low precision.
Increasing the number of documents retrieved can only keep recall the same or raise it, so retrieving every document gives 100% recall but very poor precision.
Conversely, it is easy to get high precision with low recall by returning only a few documents that are very likely to be relevant.
What is the F-measure and what is the equation for the F-1 score?
Allows us to trade off precision and recall with a single measure
F1 = 2PR / (P + R)
General F-measure: F = ((β^2 + 1)PR) / (β^2 P + R), where β = 1 gives the F1 score
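A minimal sketch of the general F-measure; the precision, recall, and β values below are example numbers, not figures from the text.

```python
# F-measure: F = ((beta^2 + 1) * P * R) / (beta^2 * P + R); beta = 1 gives F1.

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    if p == 0 and r == 0:
        return 0.0
    return ((beta**2 + 1) * p * r) / (beta**2 * p + r)

print(f_measure(0.5, 0.667))           # F1 ≈ 0.571
print(f_measure(0.5, 0.667, beta=2.0))  # beta > 1 weights recall more heavily
```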
Why are precision and recall metrics often meaningless and what can we do instead?
Meaningless because the metrics are computed over an unordered set of results and ignore the context of the system's use case.
Instead, it is more informative to compare how each system ranks the documents.
Ranking-based evaluation considers both the relevance of the retrieved documents and the order in which they are retrieved.
What is average precision and how do we calculate it?
It is the standard measure for comparing two ranking methods for a single query.
Sum the precision at each rank where a relevant document is retrieved, then divide by the total number of relevant documents for the query
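A minimal sketch of that calculation; the ranking and relevant set are hypothetical examples.

```python
# Average precision for one ranked result list and one query.

def average_precision(ranking: list, relevant: set) -> float:
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

ranking = ["d3", "d7", "d1", "d9", "d5"]
relevant = {"d1", "d3", "d5"}
# Relevant docs appear at ranks 1, 3, 5 -> (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
print(average_precision(ranking, relevant))
```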
What is mean average precision (MAP) and how do we calculate it?
Mean of average precision over a set of queries
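A minimal sketch of MAP, reusing the average_precision function from the previous card; the query IDs and result lists are invented examples.

```python
# MAP: average the per-query average precision over a set of queries.

def mean_average_precision(runs: dict) -> float:
    scores = [average_precision(ranking, relevant)
              for ranking, relevant in runs.values()]
    return sum(scores) / len(scores) if scores else 0.0

runs = {
    "q1": (["d3", "d7", "d1"], {"d1", "d3"}),   # AP ≈ 0.833
    "q2": (["d2", "d4", "d6"], {"d6"}),         # AP ≈ 0.333
}
print(mean_average_precision(runs))  # ≈ 0.583
```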
What is discounted cumulative gain?
A method for evaluating information retrieval when there are multiple levels of relevancy. Gain measures how much relevant information a user can gain by looking at each document
What are the 2 assumptions behind discounted cumulative gain?
- Highly relevant documents are more useful than marginally relevant documents
- The lower the ranked position of a relevant document, the less useful it is for the user since it is less likely to be examined
How do we calculate cumulative gain?
The sum of the relevance scores of the retrieved documents, where a higher score means the document is more relevant
How do we calculate discounted cumulative gain?
Discount each relevance score by a factor that depends on its rank.
The typical discount is 1/log2(rank), with no discount applied at rank 1.
DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
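A minimal sketch of that formula; the graded relevance scores below are an invented example list, given in ranked order.

```python
# DCG with the 1/log2(rank) discount; rank 1 is not discounted.
import math

def dcg(rels: list) -> float:
    return sum(r if rank == 1 else r / math.log2(rank)
               for rank, r in enumerate(rels, start=1))

rels = [3, 2, 3, 0, 1]
# DCG = 3 + 2/log2(2) + 3/log2(3) + 0/log2(4) + 1/log2(5) ≈ 7.32
print(dcg(rels))
```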
What is the ideal discounted cumulative gain?
The DCG of the best possible ranking of the documents, i.e. the documents ordered so that the highest relevance scores appear at the top of the list
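A minimal sketch of the ideal DCG, reusing the dcg function from the previous card: it simply re-scores the same relevance values sorted from highest to lowest.

```python
# Ideal DCG: DCG of the relevance scores in descending order.

def idcg(rels: list) -> float:
    return dcg(sorted(rels, reverse=True))

rels = [3, 2, 3, 0, 1]
# Ideal ordering is [3, 3, 2, 1, 0] -> DCG ≈ 7.76
print(idcg(rels))
```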