IR – Introduction, Evaluation Flashcards
Lecture 3
What is Information Retrieval?
IR is the task of finding the n most relevant documents in a collection of documents for a given query.
What is Inverted Index?
It stores all statistics PER TERM:
- document frequency: how many documents contain the term
- term frequency per document: how often does the term appear in each document?
- document length
- average document length
What is a posting list?
It is the structure in which the inverted index is stored. For each term in the dictionary there is a posting list of docId:num pairs, where docId identifies a document and num is the number of times the term appears in that document.
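A minimal sketch of how such an index could be built, assuming documents are already tokenized into terms (the structure and names here are illustrative, not from the lecture):

```python
from collections import Counter

docs = {
    1: "the cat sat on the mat".split(),
    2: "the dog chased the cat".split(),
}

index = {}  # term -> posting list of (docId, term frequency) pairs
for doc_id, terms in docs.items():
    for term, tf in Counter(terms).items():
        index.setdefault(term, []).append((doc_id, tf))

print(index["the"])  # [(1, 2), (2, 2)]
print(index["cat"])  # [(1, 1), (2, 1)]
```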
Explain Bag-of-words
A document is represented only by the number of occurrences of each term; word order is ignored.
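A quick sketch of the idea using Python's Counter (the choice of Counter is an assumption of this note, not prescribed by the lecture):

```python
from collections import Counter

doc = "to be or not to be".split()
bow = Counter(doc)  # word order is gone, only counts remain
print(bow)          # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```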
Explain the logic behind Term Frequency and formula.
TF(t,d) = how often term t appears in document d
The more often a query term appears in a document, the more relevant the document is likely to be. Experiments show that using the logarithm is more effective than using the raw count:
log(1 + TF(t,d))
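A tiny worked example of the log scaling; the lecture does not fix the base of the logarithm, so the natural log is assumed here:

```python
import math

for tf in (1, 10, 100):
    # the raw count grows 100x, the log-scaled weight grows much more slowly
    print(tf, round(math.log(1 + tf), 2))
# prints: 1 0.69, then 10 2.4, then 100 4.62
```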
Explain Inverse Document Frequency
The main idea is that rare words are more informative than common words. If a document contains a very rare query word, it is highly likely that the document is relevant. That means we need a way to tell whether a word is rare or not. We do that by counting in how many documents the word appears:
DF(t) = in how many documents term t appears
IDF(t) = log(D / DF(t))
D is the total number of documents. The logarithm is used because it works better in practice.
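A small worked example, assuming a hypothetical corpus of D = 1000 documents:

```python
import math

D = 1000                   # hypothetical corpus size
for df in (1000, 100, 1):  # very common term ... very rare term
    print(df, round(math.log(D / df), 2))
# prints: 1000 0.0, then 100 2.3, then 1 6.91
```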
What is TF-IDF?
Term frequency - inverse document frequency is a score that shows how relevant a document is to a given query:
TF-IDF(q, d) = sum over terms t in q of log(1 + TF(t,d)) * log(D / DF(t))
The score increases the more often a term appears in the document, and it increases if the term is rare (it rarely appears in the corpus of documents).
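A minimal sketch of this scoring formula; the tf and df lookup tables are hypothetical helpers, not structures defined in the lecture:

```python
import math

def tf_idf(query_terms, doc_id, tf, df, D):
    """tf: (term, docId) -> count, df: term -> document frequency, D: total docs."""
    score = 0.0
    for t in query_terms:
        if df.get(t, 0) == 0:
            continue  # term never appears in the corpus, contributes nothing
        score += math.log(1 + tf.get((t, doc_id), 0)) * math.log(D / df[t])
    return score
```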
What is BM25?
An improved version of TF-IDF.
It has two hyperparameters that have to be set by the developer:
k1 - controls term frequency scaling
b - controls document length normalization (b=0 -> no length normalization, b=1 -> relative frequency, fully scaled by document length)
TF-IDF keeps increasing as the term frequency grows, while BM25 saturates much sooner (after a word appears roughly 12-15 times, its contribution to the relevance score should no longer increase).
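A sketch of one common BM25 variant, reusing the hypothetical tf/df tables from the sketch above and the frequently used defaults k1 = 1.2 and b = 0.75 (note that several IDF formulations of BM25 exist):

```python
import math

def bm25(query_terms, doc_id, tf, df, doc_len, avg_len, D, k1=1.2, b=0.75):
    score = 0.0
    for t in query_terms:
        if df.get(t, 0) == 0:
            continue
        f = tf.get((t, doc_id), 0)
        idf = math.log(D / df[t])
        # length normalization: b=0 ignores document length, b=1 fully scales by it
        norm = k1 * (1 - b + b * doc_len[doc_id] / avg_len)
        # f / (f + norm) saturates, so repeated occurrences add less and less
        score += idf * f * (k1 + 1) / (f + norm)
    return score
```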
Precision vs Recall
Precision: How many of the selected items are relevant? TP / (TP + FP)
Recall: How many of the total relevant items are selected? TP / (TP + FN)
Recall can be made very high simply by retrieving more documents: if we retrieve all possible documents, we get a recall of 1, because all relevant documents will be retrieved.
Conversely, if we retrieve only 1-2 documents that we are 99.999% sure are relevant, precision will be high, but if there are 200 other relevant documents, recall will be low.
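A small sketch computing both metrics from sets of document ids (the numbers are made up for illustration):

```python
retrieved = {1, 2, 5, 7}         # document ids returned by the system
relevant = {2, 3, 5}             # document ids labeled as relevant

tp = len(retrieved & relevant)   # true positives: docs 2 and 5
precision = tp / len(retrieved)  # 2 / 4 = 0.5
recall = tp / len(relevant)      # 2 / 3 ≈ 0.67
```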
What is F1 score?
It is the harmonic mean of precision and recall:
F1 = 2 * (P * R) / (P + R)
It balances the trade-off between the two: a high F1 requires both precision and recall to be high.
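Continuing the made-up precision/recall numbers from the sketch above:

```python
p, r = 0.5, 2 / 3
f1 = 2 * (p * r) / (p + r)
print(round(f1, 2))  # 0.57 - the harmonic mean is pulled toward the lower value
```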
What is MRR?
Mean Reciprocal Rank is a binary evaluation metric that only looks at the rank of the first relevant document and ignores the rest of the retrieved documents.
(T F T F F) -> RR = 1
(F F T F F) -> RR = 1/3
(F T T F F) -> RR = 1/2
MRR = (1 + 1/3 + 1/2) / 3
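A sketch that reproduces this worked example (True = relevant at that rank):

```python
def reciprocal_rank(ranking):
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            return 1 / rank  # only the first relevant document counts
    return 0.0

rankings = [
    [True, False, True, False, False],   # RR = 1
    [False, False, True, False, False],  # RR = 1/3
    [False, True, True, False, False],   # RR = 1/2
]
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
print(round(mrr, 3))  # 0.611
```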
What is MAP?
Mean Average Precision is a binary evaluation metric that looks at all retrieved relevant documents. Its value is harder to interpret than precision or recall.
Assume there are a total of 2 relevant documents:
(T F T F F) -> AP = (1 + 2/3) / 2
(F F T F F) -> AP = (1/3 + 0) / 2
(F T T F F) -> AP = (1/2 + 2/3) / 2
MAP = the mean of the three AP values
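A sketch that reproduces this worked example, with 2 relevant documents in total:

```python
def average_precision(ranking, total_relevant):
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant hit
    return precision_sum / total_relevant  # missed relevant docs contribute 0

rankings = [
    [True, False, True, False, False],   # AP = (1 + 2/3) / 2
    [False, False, True, False, False],  # AP = (1/3 + 0) / 2
    [False, True, True, False, False],   # AP = (1/2 + 2/3) / 2
]
map_score = sum(average_precision(r, 2) for r in rankings) / len(rankings)
```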
What is nDCG?
Problem with binary evaluation metrics: some documents might be more relevant than others. Labeling then has to be done on a graded scale, e.g. 0 = irrelevant up to 3 = perfectly relevant.
Assume two relevant documents exist, with relevance 3 and 1.
The ideal ranking retrieves the relevance-3 document first and the relevance-1 document second:
Ideal DCG = 3 / log2(1 + 1) + 1 / log2(2 + 1) = 3.63
(1 0 3 0 0) -> nDCG = (1/log2(1 + 1) + 3/log2(3 + 1)) / idealDCG = 0.69
(0 0 1 0 0) -> nDCG = (1/log2(3 + 1)) / idealDCG = 0.14
(0 3 1 0 0) -> nDCG = (3/log2(2 + 1) + 1/log2(3 + 1)) / idealDCG = 0.66
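A sketch that reproduces this worked example from the graded relevance values:

```python
import math

def dcg(gains):
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

ideal_dcg = dcg([3, 1])  # ≈ 3.63
for gains in ([1, 0, 3, 0, 0], [0, 0, 1, 0, 0], [0, 3, 1, 0, 0]):
    print(round(dcg(gains) / ideal_dcg, 2))
# prints 0.69, 0.14, 0.66
```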