W3 Boolean Retrieval & Indexing Compression Flashcards
What is this table called?
| Term   | Doc 1 | Doc 2 | Doc 3 | Doc 4 |
|--------|-------|-------|-------|-------|
| Word 1 | 1 | 0 | 1 | 1 |
| Word 2 | 1 | 1 | 0 | 0 |
| Word 3 | 0 | 1 | 1 | 1 |
| Word 4 | 1 | 0 | 0 | 1 |
Term-Document Incidence Matrix
Consider the following table representing the relevance scores of documents in a search result:
| Rank | Relevance |
|------|-----------|
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
Calculate the Discounted Cumulative Gain (DCG) for this ranking.
To calculate the DCG, we apply the DCG formula, which sums up the relevance scores of the ranked documents discounted by the logarithm of their positions. Let’s calculate the DCG for the given table:
DCG = sum(rel_i / log2(i + 1)) for i = 1..n
Since log2(2) = 1, the rank-1 document is effectively undiscounted.
Calculating the DCG:
DCG = 3 / log2(2) + 2 / log2(3) + 1 / log2(4) + 2 / log2(5) = 3 + (2 / 1.585) + (1 / 2) + (2 / 2.322) = 3 + 1.262 + 0.5 + 0.861 ≈ 5.62
Therefore, the Discounted Cumulative Gain (DCG) for the given ranking is approximately 5.62.
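The calculation above can be sketched as a small Python helper (a minimal sketch using the sum-over-all-ranks form of DCG, not part of the course material):

```python
import math

def dcg(relevances):
    """DCG = sum of rel_i / log2(i + 1) over ranks i = 1..n.

    The rank-1 document is effectively undiscounted, since log2(2) = 1.
    """
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

print(round(dcg([3, 2, 1, 2]), 2))  # → 5.62
```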
Recommend a query processing order for the query below, given
the postings list sizes in the table:
(rocket OR launch) AND (sky OR universe) AND (sun OR orbit)
| Term     | Postings size |
|----------|---------------|
| rocket   | 81300 |
| launch   | 109999 |
| sky      | 43123 |
| universe | 48491 |
| sun      | 62687 |
| orbit    | 577513 |
- rocket OR launch: 81300 + 109999 = 191299
- sky OR universe: 43123 + 48491 = 91614
- sun OR orbit: 62687 + 577513 = 640200
- Process the clauses in increasing order of their estimated result size (the sum of postings sizes is an upper bound on the size of each OR): (sky OR universe), then (rocket OR launch), then (sun OR orbit).
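The ordering heuristic can be sketched in Python (a minimal sketch; the table's postings sizes are hard-coded for illustration):

```python
# Postings sizes from the table above.
sizes = {"rocket": 81300, "launch": 109999, "sky": 43123,
         "universe": 48491, "sun": 62687, "orbit": 577513}

# Estimate each OR-clause by the sum of its postings sizes (an upper
# bound on the size of the union), then process smaller clauses first
# so that intermediate AND results stay as small as possible.
clauses = [("rocket", "launch"), ("sky", "universe"), ("sun", "orbit")]
order = sorted(clauses, key=lambda c: sizes[c[0]] + sizes[c[1]])
print(order)  # (sky OR universe) first, (sun OR orbit) last
```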
Consider these documents:
- Doc 1: new recipe for cookies
- Doc 2: cookie dough recipe
- Doc 3: new chocolate dough
- Doc 4: recipe for chocolate cookies
a) Draw the term-document incidence matrix for this document collection.
b) Draw the inverted index representation for this collection.
c) What is the result set for the query: recipe AND NOT (chocolate OR dough)?
a) The term-document incidence matrix for this document collection can be represented as follows:
| Terms     | Doc 1 | Doc 2 | Doc 3 | Doc 4 |
|-----------|-------|-------|-------|-------|
| recipe    | 1 | 1 | 0 | 1 |
| cookies   | 1 | 0 | 0 | 1 |
| dough     | 0 | 1 | 1 | 0 |
| chocolate | 0 | 0 | 1 | 1 |
| new       | 1 | 0 | 1 | 0 |
In this matrix, each row represents a term, and each column represents a document. The value in each cell indicates the presence (1) or absence (0) of the term in the corresponding document.
b) The inverted index representation for this collection is as follows:
| Terms     | Documents           |
|-----------|---------------------|
| recipe    | Doc 1, Doc 2, Doc 4 |
| cookies   | Doc 1, Doc 4        |
| dough     | Doc 2, Doc 3        |
| chocolate | Doc 3, Doc 4        |
| new       | Doc 1, Doc 3        |
In this representation, each row represents a term, and the corresponding column lists the documents in which the term appears.
c) The result set for the query “recipe AND NOT (chocolate OR dough)” contains the documents that include the term “recipe” but contain neither “chocolate” nor “dough.” From the inverted index, recipe = {Doc 1, Doc 2, Doc 4} and chocolate OR dough = {Doc 2, Doc 3, Doc 4}, so the result set is {Doc 1}. Doc 2 contains “dough,” Doc 4 contains “chocolate,” and Doc 3 does not contain “recipe” at all.
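Both the inverted index and the Boolean query evaluation can be sketched with Python sets (a minimal sketch with no stemming or stop-word removal):

```python
# Build an inverted index over the four documents, then evaluate
# "recipe AND NOT (chocolate OR dough)" with set operations.
docs = {
    1: "new recipe for cookies",
    2: "cookie dough recipe",
    3: "new chocolate dough",
    4: "recipe for chocolate cookies",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# AND = intersection, OR = union, AND NOT = difference.
result = index["recipe"] - (index["chocolate"] | index["dough"])
print(sorted(result))  # → [1]
```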
Name 4 methods of compression (relevant for this exam)
- Zipf’s law
- Huffman encoding
- Entropy
- Posting list compression
What’s Zipf’s law?
The k-th most frequent term has frequency proportional to 1/k: a few terms are very common, and many terms are very rare. For compression, this skewed distribution means an index mostly stores many small numbers and only a few larger ones.
What is the equation for collection frequency in the context of Zipf’s law?
The collection frequency cf_t is the total number of occurrences of term t in the collection (a raw count, in contrast to document frequency, which counts how many documents contain t). In the context of Zipf's law, if cf_i is the collection frequency of the i-th most common term, then cf_i ∝ 1/i, i.e. cf_i ≈ c / i for some collection-dependent constant c. This is why the frequency of any term is inversely proportional to its rank in the frequency table, and why a rank-frequency plot is roughly a straight line on a log-log scale.
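The proportionality can be illustrated with hypothetical collection frequencies (the counts below are invented for illustration, not from any real collection):

```python
# Hypothetical collection frequencies for the top-ranked terms.
# Under Zipf's law, cf_i ≈ c / i, so rank * frequency is roughly
# constant across ranks (here exactly c = 60000 by construction).
cf = {1: 60000, 2: 30000, 3: 20000, 4: 15000}  # rank -> frequency
products = [rank * freq for rank, freq in cf.items()]
print(products)  # → [60000, 60000, 60000, 60000]
```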
What is Huffman coding?
A prefix code built from a binary tree constructed bottom-up: repeatedly merge the two least frequent nodes until one tree remains. Frequent symbols end up near the root and get shorter codes.
Must watch if not clear (1 minute): https://youtu.be/JsTptu56GM8?t=220
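The bottom-up construction can be sketched in Python with a min-heap (a minimal sketch; the example frequencies are illustrative, not from the course):

```python
import heapq

def huffman_codes(frequencies):
    """Build Huffman codes bottom-up: repeatedly merge the two least
    frequent nodes; frequent symbols end up with shorter codes."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # least frequent
        f2, _, codes2 = heapq.heappop(heap)  # second least frequent
        # Merging prepends a bit: left subtree gets 0, right gets 1.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(codes["a"], codes["f"])  # most frequent symbol gets the shortest code
```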
Consider the following postings list for a term “apple”:
Postings list: [10, 15, 18, 22, 25, 28, 32]
Perform postings compression on the given list using delta encoding and variable-length encoding techniques.
Step 1: Delta Encoding
Replace each document ID (except the first) with the gap from its predecessor:
Postings list: [10, 15, 18, 22, 25, 28, 32]
Gaps: [10, 5, 3, 4, 3, 3, 4]
Step 2: Variable-Length Encoding
Encode each gap with variable byte (VB) encoding: split the number into 7-bit chunks and set the high bit of the final byte to 1 as a terminator, so small gaps take fewer bytes. (Writing each gap as its bare binary digits would not work: without length information the bit stream is not uniquely decodable.)
Gaps: [10, 5, 3, 4, 3, 3, 4]
VB-encoded gaps (one byte each, since every gap is below 128):
[10001010, 10000101, 10000011, 10000100, 10000011, 10000011, 10000100]
The final compressed postings list is this byte sequence; the original document IDs are recovered by decoding the bytes and taking a running sum of the gaps.
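The combination of delta encoding and variable byte encoding (one standard variable-length scheme) can be sketched in Python:

```python
def vb_encode(number):
    """Variable byte encoding: 7 payload bits per byte; the high bit
    is set only on the final (least significant) byte of each number."""
    out = []
    while True:
        out.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    out[-1] += 128  # mark the terminating byte
    return out

def encode_postings(postings):
    # Delta-encode first (the first entry is kept as-is), then
    # VB-encode each gap; shown here as binary strings for readability.
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return [format(byte, "08b") for gap in gaps for byte in vb_encode(gap)]

print(encode_postings([10, 15, 18, 22, 25, 28, 32]))
# → ['10001010', '10000101', '10000011', '10000100',
#    '10000011', '10000011', '10000100']
```

Gaps of 128 or more simply span multiple bytes, e.g. `vb_encode(300)` yields the two bytes `[2, 172]` (2 × 128 + 44 = 300).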