C3: Boolean model and index construction Flashcards

Question 1

Q

term-document incidence matrix

Answer

A

rows are terms, columns are document, a 1 indicates the occurrence of that term in the document

But this explodes in size for many documents and the matrix is sparse => better to only record non-zero positions

Question 2

Q

inverted index

Answer

A

for each term, store a list of all documents that contain it (postings)

Question 3

Q

evaluation of a Boolean IR system

Answer

A

Boolean model does not perform ranking

precision: fraction of retrieved documents relevant to the user’s information need

recall: fraction of relevant documents in the collection that are retrieved

Question 4

Q

disadvantages of Boolean retrieval

Answer

A

based on binary decision criteria with no notion of partial matching
no ranking of the documents
information need has to be translated into a Boolean expression

Question 5

Q

positional indexes

Answer

A

Useful if we want to match phrases (New York University)

in the postings, store for each term the positions in which tokens of it appear

Question 6

Q

Zipf’s law

Answer

A

while only a few words are used very often, many are used rarely

if words are ranked by their frequency, the product of the rank number and frequency make up a constant: the k-th most frequent term has a frequency proportional to 1/k

Question 7

Q

Huffman coding

Answer

A

Create a leaf node for each symbol and add it to the
priority queue.
While there is more than one node in queue:
1. Remove two nodes of lowest probability from queue
2. Create new internal node with these two nodes as children => new frequency is sum of the two nodes’ frequencies
3. Add new node to queue

Question 8

Q

entropy

Answer

A

minimum number of bits per symbol for lossless encoding
H(P) = - sum P(x) log P(x)

Question 9

Q

why would we want index compression?

Answer

A

inverted lists are very large
compression of indexes saves disk and memory space
best compression techniques have good compression ratios and are easy to decompress

Question 10

Q

delta encoding

Answer

A

encode differences between document numbers (gaps): differences for high-frequency words are easier to compress and differences for low-frequency words are large

most gaps can be encoded with far fewer than 20 bits

Question 11

Q

variable length encoding

Answer

A

encode every gap integer with as few bits as needed: use the first bit of a byte as continuation bit, to indicate if the value continues in the next byte or not (7 bits left for encoding)

C3: Boolean model and index construction Flashcards

(11 cards)