C3: Boolean model and index construction Flashcards

1
Q

term-document incidence matrix

A

rows are terms, columns are documents; a 1 indicates that the term occurs in the document

But this matrix explodes in size for large collections and is extremely sparse => better to record only the non-zero positions
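A toy sketch in Python of the bit-matrix view, over a small invented collection: each term's row is a Python int used as a bit vector, so an AND query is a bitwise AND of two rows.

```python
# Toy incidence matrix: one int per term acts as a row of bits,
# with bit d set when the term occurs in (hypothetical) document d.
docs = [
    "brutus killed caesar",      # d0
    "caesar was ambitious",      # d1
    "brutus was honourable",     # d2
    "antony praised caesar",     # d3
]
rows = {}
for doc_id, text in enumerate(docs):
    for term in text.split():
        rows[term] = rows.get(term, 0) | (1 << doc_id)

# Boolean query "brutus AND caesar" = bitwise AND of the two rows
hits = rows["brutus"] & rows["caesar"]
print([d for d in range(len(docs)) if hits & (1 << d)])  # -> [0]
```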

2
Q

inverted index

A

for each term, store a list of all documents that contain it (postings)
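A minimal Python sketch: build the postings lists, then answer an AND query by merging two sorted lists in linear time (the document texts are invented).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists (AND query) in O(len(p1) + len(p2))."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

index = build_inverted_index(["new york university", "york minster", "university of york"])
print(intersect(index["york"], index["university"]))  # -> [0, 2]
```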

3
Q

evaluation of a Boolean IR system

A

the Boolean model returns an unranked set of documents, so it is evaluated with set-based measures:

precision: fraction of retrieved documents relevant to the user’s information need

recall: fraction of relevant documents in the collection that are retrieved
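A small worked example in Python, with hypothetical sets of retrieved and relevant document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)            # relevant documents retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant; 3 of the 6 relevant docs were found
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 8, 9, 10}))  # -> (0.75, 0.5)
```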

4
Q

disadvantages of Boolean retrieval

A
  • based on binary decision criteria with no notion of partial matching
  • no ranking of the documents
  • information need has to be translated into a Boolean expression
5
Q

positional indexes

A

Useful if we want to match phrases (e.g. "New York University")

in the postings, store for each term the positions at which it occurs in each document
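A Python sketch of a positional index plus a naive phrase match that checks for consecutive positions (the documents are invented):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} so phrases can be matched."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return IDs of docs where the phrase's terms occur consecutively."""
    terms = phrase.split()
    # candidate docs must contain every term of the phrase
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[terms[k]][doc_id] for k in range(1, len(terms))):
                hits.append(doc_id)
                break
    return sorted(hits)

idx = build_positional_index(["new york university", "university of new york"])
print(phrase_query(idx, "new york university"))  # -> [0]
```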

6
Q

Zipf’s law

A

while only a few words are used very often, many are used rarely

if words are ranked by their frequency, the product of rank and frequency is roughly constant: the k-th most frequent term has a frequency proportional to 1/k
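One way to eyeball the law in Python; corpus.txt is a hypothetical input file, and real text follows the law only approximately.

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Print rank, frequency, and rank * frequency for the most common terms.
    On a large natural-language corpus, rank * frequency stays roughly constant."""
    counts = Counter(tokens)
    for rank, (term, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {term:<15} {freq:>8}  {rank * freq:>8}")

# usage (hypothetical corpus file):
# zipf_table(open("corpus.txt").read().lower().split())
```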

7
Q

Huffman coding

A
  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:
    1. Remove the two nodes of lowest frequency from the queue
    2. Create a new internal node with these two nodes as children => its frequency is the sum of the two nodes’ frequencies
    3. Add the new node to the queue
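A compact Python sketch of exactly these steps, using the standard heapq module as the priority queue (the tiebreaker field just keeps the heap from comparing subtrees when frequencies are equal):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, tree), where a tree is either
    # a symbol (leaf) or a pair of subtrees (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    # Walk the tree: left edge = '0', right edge = '1'.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes("abracadabra"))  # frequent symbols get shorter codes
```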
8
Q

entropy

A

minimum number of bits per symbol for lossless encoding
H(P) = -∑_x P(x) log₂ P(x)
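A direct Python translation of the formula, estimating P(x) from symbol counts:

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy H(P) = -sum_x P(x) log2 P(x), in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy("aaab"))  # -> ~0.811 bits/symbol
print(entropy("abcd"))  # -> 2.0 bits/symbol (uniform over 4 symbols)
```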

9
Q

why would we want index compression?

A
  • inverted lists are very large
  • compression of indexes saves disk and memory space
  • the best compression techniques combine a good compression ratio with fast decompression
10
Q

delta encoding

A

encode differences between consecutive document numbers (gaps) instead of the numbers themselves: gaps for high-frequency words are small and compress well, while gaps for low-frequency words stay large

most gaps can be encoded with far fewer than 20 bits
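A minimal Python sketch of the gap transform and its inverse; the postings list is invented for illustration.

```python
def to_gaps(doc_ids):
    """Replace sorted document numbers with gaps between consecutive ones."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Invert the transform: a running sum restores the document numbers."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [33, 47, 154, 159, 202]
print(to_gaps(postings))             # -> [33, 14, 107, 5, 43]
print(from_gaps(to_gaps(postings)))  # -> [33, 47, 154, 159, 202]
```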

11
Q

variable length encoding

A

encode every gap integer with as few bytes as needed: use the first bit of each byte as a continuation bit that indicates whether the value continues in the next byte (leaving 7 bits per byte for the actual encoding)
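A Python sketch of one common variable-byte convention, assumed here to set the continuation bit on every byte except the last of a value (implementations differ on this detail):

```python
def vb_encode(n):
    """Variable-byte encode one gap: 7 payload bits per byte, high bit set
    to 1 when the value continues in the next byte."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)  # low 7 bits
        n >>= 7
        if n == 0:
            break
    chunks.reverse()             # most significant chunk first
    return bytes(c | 0x80 for c in chunks[:-1]) + bytes([chunks[-1]])

def vb_decode(data):
    """Decode a stream of variable-byte encoded gaps."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if not byte & 0x80:      # continuation bit clear: value ends here
            numbers.append(n)
            n = 0
    return numbers

encoded = b"".join(vb_encode(g) for g in [5, 130, 214577])
print(vb_decode(encoded))  # -> [5, 130, 214577]
```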
