C3: Boolean model and index construction Flashcards
term-document incidence matrix
rows are terms, columns are document, a 1 indicates the occurrence of that term in the document
But this explodes in size for many documents and the matrix is sparse => better to only record non-zero positions
inverted index
for each term, store a list of all documents that contain it (postings)
evaluation of a Boolean IR system
Boolean model does not perform ranking
precision: fraction of retrieved documents relevant to the user’s information need
recall: fraction of relevant documents in the collection that are retrieved
disadvantages of Boolean retrieval
- based on binary decision criteria with no notion of partial matching
- no ranking of the documents
- information need has to be translated into a Boolean expression
positional indexes
Useful if we want to match phrases (New York University)
in the postings, store for each term the positions in which tokens of it appear
Zipf’s law
while only a few words are used very often, many are used rarely
if words are ranked by their frequency, the product of the rank number and frequency make up a constant: the k-th most frequent term has a frequency proportional to 1/k
Huffman coding
- Create a leaf node for each symbol and add it to the
priority queue. - While there is more than one node in queue:
- Remove two nodes of lowest probability from queue
- Create new internal node with these two nodes as children => new frequency is sum of the two nodes’ frequencies
- Add new node to queue
entropy
minimum number of bits per symbol for lossless encoding
H(P) = - sum P(x) log P(x)
why would we want index compression?
- inverted lists are very large
- compression of indexes saves disk and memory space
- best compression techniques have good compression ratios and are easy to decompress
delta encoding
encode differences between document numbers (gaps): differences for high-frequency words are easier to compress and differences for low-frequency words are large
most gaps can be encoded with far fewer than 20 bits
variable length encoding
encode every gap integer with as few bits as needed: use the first bit of a byte as continuation bit, to indicate if the value continues in the next byte or not (7 bits left for encoding)