Week 3 Flashcards
Two methods to optimize TR system
Compression
- a document index consumes a huge amount of memory/disk
- losslessly compress the index (stored as integers) to save disk space and I/O
Caching
- Cache sits between the front end and the index on disk
- Saves I/O
What is a Tokenizer
- represents a doc in the TR system
- segments the raw string into terms and counts them
Input - My name is Mustafa
Output - { my: 1, name: 1, is: 1, mustafa: 1 } - Assign term and doc IDs
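A minimal sketch of the tokenization step above in Python (lowercasing and whitespace splitting are assumptions; a real tokenizer also handles punctuation and stemming):

```python
from collections import Counter

def tokenize(doc: str) -> Counter:
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(doc.lower().split())

counts = tokenize("My name is Mustafa")
# counts == {'my': 1, 'name': 1, 'is': 1, 'mustafa': 1}
```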
What are tokenizer docs represented by
Vector features, process is called feature generation
What is Indexer
Loads only part of the raw corpus into memory at a time
Builds the index so queries can be answered fast enough for a usable search engine
What is inverted index
Quick lookup of documents given query terms - makes full-text search possible
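A toy inverted index in Python to illustrate the term-to-document mapping (the whitespace tokenization and integer doc IDs are illustrative assumptions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["my name is Mustafa", "my search engine"]
index = build_inverted_index(docs)
# index["my"] == [0, 1]; index["mustafa"] == [0]
```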
What is TF-IDF? Reduced what? Measures what?
Semantic representation method that reduces a side effect of bag of words (raw TF)
- TF over-emphasizes word counts, so high-frequency terms become over-weighted semantic features
- IDF measures a term's role in the entire set of documents
tf(t,d) = f_d(t) / max_w f_d(w)
idf(t,D) = ln( |D| / |{d ∈ D : t ∈ d}| )
tfidf(t,d,D) = tf(t,d) × idf(t,D)
f_d(t) = frequency of term t in document d, D = corpus of documents
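The three formulas above, sketched in Python; documents are assumed to be term-to-count dicts (as produced by a tokenizer), and idf assumes the term appears in at least one document so the denominator is nonzero:

```python
import math

def tf(t, d):
    """Term frequency normalized by the most frequent term in d."""
    return d.get(t, 0) / max(d.values())

def idf(t, D):
    """Log of corpus size over the number of documents containing t."""
    df = sum(1 for d in D if t in d)   # assumes df >= 1
    return math.log(len(D) / df)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)
```

For example, with D = [{"a": 2, "b": 1}, {"b": 3}], tfidf("a", D[0], D) = (2/2) × ln(2/1) = ln 2.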
What is compression? What does it exploit? What does it support?
Compresses the large postings file; a compressed index is both smaller and faster to read.
Exploits the non-uniform distribution of values
Supports random-access decoding
Inverted index entries sorted by DocID
- DocID = {23,25,34,35,39,43}
- Compressed {23,2,9,1,4,4}
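The gap (delta) encoding shown above can be sketched in Python; storing differences between sorted DocIDs yields small integers that compress well:

```python
def gap_encode(doc_ids):
    """Keep the first DocID, then store differences between neighbors."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Rebuild DocIDs with a running sum over the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

gap_encode([23, 25, 34, 35, 39, 43])  # [23, 2, 9, 1, 4, 4]
```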
What is bitwise compression
Uses raw binary numbers with no fixed length or width.
Performs unary encoding, gamma encoding, or delta encoding
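A sketch of Elias gamma coding (one of the bitwise schemes named above), using strings of '0'/'1' for readability; a real implementation would pack bits:

```python
def gamma_encode(x: int) -> str:
    """Elias gamma: N zeros, then the (N+1)-bit binary form of x (x >= 1)."""
    b = bin(x)[2:]                    # binary form, leading '1' included
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str) -> int:
    """Count leading zeros N, then read the next N+1 bits as binary."""
    n = bits.index("1")
    return int(bits[n:2 * n + 1], 2)

gamma_encode(9)  # '0001001'
```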
What is Block compression
Reads bytes at a time rather than bits
Only one bitwise operation per byte is required
Reduces the # of CPU instructions in decoding by using more memory
Decoder sums each byte; bytes are chained backwards.
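A sketch of variable-byte encoding, a standard byte-aligned scheme of this kind: each byte carries 7 payload bits, and the high bit marks the final byte of a number (the exact bit convention varies between implementations):

```python
def vbyte_encode(n: int) -> bytes:
    """Variable-byte: 7 payload bits per byte; high bit flags the last byte."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out.reverse()
    out[-1] |= 0x80                   # terminator bit on the final byte
    return bytes(out)

def vbyte_decode(data: bytes):
    """Accumulate 7 bits per byte; emit a number when the high bit is set."""
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:
            nums.append(n)
            n = 0
    return nums

vbyte_decode(vbyte_encode(300) + vbyte_encode(5))  # [300, 5]
```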
What is caching
- keep frequently accessed objects in memory (e.g., the posting table)
- Two goals
- most frequent terms are fast to access
- set a max size for cache
What is Zipf’s law
A relatively small number of posting lists will be accessed most of the time
Least recently used cache
When the posting list for term ID x is to be retrieved:
- Search the cache for the term ID
- If it exists, return its posting list
- If not, retrieve it from disk and add it to the cache
- If the cache size is exceeded, remove the least recently used posting list
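The steps above can be sketched with Python's OrderedDict; `fetch_fn` stands in for the disk read (its name and signature are assumptions for illustration):

```python
from collections import OrderedDict

class LRUPostingCache:
    """LRU cache mapping term IDs to posting lists."""
    def __init__(self, capacity, fetch_fn):
        self.capacity = capacity
        self.fetch_fn = fetch_fn      # simulates reading a posting list from disk
        self.cache = OrderedDict()

    def get(self, term_id):
        if term_id in self.cache:
            self.cache.move_to_end(term_id)   # mark as most recently used
            return self.cache[term_id]
        postings = self.fetch_fn(term_id)     # cache miss: go to disk
        self.cache[term_id] = postings
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return postings
```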
Double barrel LRU cache
Simplified approximation of LRU Cache
Use two hash tables - primary + secondary
When term ID x is to be retrieved:
1. Search the primary table; if the ID exists, return its posting list
2. If not, search the secondary table
3. If found in the secondary table, delete it from secondary, insert it into primary, then return it
4. If the primary size is exceeded, empty the secondary table and swap the two tables
5. If not in the secondary table, retrieve from disk and insert into secondary
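A sketch of the double-barrel scheme following the steps above; as before, `fetch_fn` is an assumed stand-in for the disk read:

```python
class DoubleBarrelCache:
    """Approximate LRU using two hash tables (primary + secondary)."""
    def __init__(self, capacity, fetch_fn):
        self.capacity = capacity      # max size of the primary table
        self.fetch_fn = fetch_fn
        self.primary = {}
        self.secondary = {}

    def get(self, term_id):
        if term_id in self.primary:               # hit in primary
            return self.primary[term_id]
        if term_id in self.secondary:             # promote secondary -> primary
            postings = self.secondary.pop(term_id)
            self.primary[term_id] = postings
            if len(self.primary) > self.capacity:
                self.secondary.clear()            # empty secondary, then swap:
                self.primary, self.secondary = self.secondary, self.primary
            return postings
        postings = self.fetch_fn(term_id)         # miss: disk read -> secondary
        self.secondary[term_id] = postings
        return postings
```

Promotion on a second access approximates "recently used" without per-item bookkeeping; the swap discards the stale half wholesale instead of evicting one entry at a time.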
Scorer / Ranker
Design the ranking function - the key challenge of a TR system
f(q,d)
q = query, d = document
A good ranking function ranks relevant documents above non-relevant ones
Key challenge - find a way to measure the likelihood that document d is relevant to query q