Week 3 Flashcards

1
Q

Two methods to optimize a TR system

A

Compression

  • The document index consumes a huge amount of memory/disk
  • Losslessly compress the index (stored as integers) to save disk space and I/O

Caching

  • Cache sits between the front end and the index on disk
  • Saves I/O
2
Q

What is a tokenizer?

A
  • Represents a doc in the TR system
  • Segments a raw string to count terms
    Input: My name is Mustafa
    Output: { my: 1, name: 1, is: 1, mustafa: 1 }
  • Assigns term and doc IDs
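
The segmentation step above can be sketched in a few lines of Python, assuming simple whitespace splitting and lowercasing (the function name is illustrative, not from the course material):

```python
def tokenize(text):
    """Segment a raw string into lowercase tokens and count term occurrences."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts
```

Real tokenizers also handle punctuation, stemming, and stop words; this sketch covers only the count step from the card.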
3
Q

What are tokenized docs represented by?

A

Feature vectors; the process is called feature generation

4
Q

What is an indexer?

A

Loads only part of the raw corpus into memory at a time

Builds the index so queries can be parsed fast enough for a usable search engine

5
Q

What is an inverted index?

A

Maps each term to the documents that contain it, giving quick lookup of documents given query terms; makes full-text search possible
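
A minimal sketch of the idea in Python, mapping each term to the list of doc IDs containing it (names are illustrative):

```python
def build_inverted_index(docs):
    """Build term -> [doc IDs] from a list of document strings."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):       # set(): one entry per doc per term
            index.setdefault(term, []).append(doc_id)
    return index
```

A query term is then answered with `index.get(term, [])` instead of scanning every document.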

6
Q

What is TF-IDF? Reduced what? Measures what?

A

A semantic representation method that reduces the side effects of bag-of-words (TF) counting

  • TF over-emphasizes word counts, so high-frequency terms become over-weighted semantic features
  • IDF measures a term's role in the entire set of documents

tf(t,d) = f_d(t) / max_w f_d(w)
idf(t,D) = ln( |D| / |{d ∈ D : t ∈ d}| )
tfidf(t,d,D) = tf(t,d) × idf(t,D)
f_d(t) = frequency of term t in document d; D = corpus of documents
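
The three formulas translate directly into Python (a sketch; docs are modeled as token lists, function names are illustrative):

```python
import math

def tf(term, doc):
    """Normalized TF: count of term divided by the max term count in the doc."""
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    return counts.get(term, 0) / max(counts.values())

def idf(term, corpus):
    """IDF: ln(|D| / number of docs containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)
```

Note that a term appearing in every document gets idf = ln(1) = 0, which is exactly how common terms are downgraded.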

7
Q

What is compression? What does it exploit? What does it support?

A

Compresses the large posting file; the compressed index is both smaller and faster.
Exploits the non-uniform distribution of values
Supports random-access decoding
Inverted index entries are sorted by doc ID, so gaps can be stored instead:
- DocIDs = {23, 25, 34, 35, 39, 43}
- Compressed = {23, 2, 9, 1, 4, 4}
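
The gap (delta) transformation on the sorted doc IDs can be sketched as (illustrative names):

```python
def gap_encode(doc_ids):
    """Store the first doc ID, then differences between consecutive IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    """Recover doc IDs by a running sum over the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids
```

The gaps are much smaller numbers than the raw IDs, which is what makes the subsequent integer coding (gamma, variable-byte, etc.) pay off.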

8
Q

What is bitwise compression?

A

Uses raw binary numbers with no fixed length or width.

Uses unary encoding, gamma encoding, or delta encoding
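
As one concrete instance, Elias gamma coding writes a unary length prefix followed by the number's binary digits. A sketch over bit strings (real implementations pack actual bits):

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: (len-1) zeros, then n in binary."""
    binary = bin(n)[2:]                    # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode one gamma-coded integer from a well-formed bit string."""
    zeros = 0
    while bits[zeros] == "0":              # count the unary length prefix
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)
```

Small numbers (the common case for gaps) get very short codes: 1 is a single bit.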

9
Q

What is block compression?

A

Reads bytes at a time rather than bits
Only one bitwise operation per byte is required
Reduces the number of CPU instructions in decoding by using more memory
The decoder sums each byte; bytes are chained backwards.
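
Variable-byte coding is the standard example of this byte-aligned approach: 7 data bits per byte, with the high bit marking the last byte of a number (a sketch with illustrative names):

```python
def vbyte_encode(n):
    """Variable-byte code: 7 data bits per byte; high bit set on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                         # terminator bit on the final byte
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of variable-byte coded integers."""
    nums, n = [], 0
    for b in data:
        if b < 128:
            n = n * 128 + b                # continuation byte
        else:
            nums.append(n * 128 + (b - 128))
            n = 0
    return nums
```

Decoding touches whole bytes with one comparison and a shift-like multiply each, which is why it needs far fewer CPU instructions than bitwise codes.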

10
Q

What is caching?

A
  • Keep frequently accessed objects in memory (e.g., the posting table)
  • Two goals:
    • The most frequent terms are fast to access
    • A maximum size is set for the cache
11
Q

What is Zipf’s law

A

Term frequency is inversely proportional to rank, so a relatively small number of posting lists will be accessed most of the time

12
Q

Least recently used cache

A

When the posting list for term ID x is to be retrieved:

  1. Search the cache for the term ID
  2. If it exists, return the posting list for the ID
  3. If not, retrieve it from disk and add it to the cache
  4. If the cache size is exceeded, remove the least recently used posting list
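
The four steps map directly onto Python's `OrderedDict` (a sketch; `fetch_from_disk` is a hypothetical loader, not from the course material):

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache for posting lists, keyed by term ID."""

    def __init__(self, capacity, fetch_from_disk):
        self.capacity = capacity
        self.fetch = fetch_from_disk           # fallback loader on a miss
        self.cache = OrderedDict()

    def get(self, term_id):
        if term_id in self.cache:
            self.cache.move_to_end(term_id)    # mark as most recently used
            return self.cache[term_id]
        postings = self.fetch(term_id)         # miss: retrieve from disk
        self.cache[term_id] = postings
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return postings
```

`OrderedDict` keeps insertion order, so the front of the dict is always the least recently used entry.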
13
Q

Double barrel LRU cache

A

A simplified approximation of an LRU cache
Uses two hash tables: primary + secondary
When term ID x is to be retrieved:
1. Search the primary table; if the ID exists, return it
2. If not, search the secondary table
3. If in the secondary, delete it from the secondary, insert it into the primary, then return it
4. If the primary's size limit is exceeded, empty the secondary table and swap the tables
5. If not in the secondary either, retrieve from disk and insert into the secondary
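
A sketch of the scheme above, assuming misses are loaded into the secondary table (`fetch` is a hypothetical disk loader):

```python
class DoubleBarrelCache:
    """Double-barrel approximation of LRU using two plain hash tables."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch
        self.primary = {}
        self.secondary = {}

    def get(self, term_id):
        if term_id in self.primary:
            return self.primary[term_id]
        if term_id in self.secondary:
            postings = self.secondary.pop(term_id)
            self._insert_primary(term_id, postings)   # promote hot entries
            return postings
        postings = self.fetch(term_id)                # miss: load from disk
        self.secondary[term_id] = postings
        return postings

    def _insert_primary(self, term_id, postings):
        self.primary[term_id] = postings
        if len(self.primary) > self.capacity:
            self.secondary.clear()                    # drop cold entries
            self.primary, self.secondary = self.secondary, self.primary
```

Only re-accessed entries survive a swap, which approximates "recently used" without tracking per-entry timestamps.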

14
Q

Scorer / Ranker

A

The component that scores and ranks documents; its core task is the design of the ranking function

15
Q

Design ranking function, key challenge

A

f(q, d)
q = query, d = document
A good ranking function ranks relevant documents above non-relevant ones

Key challenge: find a way to measure the likelihood that document d is relevant to query q

16
Q

Types of retrieval models

A

Boolean model, similarity-based model, probabilistic model, probabilistic inference model, axiomatic model

17
Q

What is the Boolean model

A

Treats the query and documents as sets of terms

Uses Boolean expressions to join and match terms

18
Q

What is the similarity based model

A

f(q, d) = similarity(q, d)

Main example: the vector space model

19
Q

Probabilistic models

A

Classic probabilistic
Language model
Divergence-from-randomness model

20
Q

What are common ideas of TR models

A

Tokenization, term frequency, document length, document frequency.

21
Q

Similarity based vector space model

A

Represents a doc/query by a term vector
A term denotes a word/phrase/concept and defines one dimension; N terms define an N-dimensional space
Relevance is the similarity between the query vector and the doc vector
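
The usual similarity in VSM is the cosine of the angle between the two term vectors; a sketch over sparse dict vectors (illustrative names):

```python
import math

def cosine_similarity(q, d):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```

Vectors sharing no terms score 0; parallel vectors score 1, regardless of document length.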

22
Q

Challenges of VSM

A

Matching keywords with concrete meaning deserves more credit

Matching common/auxiliary words should be downgraded

23
Q

Solutions of VSM

A

Use term frequency (TF) weighting: more occurrences of key words give more weight

Add inverse document frequency (IDF) to downgrade the credit of common terms

24
Q

What is TF transformation BM25

A
  • A sublinear TF transformation to avoid dominance by very frequent terms

  • Has an upper bound to prevent the score from rising too high
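
The standard BM25 TF transformation is (k + 1) · tf / (tf + k), which grows sublinearly in tf and is bounded above by k + 1. A one-line sketch (the default k = 1.2 is a common choice, not from the card):

```python
def bm25_tf(tf, k=1.2):
    """Sublinear BM25 TF transformation; bounded above by k + 1."""
    return (k + 1) * tf / (tf + k)
```

The full BM25 formula additionally normalizes by document length; this sketch shows only the TF saturation the card describes.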

25
Q

TR System structure

A

Tokenizer -> Stemming -> Indexing -> Feedback -> VSM feedback -> LM feedback

26
Q

TR evaluation methods

A

Precision: among retrieved docs, how many are relevant
Recall: among truly relevant docs, how many are retrieved
Best result: precision = recall = 1
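
Both measures reduce to one set intersection; a sketch over sets of doc IDs (illustrative names):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall given retrieved and truly relevant doc IDs."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Precision divides the hits by what was retrieved; recall divides the same hits by what should have been retrieved.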