01 Boolean Retreival Flashcards

1
Q

Why is grep not the solution?

A
  • Slow (for large collections)
  • grep is line-oriented, IR is document-oriented
  • Other operations (e.g., find the word Romans near
    countryman) not feasible
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Why incidence matrix is not good?

A
  • Matrix is extremely sparse.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a better representation than incidence matrix?

A

Inverted index: only record the 1s

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is inverted index?

A
  • For each term t, we store a list of all documents that contain t.
  • Consists of dictionary and postings
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How is inverted index constructed?

A

1.Collect the documents
2. Tokenize
3. Linguistic preprocessing
4. Index the documents > sort > create posting list and count document frequency

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the runtime for conjuctive query (AND) ?

A

linear in the sum length of the postings lists

only works if postings lists are sorted

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are boolean queries?

A
  • queries that use and, or and not to join
    query terms
  • Views each document as a set of terms
  • Is precise: Document matches condition or not
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Name commercially successful Boolean retrieval systems

A
  1. Westlaw
  2. PubMed
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the best order for processing a query
that is an and of n terms, n > 2?

Example query: Brutus AND Calpurnia AND Caesar

A

Process in order of increasing frequency

Start with the shortest postings list, then keep cutting further

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How to process :
(madding or crowd) and (ignoble or strife) and. . .

A
  • Get frequencies for all terms
  • Estimate the size of each or by the sum of its frequencies (conservative)
  • Process in increasing order of or sizes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly