Boolean Retrieval Model Flashcards
What is the boolean retrieval model?
A method of information retrieval that can answer any query that is a boolean expression
What is a boolean query?
Queries using AND, OR, and NOT to join query terms
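As a minimal sketch of how such queries can be evaluated, the example below keeps one set of document IDs per term and maps AND, OR, and NOT onto set operations. The helper `build_index` and the three sample documents are assumptions made up for illustration, and whitespace tokenization stands in for real text processing.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy collection, used only for illustration.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised calpurnia",
    3: "brutus and calpurnia spoke",
}
index = build_index(docs)
all_docs = set(docs)

# brutus AND caesar: documents containing both terms.
both = index["brutus"] & index["caesar"]
# brutus OR caesar: documents containing at least one of the terms.
either = index["brutus"] | index["caesar"]
# brutus AND NOT caesar: NOT is the complement with respect to all documents.
brutus_only = index["brutus"] & (all_docs - index["caesar"])
```

Each result is a plain set of matching documents, with no notion of how well a document matches.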
What is a limitation of the boolean retrieval model?
It only records whether a document matches the query or not; it keeps no additional information such as term frequency or proximity.
What is a phrase query?
A query made of several tokens that must appear together, in order, as a phrase.
Ex: “University of Toronto”
Why can’t the classic boolean model perform phrase queries?
It doesn’t record information about proximity of terms to one another
What is a biword index?
An index that treats every consecutive pair of terms in the text as a single dictionary term
What are the benefits of biword indexes?
- Two-word phrase queries are answered immediately with a single lookup
- Longer phrases can be searched by using the biwords as partial proximity information
How can longer phrase queries be processed with a biword index?
Break the query into its consecutive biwords, run a separate query for each biword, and AND the results together
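The decomposition can be sketched as below. The function names (`biwords`, `build_biword_index`, `phrase_candidates`) and the sample documents are hypothetical, and tokenization is a plain whitespace split. The result is only a candidate set: a document can contain every biword of the phrase without containing the phrase itself.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive term pairs of a text, in order."""
    terms = text.lower().split()
    return list(zip(terms, terms[1:]))

def build_biword_index(docs):
    """Map each biword (a pair of terms) to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """AND together the postings of every biword in the phrase."""
    pairs = biwords(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

# Hypothetical toy collection, used only for illustration.
docs = {
    1: "stanford university palo alto campus",
    2: "university palo alto visited stanford university",
    3: "palo alto city council",
}
index = build_biword_index(docs)
candidates = phrase_candidates(index, "stanford university palo alto")
```

Here `candidates` contains documents 1 and 2. Document 2 holds all three biwords ("stanford university", "university palo", "palo alto") but never the four words in a row.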
What are the downsides of a biword index?
- False positives, because we cannot verify that the whole phrase appears as one contiguous string
- Index takes more storage due to the bigger dictionary
What is a positional index?
For each term, the postings store the document frequency and, for each document, the positions at which the term appears
How do you process a phrase query using a positional index?
- Extract index entries for each term
- Merge their doc:position lists to find the places where the terms occur at consecutive positions, in query order
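The two steps above can be sketched as follows. This is a simplified illustration, not a tuned implementation: it uses dictionaries and sets where a real system would merge sorted postings lists, and the names `build_positional_index` and `phrase_query` as well as the sample documents are assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; positions are 0-based token offsets."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return the IDs of documents that contain the phrase contiguously."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Candidate start positions per document, taken from the first term.
    matches = {doc: set(ps) for doc, ps in index[terms[0]].items()}
    for offset, term in enumerate(terms[1:], start=1):
        postings = index[term]
        next_matches = {}
        for doc, starts in matches.items():
            if doc not in postings:
                continue
            positions = set(postings[doc])
            # Keep a start only if this term sits exactly `offset` tokens later.
            kept = {s for s in starts if s + offset in positions}
            if kept:
                next_matches[doc] = kept
        matches = next_matches
    return set(matches)

# Hypothetical toy collection, used only for illustration.
docs = {
    1: "stanford university palo alto campus",
    2: "university palo alto visited stanford university",
    3: "palo alto city council",
}
index = build_positional_index(docs)
hits = phrase_query(index, "stanford university palo alto")
```

Only document 1 is returned. Document 2 contains all four words, but not at consecutive positions. The document frequency of a term is available as `len(index[term])`.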
Why are biword indexes unreliable for longer phrase queries?
False positives. The index only checks that each pair of words appears side by side somewhere in the document, not that the pairs form one contiguous sequence
What is the drawback of a positional index and why is it used regardless?
It expands postings storage substantially because we store every occurrence of each term, not just one entry per document. We use it anyway because it enables phrase and proximity queries
What are the rules of thumb about positional index size?
- A positional index is 2-4 times as large as a non-positional index
- A positional index is 35-50% of the volume of the original text
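As a rough worked example, the two rules of thumb can be turned into back-of-the-envelope estimates. The function names are hypothetical and the results are only as reliable as the rules themselves; actual sizes depend on the collection and on compression.

```python
def positional_size_from_text(text_bytes):
    """(low, high) estimate: a positional index is 35-50% of the original text."""
    return (text_bytes * 35 // 100, text_bytes * 50 // 100)

def positional_size_from_nonpositional(nonpositional_bytes):
    """(low, high) estimate: a positional index is 2-4x a non-positional one."""
    return (2 * nonpositional_bytes, 4 * nonpositional_bytes)

# Assumed inputs: 1 GB of original text, or a 100 MB non-positional index.
from_text = positional_size_from_text(1_000_000_000)
from_nonpositional = positional_size_from_nonpositional(100_000_000)
```

Under these assumptions, 1 GB of text suggests a positional index of roughly 350 to 500 MB, and a 100 MB non-positional index suggests roughly 200 to 400 MB.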
How can we combine the biword and positional index?
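Index selected biwords (for example, frequently queried phrases) as dictionary terms alongside the positional index, and use the positional index for all other phrase queries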
What is the size of a biword positional index?
About 26% bigger than a positional index alone