Lecture 7 - Information Retrieval Flashcards
An IR system vs Database can also be phrased as
Unstructured vs Structured data
Structured data tends to refer to information in…
Tables
Describe the Boolean Retrieval Model
In the Boolean Retrieval Model, a query is a Boolean expression over terms (AND, OR, NOT), and the system checks whether each document satisfies it.
It is absolute (Document matches condition or not)
What is the issue with Boolean Retrieval Model?
With bigger collections, the term-document matrix becomes too large to store and process
Example:
- N = 1M documents, each with about 1000 words → around 6 GB of data.
- If there are 500K distinct terms in these documents, then a matrix of size 500K × 1M will have half a trillion 0’s and 1’s
- But no more than one billion 1’s
- Matrix is extremely sparse
- What’s a better representation?
- We only store the ‘1’ positions → Inverted Index
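The arithmetic behind the example above can be checked with a short sketch. The variable names are illustrative, and the bound on the number of 1's assumes the worst case in which every word in a document is a distinct term.

```python
# Figures taken from the example above.
n_docs = 1_000_000        # N = 1M documents
words_per_doc = 1_000     # about 1000 words each
n_terms = 500_000         # 500K distinct terms

# One cell per (term, document) pair.
matrix_cells = n_terms * n_docs

# Upper bound on 1's: each word contributes at most one new 1
# to its document's column (assumes no repeated words at all).
max_ones = n_docs * words_per_doc

# Fraction of cells that can be 1 at most.
density = max_ones / matrix_cells

print(matrix_cells)        # 500000000000 (half a trillion)
print(max_ones)            # 1000000000 (one billion)
print(f"{density:.1%}")    # 0.2%
```

At most 0.2% of the cells hold a 1, which is why storing only the 1 positions pays off.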
Also:
Boolean queries often result in either too few (=0) or too many (1000s) results
It takes a lot of skill to come up with a query that produces a manageable number of hits
- AND gives too few; OR gives too many
What is an Inverted Index?
An inverted index keeps a dictionary of terms and, for each term t, stores a list of all documents that contain t
I.e., the Boolean Retrieval Model asks of each document whether it matches, whereas an Inverted Index starts from the term and lists the documents that contain it
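As a minimal sketch of how Boolean queries run against an inverted index, the snippet below uses a tiny hand-written index. The document IDs, the terms, and the helper names `intersect` and `union` are illustrative assumptions, not part of the lecture.

```python
# Hand-written toy index: term -> sorted list of document IDs (postings).
index = {
    "friends": [1, 3],
    "romans": [1, 2],
    "lend": [2, 3],
}

def intersect(p1, p2):
    """AND: walk two sorted postings lists in step, keeping shared IDs."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: every document ID that appears in either postings list."""
    return sorted(set(p1) | set(p2))

print(intersect(index["friends"], index["romans"]))  # [1]
print(union(index["friends"], index["romans"]))      # [1, 2, 3]
```

Even on three documents, AND narrows the result to one hit while OR returns everything, which mirrors the "too few or too many" problem.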
What are the steps of constructing an inverted index?
Documents to be indexed -> Tokenizer -> Linguistic Modules -> Indexer -> Inverted Index
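The pipeline above could be sketched roughly as follows. The function names are hypothetical, and the "linguistic modules" step is reduced to lowercasing as a stand-in; real systems typically also apply steps such as stemming or stop-word handling.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Tokenizer: split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Stand-in for the linguistic modules (assumption: lowercasing only)."""
    return token.lower()

def build_index(docs):
    """Indexer: map each normalized term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
}
index = build_index(docs)
print(index["friends"])  # [1]
print(index["caesar"])   # [2]
```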
What is meant by phrase queries?
We want to be able to answer queries such as “Stanford university” as a phrase.
For phrase queries, is it sufficient to store only term-to-document entries?
No.
What is meant by biword indexes?
Instead of storing just single words, we now store biwords
Example:
“Friends, Romans, Countrymen” would now store:
- friends romans
- romans countrymen
This allows us to process two-word phrase queries directly
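A minimal sketch of biword indexing, assuming simple lowercased alphanumeric tokens. The helper names `biwords` and `build_biword_index` are illustrative.

```python
import re
from collections import defaultdict

def biwords(text):
    """Return every pair of consecutive tokens as a single 'biword' term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    """Map each biword to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            postings[bw].add(doc_id)
    return {bw: sorted(ids) for bw, ids in postings.items()}

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']

index = build_biword_index({1: "Friends, Romans, Countrymen"})
print(index["friends romans"])  # [1]
```

A two-word phrase query then becomes a single dictionary lookup.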
How can we do longer phrase queries using biwords?
Longer phrases can be processed by breaking them down
Example:
“Stanford university palo alto” can be broken into the boolean query on biwords:
- stanford university AND university palo AND palo alto
Without looking at the documents themselves, we cannot verify that the docs matching the above Boolean query actually contain the phrase, so false positives are possible
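The sketch below illustrates the decomposition and shows how a false positive can arise. The two toy documents are invented for illustration, and the final check against the raw text is there only to expose the false positive; the biword index alone has no way to perform it.

```python
def biwords(tokens):
    """Consecutive token pairs joined into single biword terms."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Invented toy documents (assumption, for illustration only).
docs = {
    1: "stanford university palo alto is in california",
    2: "the university palo alto hosts is stanford university",
}

# Biword index: biword -> set of document IDs.
index = {}
for doc_id, text in docs.items():
    for bw in biwords(text.split()):
        index.setdefault(bw, set()).add(doc_id)

def biword_candidates(phrase):
    """AND together the postings of every biword in the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase.split())]
    return sorted(set.intersection(*postings)) if postings else []

phrase = "stanford university palo alto"
candidates = biword_candidates(phrase)

# Check against the original text, padded with spaces to respect word boundaries.
confirmed = [d for d in candidates if f" {phrase} " in f" {docs[d]} "]

print(candidates)  # [1, 2]
print(confirmed)   # [1]
```

Document 2 contains all three biwords but never the four words in sequence, so it is a false positive of the biword query.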
What are the issues with biword indexes?
False Positives
Index blowup due to bigger dictionary
- Infeasible for phrases longer than two words, and large even for biwords
Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Name an alternative to biword indexes
Positional Indexes
Is a positional index larger or smaller than a non-positional index?
Substantially larger
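A positional index records, for each term and document, the positions at which the term occurs, which is where the extra size comes from. The following is a rough sketch under simplifying assumptions (whitespace tokenization, lowercasing only); the function names and toy documents are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id -> [positions of the term in that document]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return documents in which the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Documents that contain every term at least once.
    common = set(index[terms[0]])
    for t in terms[1:]:
        common &= set(index[t])
    hits = []
    for doc_id in sorted(common):
        # Try each occurrence of the first term as the start of the phrase.
        for start in index[terms[0]][doc_id]:
            if all((start + k) in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "stanford university palo alto is in california",
    2: "the university palo alto hosts is stanford university",
}
index = build_positional_index(docs)
print(phrase_query(index, "stanford university palo alto"))  # [1]
print(phrase_query(index, "university palo alto"))           # [1, 2]
```

Because positions are checked, document 2 is correctly rejected for the four-word phrase even though it contains every individual word.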
Can you combine Biword Indexes and Positional Indexes?
Yes
What is the difference between Boolean Retrieval Models and Ranked Retrieval Models?
Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query
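To make the contrast concrete, here is a small sketch that runs the same query both ways. The scoring function (total count of query-term occurrences) is an assumption chosen for simplicity, not a method named in the lecture, and the documents are invented.

```python
from collections import Counter

# Invented toy collection (assumption, for illustration only).
docs = {
    1: "stanford university is in california",
    2: "palo alto is near stanford and stanford university",
    3: "rome is in italy",
}

def boolean_and(query):
    """Boolean retrieval: the unordered set of docs containing every query term."""
    terms = query.split()
    return {d for d, text in docs.items()
            if all(t in text.split() for t in terms)}

def ranked(query, k=2):
    """Ranked retrieval: top-k docs ordered by a score.

    Score = number of query-term occurrences (a deliberately naive assumption).
    Ties are broken by document ID.
    """
    terms = query.split()
    scores = {}
    for d, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores[d] = score
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

print(boolean_and("stanford palo"))  # {2}
print(ranked("stanford palo"))       # [2, 1]
```

The Boolean query returns only the exact match, while the ranked version also surfaces the partially matching document 1, placed below document 2.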