Lecture 7 - Information Retrieval Flashcards
An IR system vs. a database can also be phrased as
Unstructured vs Structured data
Structured data tends to refer to information in…
Tables
Describe the Boolean Retrieval Model
The Boolean retrieval model determines whether a query's terms are present in each document of the collection.
It is absolute (Document matches condition or not)
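As a toy illustration (the two-document collection and the query here are made up), the "in or out" semantics can be sketched with Python set operations:

```python
# Hypothetical two-document collection for illustration.
docs = {
    1: {"brutus", "caesar", "calpurnia"},
    2: {"brutus", "caesar"},
}

# Boolean query: brutus AND caesar AND NOT calpurnia.
# A document either satisfies the condition or it does not.
hits = {d for d, terms in docs.items()
        if "brutus" in terms and "caesar" in terms and "calpurnia" not in terms}
print(hits)  # {2}
```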
What is the issue with Boolean Retrieval Model?
With bigger collections, representing matches as a term-document incidence matrix becomes computationally infeasible
Example:
- N = 1M documents, each with about 1,000 words → around 6 GB of data.
- If there are 500K distinct terms in these documents, then a matrix of size 500K × 1M will have half a trillion 0’s and 1’s
- But no more than one billion 1’s
- Matrix is extremely sparse
- What’s a better representation?
- We only store the ‘1’ positions → Inverted Index
Also:
Boolean queries often result in either too few (=0) or too many (1000s) results
It takes a lot of skill to come up with a query that produces a manageable number of hits
- AND gives too few; OR gives too many
What is an Inverted Index?
An inverted index keeps a dictionary of terms and, for each term t, stores a list of all documents that contain t
I.e., you can say that the Boolean retrieval model (term-document matrix) is from the perspective of the document, whereas an inverted index is from the perspective of the term
What are the steps of constructing an inverted index?
Documents to be indexed -> Tokenizer -> Linguistic Modules -> Indexer = Inverted Index
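A minimal Python sketch of this pipeline, assuming the tokenizer and linguistic modules are simplified to lowercasing and whitespace splitting (a real indexer does much more: stemming, stop-word handling, etc.):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc_id -> text. Returns term -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # tokenizer + trivial normalization
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Example usage with a made-up collection:
index = build_inverted_index({1: "new home sales", 2: "home sales rise"})
print(index["sales"])  # [1, 2]
# An AND query is an intersection of postings lists:
print(sorted(set(index["home"]) & set(index["rise"])))  # [2]
```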
What is meant by phrase queries?
We want to be able to answer queries such as “Stanford university” as a phrase.
For phrase queries, is it sufficient to store term → documents entries?
No.
What is meant by biword indexes
Instead of storing just single words, we now store biwords (pairs of consecutive words in the document)
Example:
“Friends, Romans, Countrymen” would now store:
- friends romans
- romans countrymen
This allows us to do two-word phrase query-processing
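A small sketch of biword extraction, assuming simple whitespace tokenization and punctuation stripping:

```python
def biwords(text):
    tokens = text.lower().replace(",", "").split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```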
How can we do longer phrase queries using biwords?
Longer phrases can be processed by breaking them down
Example:
“Stanford university palo alto” can be broken into the boolean query on biwords:
- stanford university AND university palo AND palo alto
Without inspecting the documents themselves, we cannot verify that the docs matching the above Boolean query actually contain the phrase
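A sketch of that decomposition (the helper below mirrors the biword extraction above; docs matching the resulting query are only candidates and must be post-filtered):

```python
def biwords(text):
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def phrase_to_biword_query(phrase):
    # Docs matching all biwords are only *candidates*: false positives possible.
    return " AND ".join(biwords(phrase))

print(phrase_to_biword_query("Stanford university palo alto"))
# stanford university AND university palo AND palo alto
```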
What are the issues for biwords indexes?
False Positives
Index blowup due to bigger dictionary
- Infeasible for more than biwords, big even for them
Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Name an alternative to biword indexes
Positional Indexes
Is a positional index larger or smaller than a non-positional index
Substantially larger
Can you combine Biword Indexes and Positional Indexes?
Yes
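A naive sketch of a positional index with phrase-query matching (illustrative only; real systems merge postings lists far more efficiently):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    """Return doc_ids in which the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if any(w not in index for w in words):
        return set()
    result = set()
    for doc_id, positions in index[words[0]].items():
        for p in positions:
            # Each following word must appear at the next position over.
            if all(doc_id in index[w] and p + i in index[w][doc_id]
                   for i, w in enumerate(words[1:], start=1)):
                result.add(doc_id)
                break
    return result

idx = build_positional_index({1: "stanford university palo alto",
                              2: "palo alto university"})
print(phrase_match(idx, "palo alto"))            # {1, 2}
print(phrase_match(idx, "stanford university"))  # {1}
```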
What is the difference between Boolean Retrieval Models and Ranked Retrieval Models?
Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query
What is meant by “Free text queries”?
Rather than a query language of operators and expressions (e.g. SQL), the user’s query is just one or more words in a human language
What are some ways of scoring as the basis of ranked retrieval?
Term Frequency
-> Log-frequency weighting
IDF weight
TF-IDF weighting
Explain Term Frequency
Term frequency is the number of occurrences of a term within a document.
The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d
Is raw term frequency sufficient?
No: A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. But not 10 times more relevant -> Relevance does not increase proportionally with term frequency
Therefore, we use log-frequency weighting
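A common formulation of the log-frequency weight (base-10 log, as in standard IR textbooks):

```latex
w_{t,d} =
\begin{cases}
1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\
0, & \text{otherwise}
\end{cases}
```

So tf = 1 gives weight 1, tf = 10 gives 2, and tf = 1000 gives 4: the weight grows with frequency, but not proportionally.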
Explain Document Frequency
Document frequency is the number of documents containing a particular term
What is idf an acronym for?
Inverse document frequency
Explain idf
The inverse document frequency (idf) is a statistical weight used for measuring the importance of a term in a text document collection. It is derived from the document frequency df_t (the number of documents in which term t appears): the rarer the term, the higher its idf (see the formula below).
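The usual definition, where N is the total number of documents in the collection:

```latex
\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)
```

The log dampens the effect: a term appearing in 1 of 1M documents gets idf 6, while a term appearing in every document gets idf 0.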
Why is document frequency insufficient?
We want high weight for rare terms
Rare terms are more informative than frequent terms
- Recall stop words (the, are, of)
- Also terms like high, car, increase, line are not so relevant
Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric
What is the effect of idf on ranking for one-term queries, like “iphone”?
idf has no effect on ranking one-term queries
- idf affects the ranking of documents for queries with at least two terms
- For the query “capricious person”, idf weighting makes occurrences of “capricious” count for much more in the final document ranking than occurrences of “person”
Explain tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight
-> This is the best known weighting scheme in information retrieval
(the hyphen in tf-idf is a hyphen, NOT a minus sign)
- Increases with number of occurrences within a document
- Increases with the rarity of the term in the collection
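Putting the two weights together (one common variant; tf and idf each come in several variants):

```latex
w_{t,d} = \left(1 + \log_{10} \mathrm{tf}_{t,d}\right) \times \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right),
\qquad
\mathrm{Score}(q,d) = \sum_{t \in q \cap d} w_{t,d}
```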
When processing documents, we typically store these in…
Vectors
- |V| dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
-> Very high dimensional (Tens of millions of dimensions when you apply this to a web search engine)
Could we store queries as vectors as well?
Yes!
There are two key ideas:
1. Do the same for queries: represent them as vectors in the space
2. Rank documents according to their proximity to the query in this space
Why would we like to store queries as vectors?
We do this because we want to get away from the “You’re either in or out” boolean model
Instead: Rank more relevant documents higher than less relevant documents
Why is distance a bad idea to estimate proximity between two vectors?
Euclidean distance is large for vectors of different lengths: e.g., a document d and the document d′ formed by appending d to itself have very similar term distributions, but a large Euclidean distance. See notion for a graph
What should we use instead of euclidean distance to measure proximity?
We use the angle between the vectors instead
We should rank documents according to their angle with the query -> E.g., for a document d and the document d′ formed by appending d to itself, the angle between the two is 0, corresponding to maximal similarity
How are cosines related to text similarity?
You can use cosines to compute similarity between two words or documents or documents and queries etc
How does the cosine function for a query and document work?
See notion for formula
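The standard cosine formula is cos(q, d) = (q · d) / (|q| |d|), i.e., the dot product of the length-normalized vectors. A small sketch using plain dicts as sparse vectors (the weights below are made-up toy values):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy tf-idf weights, made up for illustration:
query = {"capricious": 2.3, "person": 0.1}
doc   = {"capricious": 1.8, "person": 0.2, "story": 0.9}
print(round(cosine(query, doc), 3))
```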
How can you evaluate information retrieval?
Precision: Fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: Fraction of relevant docs that are retrieved = P(retrieved|relevant)
Why shouldn’t you rely on accuracy alone when evaluating information retrieval?
In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category
What is a combined measure that you can use to assess the precision/recall
F score (weighted harmonic mean of precision and recall)
People usually use the balanced F1 measure (with β = 1, i.e., α = 1/2)
β < 1 emphasizes precision; β > 1 emphasizes recall
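The weighted harmonic mean in formula form, with β² = (1 − α)/α; β = 1 yields the balanced F1:

```latex
F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}}
  = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R},
\qquad
F_1 = \frac{2PR}{P + R}
```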
What can you use a precision-recall curve for?
Seeing how to balance precision vs recall
-> However, you need to see the precision-recall curve for a whole bunch of queries