Information Retrieval Flashcards
Information Retrieval Systems:
Overview
- Simpler model than RDBMS
- Info organized as a collection of documents
- Documents are unstructured, generally with no schema
- Locate relevant documents based on user input such as keywords
- Can be used on textual descriptions that are provided with non-textual data
- Example: Images with descriptions
- Web Search engines are the most popular form of IR systems
Information Retrieval Systems:
Differences with RDBMS
- IR Systems organize information into unstructured documents
- Whereas database systems deal with structured data
- IR Systems don’t deal with transactional updates
- IR Systems deal with some querying issues not generally addressed by database systems
- Approximate searching by keywords
- Ranking retrieved answers
- by estimating degree of relevance
Information Retrieval Systems:
Querying issues that
database systems do not deal with
- Approximate Searching by keywords
- Ranking of retrieved answers by estimating degree of relevance
IR Systems:
Terminology
- Full Text Retrieval
- Keywords
- Keyword Queries
- Term
- Intent
- Query
- Result
Keyword Search
- In Full Text Retrieval, all the words in each document are considered to be keywords
- The word “term” refers to the words in a document
- IR Systems typically allow query expressions
- Formed using keywords and logical connectives
- and, or, not
- Ranking of documents on the basis of estimated relevance to a query is critical
IR Systems:
Returned Results
- In relational databases, there is generally no ranking on returned results
- In contrast, when dealing with Keyword Queries, many documents may match the query
- A Google search can return millions of results
- Results must be ranked so that the most relevant tuples are the first thing a user sees
- Need to determine what the user is looking for
Keyword Queries:
Analysis Components
- Intent
- What the user is actually looking for
- Also referred to as information need
- Query
- What the user actually submits to express their intent
- Result
- The documents that match the query and are ranked to match the intent as much as possible
Static Ranking
- Methods of ranking where the ranking does not change over time, unless the underlying data changes
- Good start to ranking based on entered keywords
- Popular Methods:
- TF-IDF
- BM25
- PageRank
- HITS
Static Ranking Methods:
TF-IDF
Basic Idea
Term Frequency - Inverse Document Frequency
- Used to determine how relevant a document is given keywords
- The more frequent a given keyword is in a document, the higher the ranking
- Simple ratio of how many times the keyword appears over how many total terms in document
- Uses a logarithmic scale so that very frequent terms do not become too “important”
- Was used for a while in early web searching
- Yahoo, MSN, AOL, lots of search engines
Static Ranking Methods:
TF-IDF
Basic Formula
Term Frequency - Inverse Document Frequency
TF(d,t) = log(1 + n(d,t) / n(d))
- n(d) is the number of terms in document d
- n(d,t) is the number of times term t occurs in document d
Static Ranking Methods:
TF-IDF:
Calculating Actual Relevance
relevance(d,Q) = SUM_{t in Q} [ TF(d,t) / n(t) ]
n(t) is the number of documents containing the term t
Dividing by n(t) lowers the score of very frequent terms,
making the rarer terms in the query more important
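A minimal sketch of the two formulas above in Python, assuming a toy corpus represented as lists of lowercase tokens (the function and variable names are illustrative, not from the slides):
```python
import math

def tf(doc_tokens, term):
    # TF(d,t) = log(1 + n(d,t) / n(d))
    n_d = len(doc_tokens)                 # number of terms in the document
    n_dt = doc_tokens.count(term)         # occurrences of the term in the document
    return math.log(1 + n_dt / n_d)

def relevance(doc_tokens, query_terms, corpus):
    # relevance(d,Q) = sum over t in Q of TF(d,t) / n(t)
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)   # documents containing t
        if n_t:
            score += tf(doc_tokens, t) / n_t
    return score

corpus = [["hello", "world"], ["hello", "ben"], ["world", "of", "databases"]]
print(relevance(corpus[0], ["hello", "world"], corpus))
```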
Ranking:
Why is it important to rank returned results?
- Users do not want to sort through potentially millions of answers
- Most Google users never go to the second page
- The percentage of users visiting beyond the second page (3rd, 4th, etc.) decreases rapidly with the number of pages
- More likely to rewrite queries instead of searching through second page and beyond
- Users are short on time, have little domain knowledge and don’t want to scroll forever
- Ranking is a method to decode the user’s intent from their queries
Static Ranking Methods:
TF-IDF:
Other methods to expand TF-IDF
- Add more weight to important areas of document
- Beginning, title, sections
- Reduce weight of certain terms
- Words occurring late in the document
- What else did we already cover that should be reduced?
- Proximity
- Terms close together are weighted higher
- Use N-grams
Static Ranking Methods:
BM25
Basics
- More effective compared to TF-IDF
- Uses constants to scale certain values
- Considers the frequency of a term in the query
- Used for massive text documents
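A minimal sketch of the common Okapi BM25 scoring form with the usual k1 and b constants; the exact variant and constant values used in practice vary, so treat this as illustrative rather than a definitive implementation:
```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    # Okapi BM25: sum over query terms of
    #   IDF(t) * f(t,d) * (k1 + 1) / (f(t,d) + k1 * (1 - b + b * |d| / avgdl))
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N       # average document length
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)    # documents containing t
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        f_td = doc_tokens.count(t)                # term frequency in the document
        denom = f_td + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f_td * (k1 + 1) / denom
    return score

corpus = [["hello", "world"], ["hello", "ben"], ["world", "of", "databases"]]
print(bm25_score(["hello", "world"], corpus[0], corpus))
```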
Ranking Web Pages
- Web pages are really just a collection of documents online
- Can search them just like local collections of documents
- Extra things to consider:
- When ranking, what web page should be first
- Many search engines used to use something like TF-IDF
- TF-IDF can fail for many reasons
- Somewhat easy to manipulate
- Google revolutionized web page ranking with an algorithm called PageRank
Page Rank:
Overview
- Ranking algorithm developed by Google to solve the problems plaguing web search
- Simple Idea:
- Uses the links between pages to “vote” on the importance of a page
- If a page is popular, its “vote” counts for more
- It is recursive - earlier rankings affect future rankings
Page Rank:
Formula
R(u) = c * SUM_{v in B_u} [ R(v) / N_v ]
- R(u) = rank of web page u
- B_u = set of pages pointing to page u
- F_u = set of pages u points to
- N_u = |F_u| = number of links from u
Page Rank:
Recursion
- Since a page’s ranking depends on the rankings of the pages that link to it, the computation is recursive: a page’s earlier ranking affects its later ranking
- The formula is iterated until the change in rank from one iteration to the next is under some predefined threshold
- Epsilon is usually used as the name for this threshold
- Dangling links are ignored until PageRank has been calculated once
- The original paper explicitly states “…computation of PageRank is fairly straightforward if we ignore issues of scale”
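A minimal sketch of the iterative computation described above, assuming the link graph is a dict mapping each page to the pages it points to. Here c is treated as a normalization factor that keeps the total rank constant, as in the simplified formulation; damping and dangling-link handling are omitted, and the names are illustrative:
```python
def pagerank(out_links, epsilon=1e-6, max_iters=100):
    # out_links: {page: [pages it points to]}; assumes no dangling pages
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iters):
        new_rank = {}
        for u in pages:
            # R(u) = c * sum over v in B_u of R(v) / N_v
            backlinks = [v for v in pages if u in out_links[v]]
            new_rank[u] = sum(rank[v] / len(out_links[v]) for v in backlinks)
        # c: normalize so the total rank stays constant (sums to 1)
        total = sum(new_rank.values()) or 1.0
        new_rank = {p: r / total for p, r in new_rank.items()}
        # iterate until the change from one iteration to the next is under epsilon
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < epsilon:
            return new_rank
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```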
Page Rank:
Dangling Links
Pages that “go nowhere”.
They do not have links to other pages.
Ignored until PageRank has been calculated once, since other pages need to be recursively iterated.
HITS:
Overview
Hypertext Induced Topics Search (HITS)
- Developed by Jon Kleinberg in 1998
- Distinguishes types of queries:
- Specific Query - one that has the Scarcity Problem
- Broad Query - one that has the Abundance Problem
- Some pages have special roles:
- Authorities
- Hubs
HITS:
Authorities
Pages that have many incoming links from Hubs
- Recognized as providing significant, trustworthy, and useful information on some topic
- Has some “degree”: the number of pages that point to it
- Comparing the degree allows us to distinguish between authorities
HITS:
Hubs
Pages that point to many other relevant pages
- Can act as an “index”
- Generally point to lots of authorities
- For example, the course page is a Hub
HITS:
How Authorities and Hubs work together
Hubs point to many authorities
Authorities are referenced by many Hubs
Together, they create a bipartite graph,
essentially delineating different topics
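A minimal sketch of the mutually reinforcing hub/authority updates, in one common formulation (normalization and convergence details vary; names are illustrative):
```python
import math

def hits(out_links, iterations=20):
    # out_links: {page: [pages it points to]}
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of the pages pointing to it
        for p in pages:
            auth[p] = sum(hub[v] for v in pages if p in out_links[v])
        # hub score: sum of authority scores of the pages it points to
        for p in pages:
            hub[p] = sum(auth[q] for q in out_links[p])
        # normalize so the scores do not grow without bound
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: auth[p] / a_norm for p in pages}
        hub = {p: hub[p] / h_norm for p in pages}
    return hub, auth

print(hits({"a": ["b", "c"], "b": ["c"], "c": []}))
```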
Information Retrieval:
Determining Success
- When results are returned to the user, how do we determine whether they got the correct results or not?
- Multiple methods can be used to analyze this, depending on the data returned and the queries that were issued
- Also used to determine how well a database retrieval system is performing
- Specifically effectiveness, not efficiency
IR Analysis:
Efficiency vs Effectiveness
- Efficiency is primarily concerned with time
- Query Latency vs Query Throughput
- Indexing, query optimization, etc
- Effectiveness is concerned with accuracy
- Usually need some kind of feedback from the user
Effectiveness:
Determining Feedback
- Generally get feedback from the user
- Implicit
- Explicit
- Pseudo Explicit
Explicit feedback is stated directly by the user
- Relevance ratings, thumbs up/down, etc
Implicit feedback is inferred from user behavior
- Clicks, view time, etc
Pseudo feedback is drawn from the data itself
- Add to the search query a term that is frequent in the top results
Effectiveness:
Common Metrics
There are some well defined metrics:
- Precision
- Recall
- Relevance
- F-Measure (or F-Score)
- Reciprocal Rank
- Mean Reciprocal Rank(MRR)
- Normalized Discounted Cumulative Gain (NDCG)
Effectiveness:
Usefulness of Metrics
- Can be used as learning signals:
- Tune the system to maximize a given metric
- Reinforce strategies that improve it
- Can also be used simply to measure how effective the system is
Effectiveness:
Definitions:
False Negative
False Positive
False Negative
Some relevant documents may not be retrieved
False Positive
Some irrelevant documents may be retrieved
Effectiveness:
Precision
The ratio of relevant (desired) tuples returned
to the total number of tuples returned to the user
- P@K is used for K returned results, or per page
- Example:
- If IR retrieved the top 10 results and 5 of them are relevant to the user
- Then precision is 5/10 , or 1/2
- Max precision may not always be 1
Effectiveness:
Recall
Recall
is the ratio of relevant tuples returned to the user
vs the number of relevant tuples in the database
- Example:
- If IR returned the top 10 results
- 5 of them are relevant
- but there are 20 relevant tuples in the database
- Recall is 5/20, or 1/4
- Can be trivially increased by returning more documents,
- But this can decrease precision
- May not always be possible to get recall of 1
Effectiveness:
Precision vs Recall
- Recall can be trivially increased
- Simply return more documents
- Returning more documents can cause precision to decrease, so there is a tradeoff
- The converse is also true:
- Returning a single relevant tuple gives 100% precision, but this gives terrible recall
- The F-Measure method provides a way to combine both Precision and Recall
Effectiveness:
F-Measure
F-Measure
- Also called F-Score
- Formula for combining the Precision and Recall metrics
- Balances the trade-off between the two metrics
Effectiveness:
F-Measure:
Formula
Applies a constant, alpha (equivalently beta),
as a weight to balance Precision and Recall:
F = 1 / ( alpha/P + (1 - alpha)/R ) = (1 + beta^2) * P * R / ( beta^2 * P + R )
- F1 Measure:
- alpha = 0.5 -> beta = 1
- F1 = 2 * P * R / (P + R), perfectly balanced between Precision and Recall
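A minimal sketch computing Precision, Recall, and F1 for one query, assuming the set of relevant documents is known (names are illustrative):
```python
def precision_recall_f1(retrieved, relevant):
    # retrieved: list of returned doc ids; relevant: set of all relevant doc ids
    retrieved_relevant = [d for d in retrieved if d in relevant]
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall (alpha = 0.5, beta = 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 10 results returned, 5 of them relevant, 20 relevant docs in the collection
retrieved = [f"d{i}" for i in range(10)]
relevant = {f"d{i}" for i in range(0, 40, 2)}    # d0, d2, ..., d38 -> 20 docs
print(precision_recall_f1(retrieved, relevant))  # (0.5, 0.25, 0.333...)
```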

Effectiveness:
Precision and Recall
Over Time
- Both metrics can be used to evaluate the effectiveness of retrieval systems over time
- The system is usually queried with more than a single intent
- Take the total precision/recall and divide by the number of interactions
- Gives an average
- Good for tracking if the system is improving over time or if some change in the system worked
Effectiveness:
Order
Precision, Recall and F-Measure don’t take into account order.
However, order of the results matters.
- Users usually only look at the first few results
- Need a way to measure satisfaction that also takes order into account
- Metrics:
- Reciprocal Rank(RR)
- Mean Reciprocal Rank(MRR)
Effectiveness:
Reciprocal Rank (RR)
Reciprocal Rank (RR)
The inverse of the position of the first relevant document:
RR = 1 / pos_r
Used when we care about the position of the document in the returned results
Effectiveness:
Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank (MRR)
The average of the inverse of the position for the first relevant document
Useful for multiple interactions, especially when there is only a single relevant result.
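A minimal sketch of RR and MRR over several queries, assuming each query's results are a ranked list and the relevant documents are known (illustrative names):
```python
def reciprocal_rank(ranked_results, relevant):
    # RR = 1 / position of the first relevant document (0 if none retrieved)
    for pos, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0

def mean_reciprocal_rank(queries):
    # queries: list of (ranked_results, relevant_set) pairs
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

queries = [(["d3", "d1", "d2"], {"d1"}),   # first relevant at position 2 -> 1/2
           (["d5", "d6", "d7"], {"d7"})]   # first relevant at position 3 -> 1/3
print(mean_reciprocal_rank(queries))       # (1/2 + 1/3) / 2, about 0.417
```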

Effectiveness:
Relevance of Document
- How well a document matches what a user is actually looking for
- Some documents may be related, but not actually what they are looking for
- Basic: Relevance Judgement Scores
- Files that are created/learned by the system over time
- Rank documents from 0-4 on a given intent
- Expensive and biased
- Best to use a simple binary model, yes or no (click or no click)
- More Advanced: NDCG
Effectiveness:
Considerations/Factors that determine overall effectiveness
- Precision
- Single Query
- Over time
- Recall
- Single Query
- Over time
- Order
- Relevance of document
Effectiveness:
Normalized Discounted Cumulative Gain
(NDCG)
Normalized Discounted Cumulative Gain (NDCG)
A measure of the relevance of a document
- Ranges between 0 and 1
- Normalized from DCG
- DCG gives a larger score for documents with high relevance
- But, DCG is unbounded
- Normalized by dividing DCG by IDCG
- NDCG = DCG / IDCG
Effectiveness:
Discounted Cumulative Gain (DCG)
and IDCG
Discounted Cumulative Gain (DCG)
Gives a larger score for documents with high relevance.
Ideal DCG (IDCG) is calculated with the tuples ordered in the optimal order based on relevance. Used for normalizing DCG.
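A minimal sketch of DCG, IDCG, and NDCG using the common rel_i / log2(i + 1) discounting; other discounting variants exist, so treat the exact formula as illustrative:
```python
import math

def dcg(relevances):
    # DCG = sum over positions i of rel_i / log2(i + 1), positions starting at 1
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Relevance judgements (0-4) for the returned results, in returned order
print(ndcg([3, 2, 0, 1, 4]))   # between 0 and 1; 1.0 only for the ideal order
```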

Ranking Optimization:
Keyword Analysis
Methods/Considerations
- Stopping
- Stemming
- Proximity/Clustering
- N-Grams
- Synonyms
- Homonyms
Ranking Optimization:
Stopping
Removes extremely common terms
from the retrieval method or index
- Some terms are so common that they are not useful:
- “the” “and” “a” , etc
- The most frequently occurring words make up the majority of the size of the documents
- Uses Zipf’s Law to determine what words count as “stop words”
- Results in fewer terms to search over
- More accurate keyword searches
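A minimal sketch of stop word removal; the stop list here is a tiny illustrative subset (real systems use a much larger list, e.g. derived from collection frequencies):
```python
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}   # tiny illustrative list

def remove_stop_words(tokens):
    # Drop extremely common terms before indexing or querying
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat and the dog".split()))    # ['cat', 'dog']
```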
Ranking Optimization:
Stemming
Reduces some words to their “stem” prior to searching
- User’s keyword queries may not be an exact match, but rather some variant of the word
- Plurals, past tense, suffixes, prefixes, etc
- Stemming removes prefixes and suffixes
- Can reduce the size of the inverted index
- May use Table Lookup to find minimum form of different terms
- NLTK library does this
- Not always effective
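A minimal sketch using NLTK's Porter stemmer (assuming the nltk package is installed; a table-lookup lemmatizer would be used for the minimum-form approach mentioned above):
```python
from nltk.stem import PorterStemmer   # requires: pip install nltk

stemmer = PorterStemmer()
words = ["searching", "searched", "searches", "retrieval", "universities"]
# Reduce each word to its stem before indexing or matching
print([stemmer.stem(w) for w in words])
# e.g. ['search', 'search', 'search', 'retriev', 'univers']
```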
Ranking Optimization:
Proximity
- Idea:
- Keywords that are in the query might appear close to each other in the document
- The closer they are to each other, the more important the document is
- Vice Versa
- Can also consider the order of the keywords
- One way to take advantage of this is the use of N-Grams
Ranking Optimization:
N-Grams
- Basic Idea: Clustering characters together
- Not too useful for this context
- Better: Clustering words together
- Given: “Hello World. Hello Ben”
- 2-Grams:
- “Hello World”, “World. Hello”, “Hello Ben”
- 3-Grams:
- “Hello World. Hello”, “World. Hello Ben”
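A minimal sketch of word-level n-gram generation over a token list (punctuation handling is simplified here; names are illustrative):
```python
def word_ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Hello World. Hello Ben".split()
print(word_ngrams(tokens, 2))   # ['Hello World.', 'World. Hello', 'Hello Ben']
print(word_ngrams(tokens, 3))   # ['Hello World. Hello', 'World. Hello Ben']
```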
Ranking Optimization:
Synonyms and Homonyms
Synonyms
Consider synonyms for some terms the user includes in the query:
- Example: Document: “My instructor” , Query: “teacher”
- Expand query to include both terms “teacher or instructor”
- Expanding too much can be problematic: many terms have synonyms that are not appropriate to the user’s intent
Homonyms
Query may include a homonym of the term the user actually intended.
Can also be tricky.