Information Retrieval Flashcards
Information Retrieval Systems:
Overview
- Simpler model than RDBMS
- Info organized as a collection of documents
- Documents are unstructured, no schema(generally)
- Locate relevant documents based on user input such as keywords
- Can be used on textual descriptions that are provided with non-textual data
- Example: Images with descriptions
- Web Search engines are the most popular form of IR systems
Information Retrieval Systems:
Differences with RDBMS
- IR Systems organize information into unstructured documents
- Where database systems deal with structured data
- IR Systems don’t deal with transactional updates
- IR Systems deal with some querying issues not generally addressed by database systems
- Approximate searching by keywords
- Ranking retrieved answers
- by estimating degree of relevance
Information Retrieval Systems:
Querying issues that
database systems do not deal with
- Approximate Searching by keywords
- Ranking of retrieved answers by estimating degree of relevance
IR Systems:
Terminology
- Full Text Retrieval
- Keywords
- Keyword Queries
- Term
- Intent
- Query
- Result
Keyword Search
- In Full Text Retrieval, all the words in each document are considered to be keywords
- The word “term” refers to the words in a document
- IR Systems typically allow query expressions
- __Formed using keywords and logical connectives
- and, or, not
- __Formed using keywords and logical connectives
- Ranking of documents on the basis of estimated relevance to a query is critical
IR Systems:
Returned Results
- In relational databases, there is generally no ranking on returned results
- In contrast, when dealing with Keyword Queries, many documents may match the query
- A google search can return millions of results
- In contrast, when dealing with Keyword Queries, many documents may match the query
- Results must be ranked so that the most relevant tuples are the first thing a user sees
- Need to determine what the user is looking for
Keyword Queries:
Analysis Components
- Intent
- What the user is actually looking for
- Also referred to as information need
- Query
- What the user actually submits to express their intent
- Result
- The documents that match the query and are ranked to match the intent as much as possible
Static Ranking
- Methods of ranking where the ranking does not change over time, unless the underlying data changes
- Good start to ranking based on entered keywords
- Popular Methods:
- TF-IDF
- BM25
- Page Rank
- HITS
Static Ranking Methods:
TF-IDF
Basic Idea
Term Frequency - Inverted Document Frequency
- Used to determine how relevant a document is given keywords
- The more frequent a given keyword is in a document, the higher the ranking
- Simple ratio of how many times the keyword appears over how many total terms in document
- Uses a logarithmic scale so that very frequent documents do become too “important”
- Was used for a while in early web searching
- Yahoo, MSN, AOL, lots of search engines
Static Ranking Methods:
TF-IDF
Basic Formula
Term Frequency - Inverted Document Frequency
TF(d,t ) = log( 1 + n(d,t ) / n(d ) )
- n(d ) is the number of terms in a document
- n(d, t) is the number of times term t occurs in document d
Static Ranking Methods:
TF-IDF:
Calculating Actual Relevance
relevance(d, Q) = SUMt in Q [TF(d,t) / n(t)]
n(t) is the number of documents containing the term t
This lowers the score of super frequent terms,
making more unique terms in the query more important
Ranking:
Why is it important to rank returned results?
- Users do not want to sort through potentially millions of answers
- Most google users never go to the second page
- Percentage of Users visiting beyond the second page,3rd, 4th, etc, decreases exponentially with the number of pages
- More likely to rewrite queries instead of searching through second page and beyond
- Users are short on time, have little domain knowledge and don’t want to scroll forever
- Ranking is a method to decode the user’s intent from their queries
Static Ranking Methods:
TF-IDF:
Other methods to expand TF-IDF
- Add more weight to important areas of document
- Beginning, title, sections
- Reduce weight of documents
- Words occurring late in document
- What else did we already cover that should be reduced?
- Proximity
- Terms close together are weighted higher
- Use N-grams
- Terms close together are weighted higher
Static Ranking Methods:
BM25
Basics
- More effective compared to TF-IDF
- Uses constants to scale certain values
- Considers the frequency of a term in the query
- Used for massive text documents
Ranking Web Pages
- Web pages are really just a collection of documents online
- Can search them just like local collections of documents
- Extra things to consider:
- When ranking, what web page should be first
- Many search engines use to use something like TF-IDF
- TF-IDF can fail for many reasons
- Somewhat easy to manipulate
- Google revolutionized web page ranking with an algorithm called PageRank
Page Rank:
Overview
- Ranking algorithm developed by google to solve the problems plaguing web search
- Simple Idea:
- Uses the links between pages to “vote” on the importance of a page
- If a page is popular, its “vote” counts for more
- It is recursive - earlier rankings affect future rankings
Page Rank:
Formula
R(u) = c*SUM<em>v in B<strong>u</strong></em> [R(v) / Nv]
- R(u) = Rank of web page u
- Bu = set of pages pointing to the page u
- Fu = set of pages u points to
- Nu = | Fu| = number of links from u
Page Rank:
Recursion
- Since a page’s ranking depends on the ranking of pages linked to it, there is some recursion: a page’s earlier ranking affects it’s ranking
- The formula is iterated until the change in rank from one iteration to the next is under some predefined threshold
- Epsilon is usually used as the variable
- Dangling links are ignored until PageRank has been calculated once
- The original paper explicitly states “…computation of PageRank is fairly straightforward if we ignore issues of scale”
Page Rank:
Dangling Links
Pages that “go nowhere”.
They do not have links to other pages.
Ignored until PageRank has been calculated once, since other pages need to be recursively iterated.