Week 3 Flashcards
Information retrieval
Finding unstructured documents that satisfy an information need within large collections
Technologies for document indexing, search and retrieval
- e.g. search engines
Not only Google or Bing
- specialised search
Not only English
IR is not only text
- speech question answering - Hey Siri
- music retrieval - Shazam
Query
What the user conveys to the computer in an attempt to communicate the information need
Unstructured Data
In contrast to rigidly structured DBs, there is no obvious structure that is easy for a computer to deal with
- Could be some structure
– E.g. titles, paragraphs, HTML, XML, etc…
Information need
The topic the user wants to find out about (search for) on the web
Navigational Need
Query to take the user to a page (~10% of queries)
- typically company/business/org name
- Domain suffix
- Length of query < 3
Transactional need
~10% of queries focus on doing something on the web (e.g. buying, downloading, booking, etc…)
Informational need
Collect/retrieve/obtain some information
Use of question words, many phrases
Neither navigational nor transactional
Query length > 2
When is a document relevant?
If the user perceives that it contains information of value with respect to their information need
Measuring document relevance
Does it satisfy the user's need?
Has the user clicked on it?
How long did they spend on it?
Have they clicked on another?
Have they reformulated the query?
Is this a new query?
Representing queries
Keywords to model information need
Representing documents
Reduce each document to a set of index terms
- ‘meaning’ of document approximated by index terms
What should be in the index?
Words? Names? Dates? Set of pre-defined tags?
Search is then focused on the index
- Map user queries to the index i.e. determine the similarity of a query to a document via index terms
Indexing process
Acquire documents through crawling, feeds, deposits, etc…
Generate (compute) an index for each document
Store the index in an easy-to-search form
Index Search (retrieval) process
- Assist the user in formulating the query
- Transform the query (“index” it)
- Map the query’s representation against the index
- Retrieve and rank the results
- Help users browse the results (and re-formulate the query)
- Log user’s actions, etc…
How to choose index terms
The “(most) important” words?
Using titles? Words in bold? Quotes?
What (if any) background knowledge did you use?
What about combining or structuring words?
Have you thought about possible queries?
Have you “ranked” your words?
- How to decide on importance/relevance
- more generic vs more specific words
How about stop words? (the,and,of,it,…)
Term-document frequency matrix
rows of terms, columns of documents, each cell is the frequency of a term in the document
Sparse, massive matrices
Term-document incidence matrix
rows of terms, columns of documents, each cell is 1 if the term is in the document or 0 otherwise
Sparse, massive matrices
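A minimal sketch of how both matrices could be built from a toy collection (the documents and terms here are made up purely for illustration):

```python
from collections import Counter

# Toy collection (hypothetical documents, for illustration only)
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}

# Vocabulary = all terms seen in the collection
vocab = sorted({t for text in docs.values() for t in text.split()})

# Term-document frequency matrix: rows = terms, columns = documents
freq_matrix = {
    term: [Counter(text.split())[term] for text in docs.values()]
    for term in vocab
}

# Term-document incidence matrix: 1 if the term occurs in the document, else 0
incidence_matrix = {
    term: [1 if f > 0 else 0 for f in freq_matrix[term]]
    for term in vocab
}

print(freq_matrix["the"])       # [2, 2, 0]
print(incidence_matrix["cat"])  # [1, 1, 0]
```

Even on this tiny example most cells are zero, which is why the full matrices become sparse and massive on real collections.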
Inverted Index
For each term t, store a list of all documents that contain t; it is also useful to store frequency and positions (see the sketch below)
Sorted by term first, then by docID
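A minimal sketch of a positional inverted index over a toy collection (document IDs and text are illustrative assumptions):

```python
from collections import defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
}

# term -> {docID -> list of positions}; postings are kept sorted by docID,
# and term frequency in a document is just len(positions)
index = defaultdict(dict)
for doc_id in sorted(docs):
    for pos, term in enumerate(docs[doc_id].split()):
        index[term].setdefault(doc_id, []).append(pos)

print(index["cat"])  # {1: [1], 2: [4]}
print(index["the"])  # {1: [0, 4], 2: [0, 3]}
```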
Boolean Queries
Simplest query model: find all documents in the collection that fully match the query
- Binary outcome for each document yes/no
Use operators:
AND
OR
NOT
Can combine operators e.g. AND NOT
Many systems hide the operators (e.g. space is used as AND)
Pros:
- Simple model: everything is considered as a set of terms
- Easy/efficient to implement
- Precise (document either matches or doesn’t match)
- Widely used for commercial, legal retrieval and for specialist searches (long, precise queries)
- Works well when we know what we want: user feels in control
Cons:
- Users are not good at formulating queries: possible confusion with natural language
- Feast or famine: AND gives too few results, OR gives too many
- No relevance ordering / ranking of results
- Ignores word order, word frequency, etc…
- Documents that are ‘close’ to the query (but do not match exactly) are not retrieved
Boolean Queries AND Merge
Term1 AND Term2
Find Term1 and Term2 in the dictionary
Merge the results so that you are left with the documents that occur in both
Crucially, postings are sorted by docID so that this merge can be done efficiently (see the sketch below)
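A minimal sketch of the AND merge as a two-pointer intersection of docID-sorted postings lists (the lists here are made-up examples):

```python
def and_merge(postings1, postings2):
    """Intersect two postings lists sorted by docID in O(len1 + len2)."""
    result = []
    i = j = 0
    while i < len(postings1) and j < len(postings2):
        if postings1[i] == postings2[j]:
            result.append(postings1[i])  # docID contains both terms
            i += 1
            j += 1
        elif postings1[i] < postings2[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical postings for Term1 and Term2
print(and_merge([1, 2, 4, 11, 31, 45], [2, 31, 54, 101]))  # [2, 31]
```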
Extended boolean model
Proximity operators
Embed term positions in the inverted index (proximity index)
Queries can then refer to:
/n - within n words (e.g. /3 = within 3 words)
/s - in the same sentence, /p - in the same paragraph
+s - term1 must precede term2 in the same sentence
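A minimal sketch of how a /n proximity check could work over a positional index (a naive version for illustration; real systems use a more careful linear merge of the position lists, and the tiny index below is made up):

```python
def within_k(positions1, positions2, k):
    """True if some occurrence of term1 is within k words of term2."""
    return any(abs(p1 - p2) <= k for p1 in positions1 for p2 in positions2)

def proximity_query(index, term1, term2, k):
    """Return docIDs where term1 and term2 occur within k words of each other."""
    common_docs = index.get(term1, {}).keys() & index.get(term2, {}).keys()
    return sorted(
        d for d in common_docs
        if within_k(index[term1][d], index[term2][d], k)
    )

# Tiny made-up positional index: term -> {docID -> positions}
index = {
    "cat": {1: [1], 2: [4]},
    "mat": {1: [5]},
}
print(proximity_query(index, "cat", "mat", 5))  # [1]
```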
Ranked retrieval
Introduce similarity and ranking of matched documents
- Score each document to say how well it matches a query
- Typically aim to get the top 10 results right (users will typically not look further)
Idea: Use vector representation for both documents and queries, and calculate their similarity
Vector representations
Represent both documents and queries as vectors in the same space
Rank (all) documents according to their proximity in that given space
Do this to get away from the either-you’re-in-or-out model
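A minimal sketch of ranking by proximity in vector space using cosine similarity (the weight vectors here are made-up values over a shared term space):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical query and document vectors over the same three terms
query = [1.0, 0.0, 1.0]
doc_vectors = {"d1": [0.5, 0.2, 0.9], "d2": [0.0, 1.0, 0.1]}

ranked = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranked)  # ['d1', 'd2'] - every document gets a score, not a yes/no
```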
Term frequency
tf of term t and document d is the number of times t appears in document d
Relevance does not increase proportionally with term frequency
Document frequency
In how many documents a term appears
Rare terms are more informative and discriminative than frequent terms for information retrieval
Collection frequency
The number of occurrences of t in the collection (counting multiple occurrences in the same document)
With equal collection frequency, the term with lower document frequency is the more informative “better” term
Inverse document Frequency
idf = log(N/df) where N is the number of documents
The base of the log is immaterial; the log is used to dampen the effect
Tf-Idf
Term frequency × Inverse Document Frequency
tf - the number of times a term appears in a document
idf - log(N/df), where N is the number of documents
df - the number of documents a term appears in
The tf-idf weight increases:
- With the number of occurrences within a document
- With the rarity of the term in the collection
tf-idf document ranking for a query
Score(q,d) of a document d for query q is the sum of the tf-idf weights of each query term t found in d
Indexing: For each term and document calculate tf*idf
Typical output: A list of documents ranked according to score(q,d)
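A minimal sketch of this scoring, score(q,d) = Σ over query terms t in d of tf(t,d) × idf(t) (the toy documents and query are assumptions for illustration):

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs bark at cats",
}
N = len(docs)
tokenised = {d: text.split() for d, text in docs.items()}

def idf(term):
    """idf = log(N/df); 0 if the term never occurs in the collection."""
    df = sum(1 for toks in tokenised.values() if term in toks)
    return math.log(N / df) if df else 0.0

def score(query, doc_id):
    """Sum of tf*idf over the query terms that occur in the document."""
    tf = Counter(tokenised[doc_id])
    return sum(tf[t] * idf(t) for t in query.split() if tf[t] > 0)

query = "cat mat"
ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
print([(d, round(score(query, d), 3)) for d in ranking])  # d1 ranks first
```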
What’s wrong with VSMs for information retrieval
Vector space models work reasonably well, but:
- Based on bag of words, so ignore context
- Heuristic in nature
– No notion of relevance, only similarity
– Weights seem to work, but this is mostly engineering rather than theory
– Don't adapt to the user or collection
– Term weights should be user and domain specific
– How to “train” on a particular collection?
- Does not model uncertainty
PageRank
See Page Rank slides
Adapt ranking to user clicks?
Higher positions receive more user attention (eye fixation) and clicks than lower positions
This holds even when the order of the results is reversed
Clicks are therefore “informative but biased”
Document relevance feedback
User feeds back on relevance of docs in an initial set of results
- User issues a (short, simple) query
- The user marks some results as relevant or non-relevant
- The system computes a better representation of information need based on feedback
- Relevance feedback can go through one or more iterations
Idea: It may be difficult to formulate a good query when you don't know the collection well, so iterate
We can modify the query based on relevance feedback and apply a standard vector space model
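A minimal sketch of one common way to do this (a Rocchio-style update, which is an assumption here since the notes do not name a specific method): the query vector is moved towards the centroid of documents marked relevant and away from the centroid of non-relevant ones, then the modified query is re-run through the standard vector space model.

```python
def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query modification over equal-length weight vectors."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query_vec)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel_c = centroid(relevant)
    nonrel_c = centroid(non_relevant)
    # New weights, clipped at zero (negative weights are usually dropped)
    return [
        max(0.0, alpha * q + beta * r - gamma * nr)
        for q, r, nr in zip(query_vec, rel_c, nonrel_c)
    ]

# Hypothetical query and feedback vectors over the same term space
new_query = rocchio(
    query_vec=[1.0, 0.0, 0.0],
    relevant=[[0.8, 0.6, 0.0], [0.9, 0.4, 0.1]],
    non_relevant=[[0.1, 0.0, 0.9]],
)
print(new_query)  # roughly [1.62, 0.38, 0.0]
```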
Relevance feedback can improve precision and recall.
It is most useful for improving recall in situations where recall is important (users can be expected to review results and to take time to iterate)
Problems:
- Users are often reluctant to give explicit feedback
- Often hard to understand why a particular document was retrieved after applying relevance feedback
Long queries are inefficient for a typical IR engine
- Long response times for user
- High cost for retrieval system
- Partial solution
– Only reweight certain prominent terms
— Perhaps top 20 by term frequency
Query Expansion
In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents
In query expansion, users give additional input (good/bad search terms) on words or phrases
Add more words of the same (or more specific) meaning to your query for better retrieval
Could use a thesaurus, automatically derived thesaurus. Refinements based on query log mining (common on the web)
Thesaurus based query expansion
For each term t in a query, expand the query with synonyms and related words of t from a thesaurus
feline -> feline cat
May weight added terms less than original query terms
Generally increases recall but may significantly decrease precision, particularly with ambiguous terms.
“interest rate” -> “interest rate fascinate evaluate”
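A minimal sketch of thesaurus-based expansion with down-weighted added terms (the tiny thesaurus and the weight of 0.5 are assumptions for illustration; note how the ambiguous “interest rate” example picks up unhelpful expansions):

```python
# Tiny hand-made thesaurus: term -> related/synonymous terms (an assumption)
thesaurus = {
    "feline": ["cat"],
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_query(query, added_weight=0.5):
    """Return {term: weight}; original terms get weight 1.0, added terms less."""
    weights = {}
    for term in query.split():
        weights[term] = 1.0
        for related in thesaurus.get(term, []):
            # Don't overwrite an original query term with a lower weight
            weights.setdefault(related, added_weight)
    return weights

print(expand_query("feline"))
# {'feline': 1.0, 'cat': 0.5}
print(expand_query("interest rate"))
# {'interest': 1.0, 'fascinate': 0.5, 'rate': 1.0, 'evaluate': 0.5}
```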