Week 3 Flashcards
Information retrieval
Finding unstructured documents that satisfy an information need within large collections
Technologies for document indexing, search and retrieval
- e.g. search engines
Not only Google or Bing
- specialised search
Not only English
IR is not only text
- speech question answering - Hey Siri
- music retrieval - Shazam
Query
What the user conveys to the computer in an attempt to communicate the information need
Unstructured Data
In contrast to rigidly structured databases, the data has no obvious structure that is easy for a computer to work with
- Could be some structure
- e.g. titles, paragraphs, HTML, XML, etc.
Information need
The topic the user wants to find out about (search for) on the web
Navigational Need
Query to take the user to a page (~10% of queries)
- typically company/business/org name
- Domain suffix
- Length of query < 3
Transactional need
~10% of queries focus on doing something on the web (e.g. buying, downloading, booking, etc…)
Informational need
Collect/retrieve/obtain some information
Use of question words, many phrases
Neither navigational nor transactional
Query length > 2
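A rough sketch of how the cues on the three need cards (domain suffix, query length, question words, action verbs) could be combined into a heuristic classifier. The cue lists and the function name `classify_need` are assumptions made for illustration, not part of the notes; real systems would rely on much richer signals such as query logs.

```python
import re

# Hypothetical cue lists, chosen only for illustration.
DOMAIN_SUFFIX = re.compile(r"\.(com|org|net|edu|gov|co\.uk)\b", re.IGNORECASE)
TRANSACTIONAL_CUES = {"buy", "download", "book", "order", "purchase"}
QUESTION_WORDS = {"what", "who", "when", "where", "why", "how", "which"}


def classify_need(query: str) -> str:
    """Guess the need behind a query from simple surface cues."""
    terms = query.lower().split()
    # Transactional: the query mentions doing something on the web.
    if any(t in TRANSACTIONAL_CUES for t in terms):
        return "transactional"
    # Navigational: a domain suffix, or a short query (< 3 terms)
    # that contains no question word.
    short = len(terms) < 3 and not (QUESTION_WORDS & set(terms))
    if DOMAIN_SUFFIX.search(query) or short:
        return "navigational"
    # Neither navigational nor transactional.
    return "informational"


print(classify_need("bbc.co.uk"))                        # navigational
print(classify_need("download python installer"))        # transactional
print(classify_need("how does an inverted index work"))  # informational
```

The heuristic is easy to fool (e.g. "book" used as a noun), which is one reason these categories are usually treated as approximate.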
When is a document relevant?
If the user perceives that it contains information of value with respect to their information need
Measuring document relevance
Does it satisfy the user's need?
Has the user clicked on it?
How long did they spend on it?
Have they clicked on another?
Have they reformulated the query?
Is this a new query?
Representing queries
Keywords to model information need
Representing documents
Reduce each document to a set of index terms
- ‘meaning’ of document approximated by index terms
What should be in the index?
Words? Names? Dates? Set of pre-defined tags?
Search is then focused on the index
- Map user queries to the index i.e. determine the similarity of a query to a document via index terms
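One simple way to "determine the similarity of a query to a document via index terms" is set overlap. The sketch below uses Jaccard similarity, which is an assumed choice for illustration; the notes do not name a specific similarity measure, and the function name `jaccard` is hypothetical.

```python
def jaccard(query_terms, doc_terms):
    """Share of terms common to both, out of all distinct terms in either."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0  # avoid dividing by zero when both are empty
    return len(q & d) / len(q | d)


query = ["music", "retrieval"]
doc_index_terms = ["music", "search", "retrieval", "shazam"]
print(jaccard(query, doc_index_terms))  # 0.5
```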
Indexing process
Acquire documents through crawling, feeds, deposits, etc…
Generate (compute) an index for each document
Store the index in an easy-to-search form
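A minimal sketch of the last two steps, assuming the documents have already been acquired as plain strings. An inverted index (term -> set of document ids) is used here as one common "easy-to-search form"; the notes do not prescribe it, and `tokenize` and `build_inverted_index` are illustrative names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):  # the index terms of this document
            index[term].add(doc_id)
    return index


docs = {
    "d1": "Search engines index documents",
    "d2": "Music retrieval and search",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # ['d1', 'd2']
```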
Index Search (retrieval) process
- Assist the user in formulating the query
- Transform the query (“index” it)
- Map the query’s representation against the index
- Retrieve and rank the results
- Help users browse the results (and re-formulate the query)
- Log user’s actions, etc…
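The middle steps (transform the query, map it against the index, retrieve and rank) can be sketched as below. Scoring by summed term frequency is an assumption chosen for brevity, not the method from the notes, and the names `tokenize` and `search` are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs):
    """Return ids of matching documents, best match first."""
    query_terms = tokenize(query)  # "index" the query like a document
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        # Score: how often the query terms occur in this document.
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Rank by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "music retrieval with Shazam",
    "d2": "speech question answering",
    "d3": "music search and music retrieval",
}
print(search("music retrieval", docs))  # ['d3', 'd1']
```

A real system would look terms up in a prebuilt index rather than re-tokenizing every document per query; the loop above only keeps the sketch short.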
How to choose index terms
The “(most) important” words?
Using titles? Words in bold? Quotes?
What (if any) background knowledge did you use?
What about combining or structuring words?
Have you thought about possible queries?
Have you “ranked” your words?
- How to decide on importance/relevance
- more generic vs more specific words
How about stop words? (the,and,of,it,…)
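One possible answer to several of these questions, as a sketch: drop stop words, then rank the remaining words by frequency and keep the top few as index terms. The stop-word list is a tiny hypothetical sample, frequency is only one of many ways to judge importance, and `index_terms` is an illustrative name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "of", "it", "a", "an", "in", "to", "is"}


def index_terms(text, k=5):
    """Return up to k non-stop words, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(k)]


print(index_terms("The index of the index and the search of it", k=2))
# ['index', 'search']
```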
Term-document frequency matrix
Rows are terms, columns are documents; each cell holds the frequency of that term in that document
Sparse, massive matrices
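A small sketch of such a matrix built from raw text, using plain Python lists so it stays self-contained. The function name `term_document_matrix` and the example documents are made up for illustration.

```python
import re
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, doc_ids, matrix) with matrix[i][j] = count of terms[i] in doc_ids[j]."""
    doc_ids = sorted(docs)
    counts = {d: Counter(re.findall(r"[a-z0-9]+", docs[d].lower())) for d in doc_ids}
    terms = sorted(set().union(*counts.values()))
    matrix = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix


docs = {"d1": "search search index", "d2": "index music"}
terms, doc_ids, matrix = term_document_matrix(docs)
print(terms)    # ['index', 'music', 'search']
print(doc_ids)  # ['d1', 'd2']
print(matrix)   # [[1, 1], [0, 1], [2, 0]]
```

Even in this toy case a third of the cells are zero. With a realistic vocabulary and collection almost all cells are zero, so in practice only the non-zero entries are stored.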