Week 3 Flashcards

Question 1

Q

Information retrieval

Answer

A

Finding unstructured documents that satisfy an information need within large collections

Technologies for document indexing, search and retrieval
- e.g. search engines

Not only google or bing
- specialised search

Not only English

IR is not only text
- speech question answering - Hey Siri
- music retrieval - Shazam

Question 2

Q

Query

Answer

A

What the user conveys to the computer in an attempt to communicate the information need

Question 3

Q

Unstructured Data

Answer

A

In contrast to rigidly structured DBs, no obvious or easy-for-computer to deal with structure
- Could be some structure
– E.g. titles, paragraphs, HTML, XML, etc…

Question 4

Q

Information need

Answer

A

topic about which the user desires to address (search for) on the web

Question 5

Q

Navigational Need

Answer

A

Query to take the user to a page (~10% of queries)
- typically company/business/org name
- Domain suffix
- Length of query < 3

Question 6

Q

Transactional need

Answer

A

~10% of queries focus on doing something on the web (e.g. buying, downloading, booking, etc…)

Question 7

Q

Informational need

Answer

A

Collect/retrieve/obtain some information
Use of question words, many phrases
Neither navigational nor transactional
Query length > 2

Question 8

Q

When is a document relevant?

Answer

A

If the user perceives that it contains information of value with respect to their information need

Question 9

Q

Measuring document relevance

Answer

A

Does it satisfy the users need

Has the user clicked on it?
How long did they spend on it?
Have they clicked on another?
Have they reformulated the query?
Is this a new query?

Question 10

Q

Representing queries

Answer

A

Keywords to model information need

Question 11

Q

Representing documents

Answer

A

Reduce each document to a set of index terms
- ‘meaning’ of document approximated by index terms

What should be in the index?
Words? Names? Dates? Set of pre-defined tags?

Search is then focused on the index
- Map user queries to the index i.e. determine the similarity of a query to a document via index terms

Question 12

Q

Indexing process

Answer

A

Acquire documents through crawling, feeds, deposits, etc…

Generate (compute) an index for each document

Store the index in an easy-to-search form

Question 13

Q

Index Search (retrieval) process

Answer

A

Assist the user in formulating the query
Transform the query (“index” it)
Map the query’s representation against the index
Retrieve and rank the results
Help users browse the results (and re-formulate the query)
Log user’s actions, etc…

Question 14

Q

How to choose index terms

Answer

A

“(most) important words?”?
Using titles? Words in bold? Quotes?
What (if any) background knowledge did you use?
What about combining or structuring words?
Have you though about possible queries?

Have you “ranked” your words?
- How to decide on importance/relevance
- more generic vs more specific words

How about stop words? (the,and,of,it,…)

Question 15

Q

Term-document frequency matrix

Answer

A

rows of terms, columns of documents, each cell is the frequency of a term in the document

Sparse, massive matrices

Question 16

Q

Term-document incidence matrix

Answer

A

rows of terms, columns of documents, each cell is 1 if the term is in the document or 0 otherwise

Sparse, massive matrices

Question 17

Q

Inverted Index

Answer

A

For each term t, store all the documents that contain t, good to store frequency and position too.

Sorted by terms first, then docID

Question 18

Q

Boolean Queries

Answer

A

Simplest query model: find all document from the collection that fully match the query
- Binary outcome for each document yes/no

Use operators:
AND
OR
NOT

Can combine operators e.g. AND NOT

Many systems hide the operators (e.g. space is used as AND)

Pros:
- Simple model: everything is considered as a set of terms
- Easy/efficient to implement
- Precise (document either matches or doesn’t match)
- Widely used for commercial, legal retrieval and for specialist searches (long, precise queries)
- Works well when we know what we want: user feels in control

Cons:
- Users are not good in formulating queries: possible confusion with natural language
- Feast or famine: AND gives too few results, OR gives too many
- No relevance ordering / ranking of results
- Ignores word order, word frequency, etc…
- Documents that are ‘close’ are not retrieved

Question 19

Q

Boolean Queries AND Merge

Answer

A

Term1 AND Term2
Find Term1 and Term2 in the dictionary

Merge the results so that you are left with the documents that occur in both

Crucial postings are sorted by docID to do this merge efficiently

Question 20

Q

Extended boolean model

Answer

A

Proximity operators
Embed term positions in the inverted index (proximity index)

Queries can then refer to:
/n - (e.g. /3 within 3 words)
/s - in the same sentence, /p in the same paragraph
+s = term1 must proceed term2 in same sentence

Question 21

Q

Ranked retrieval

Answer

A

Introduce similarity and ranking of matched documents
- Score each document to say how well it matches a query
- Typically aim to get top 10 ones correct (user typically will not look further)

Idea: Use vector representation for both documents and queries, and calculate their similarity

Question 22

Q

Vector representations

Answer

A

Represent both documents and queries as vectors in the same space

Rank (all) documents according to their proximity in that given space

Do this to get away from the either-you’re-in-or-out model

Question 23

Q

Term frequency

Answer

A

tf or term t and document d is the number of times t appears in document d

Relevance does not increase proportionally with term frequency

Question 24

Q

Document frequency

Answer

A

In how many documents a term appears

Rare terms are more informative and discriminative than frequent terms for information retrieval

Question 25

Q

Collection frequency

Answer

A

The number of occurrences of t in the collection (counting multiple occurrences in the same document)

With equal collection frequency, the term with lower document frequency is the more informative “better” term

Question 26

Q

Inverse document Frequency

Answer

A

idf = log(N/df) where N is the number of documents

Based of the log is immaterial, log is used to dampen the effect

Question 27

Q

Tf-Idf

Answer

A

Term frequency vs Inverse Document Frequency

tf - the number of times a term appears in a document
idf - Log(N/df) where N is the number of documents
df - The number of documents a term appears in

Increases
- With the number of occurrences within a document
- With the rarity of the term in the collection

Question 28

Q

tf-idf document ranking for a query

Answer

A

Score of a document d is sum over all tf-idf weights of each query term t found in d

Indexing: For each term and document calculate tf*idf

Typical output: A list of documents ranked according to score (q,d)

Question 29

Q

What’s wrong with VSMs for information retrieval

Answer

A

Vector space models work reasonably well, but:
- Based on bag of words, so ignore context

Heuristic in nature
– Not notion of relevance, similarity only
– Weights seem to work - mostly engineering (rather than theory)
Don’t adapt to the user or collection
– Term weights should be user and domain specific
– How to “train” on a particular collection

Does not model uncertainty

Question 30

Q

PageRank

Answer

A

See Page Rank slides

Question 31

Q

Adapt ranking to user clicks?

Answer

A

Higher positions receive more user attention (eye fixation) and clicks than lower positions

This is true even in the setting that the order of positions is reversed

They are “Informative but biased”

Question 32

Q

Document relevant feedback

Answer

A

User feeds back on relevance of docs in an initial set of results
- User issues a (short, simple) query
- The user marks some results as relevant or non-relevant
- The system computes a better representation of information need based on feedback
- Relevant feedback can go through one or more iterations

Idea: It may be difficult to formulate a good query when you don’t the collection well, so iterate

We can modify the query based on relevance feedback and apply a standard vector space model

Relevance feedback can import precision and recall.

It is most useful for improving recall in situations where recall is important (users can be expected to review results and to take time to iterate)

Problems:
- Users are often reluctant to give explicit feedback
- Often hard to understand why a particular document was retrieved after applying relevance feedback

Long queries are inefficient for typical IR engine
- Long response times for user
- High cost for retrieval system
- Partial solution
– Only rewrite certain prominent terms
— Perhaps top 20 by term frequency

Question 33

Q

Query Expansion

Answer

A

In relevance feedback, users give additional input (relevant / non-relevant) on documents which is used to rewrite terms in the documents

In query expansion, users give additional input (good/bad search terms) on words or phrases

Add more words of the same (or more specific) meaning to your query for better retrieval

Could use a thesaurus, automatically derived thesaurus. Refinements based on query log mining (common on the web)

Question 34

Q

Thesaurus based query expansion

Answer

A

For each term t in a query, expand the query with synonyms and related words of t from a thesaurus

feline -> feline cat

May weight added terms less than original query terms

Generally increases recall but may significantly decrease precision, particularly with ambiguous terms.
“interest rate” -> “interest rate fascinate evaluate”