W1 Intro Flashcards
What’s Information Retrieval?
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
Key words:
- Collection of unstructured documents
- User
- Information need
- Relevance
- Query
What does an IR system generally do?
- Analyze queries & docs
- Retrieve docs: compute a relevance score for each (query, doc) pair
- Rank docs by score (see the sketch below)
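A minimal sketch of this score-then-rank loop, with naive term overlap standing in for a real scoring function such as BM25:
```python
# Toy analyze -> score -> rank loop; overlap counting is only a
# stand-in for a real relevance function.
def overlap_score(query_terms: set, doc_terms: set) -> int:
    # Relevance score of one (query, doc) pair: number of shared terms.
    return len(query_terms & doc_terms)

def rank(query: str, docs: list) -> list:
    query_terms = set(query.lower().split())                 # analyze query
    scored = [(overlap_score(query_terms, set(d.lower().split())), d)
              for d in docs]                                 # score each doc
    scored.sort(key=lambda pair: pair[0], reverse=True)      # rank by score
    return [doc for _, doc in scored]

print(rank("cat food", ["dog food brands", "cat food reviews", "car parts"]))
# -> ['cat food reviews', 'dog food brands', 'car parts']
```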
Comparison between search engines and recommender systems?
Search Engine:
* User has an information need
* User types a query
* System returns docs ranked by relevance to the query (based on match/popularity)
Recommender System:
* User has an interest and goes to a certain platform
* System returns items relevant to the user (based on profile/popularity)
4 principles of IR
Principles of…
- Relevance
- Ranking
- Text Processing
- User Interaction
What’s Two-stage Retrieval?
Query -> Initial retrieval from the full collection -> Re-ranking the top-n results with a trained re-ranker (sketched below)
1st stage:
retrieves from the large collection; unsupervised, term-based (sparse)
priority is RECALL
2nd stage:
re-ranks the top-n docs from the 1st stage; supervised, embedding-based (dense)
priority is PRECISION
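A minimal sketch of the architecture; the two scoring functions are passed in as parameters since the cards cover them separately (in practice stage 1 would be BM25 over an inverted index, stage 2 a trained model such as a cross-encoder):
```python
from typing import Callable

def two_stage_search(query: str,
                     collection: list,
                     sparse_score: Callable[[str, str], float],
                     rerank_score: Callable[[str, str], float],
                     n: int = 100,
                     k: int = 10) -> list:
    # Stage 1 (recall): cheap term-based scoring over the full
    # collection; keep the top-n candidates.
    candidates = sorted(collection,
                        key=lambda doc: sparse_score(query, doc),
                        reverse=True)[:n]
    # Stage 2 (precision): expensive supervised re-scoring of only
    # those n candidates; return the top-k.
    return sorted(candidates,
                  key=lambda doc: rerank_score(query, doc),
                  reverse=True)[:k]
```
Sorting the full collection is itself a simplification: a real first stage uses an inverted index so only docs sharing a term with the query are scored at all.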
1) What’s the most used term-based retrieval model?
2) What mechanism does it use?
3) What’s the problem of the model?
1) BM25
2) exact match, term weighting (tf-idf)
3) BM25 relies on exact matching, but searchers often use different terms to describe their information need than the authors of the relevant documents used (the vocabulary mismatch problem)
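A sketch of the BM25 scoring function itself (standard Okapi form with the common "+1 inside the log" IDF variant; exact IDF details vary slightly across implementations):
```python
import math

def bm25(query_terms: list, doc_terms: list,
         doc_freq: dict, num_docs: int, avg_doc_len: float,
         k1: float = 1.5, b: float = 0.75) -> float:
    # k1 and b are the model's only free hyperparameters.
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)       # term frequency in this doc
        if tf == 0:
            continue                     # exact match only: no credit for synonyms
        df = doc_freq.get(term, 0)       # number of docs containing the term
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        # tf saturation (k1) and doc-length normalization (b):
        norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm_tf
    return score
```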
How to solve exact term matching problem?
Use semantic matching.
Semantic matching models are embedding-based (low-dimensional, dense vector representations); see the sketch below.
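A minimal sketch of semantic matching, assuming some encoder (not shown) has already mapped the texts to dense vectors; the 3-d vectors here are made up for illustration:
```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query_vec = np.array([0.9, 0.1, 0.0])   # toy embedding of "automobile"
doc_vec   = np.array([0.8, 0.2, 0.1])   # toy embedding of "car"
# No term overlap between "automobile" and "car", yet the vectors are
# close, so the doc can still be retrieved:
print(cosine(query_vec, doc_vec))        # ~0.98
```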
Relevance factors: how do you measure relevance?
- Term overlap: return docs containing the query terms
- Doc importance: PageRank
- Result popularity: clicks by other users
- Diversification: different types of results
- Semantic similarity: docs with semantic representation close to the query
Comparison between Index Time and Query Time?
Index Time:
* Collect new docs
* Pre-process docs
* Create doc representations
* Store docs in the index
* Indexing can take time
Query Time:
* Process the user query
* Match the query against the index
* Retrieve potentially relevant docs
* Rank docs by relevance score
* Needs to be real-time (see the sketch below)
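A toy sketch of the split, with a dict of term -> doc ids standing in for a real inverted index:
```python
from collections import defaultdict

# Index time (offline, may be slow): pre-process docs and store a
# posting list of doc ids per term.
def build_index(docs: list) -> dict:
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc.lower().split():   # real systems normalize more
            index[term].add(doc_id)
    return index

# Query time (must be fast): touch only the posting lists of the
# query terms instead of scanning every document.
def search(index: dict, query: str) -> set:
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.union(*postings) if postings else set()

docs = ["Cat food reviews", "Dog food brands", "Car parts"]
index = build_index(docs)
print(search(index, "cat food"))   # {0, 1}: doc 0 matches both terms
```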
How to process documents in term-based retrieval model?
- Split docs & queries into terms
- Term normalization (see the sketch below):
lowercase, remove diacritics, remove stop words, stem/lemmatize, map similar words together
All word forms that have a place in the index are called terms
All terms together are the vocabulary of the index
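A toy sketch of this processing; the stop-word list and the suffix-stripping "stemmer" are crude stand-ins for real components such as the Porter stemmer:
```python
import unicodedata

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy list

def strip_diacritics(text: str) -> str:
    # Decompose accented characters and drop combining marks: café -> cafe
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def crude_stem(term: str) -> str:
    # Not a real stemmer: just strips two common suffixes.
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def normalize(text: str) -> list:
    terms = strip_diacritics(text.lower()).split()
    return [crude_stem(t) for t in terms if t not in STOP_WORDS]

print(normalize("The Cafés serving food"))   # ['cafe', 'serv', 'food']
```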
What’s the role of users in IR?
What are possible challenges when users are involved?
- User has information need
- User defines what’s relevant
- User interacts with the search engine
Challenges:
* The real user need is hidden
* User queries are short & ambiguous
* Natural language is unstructured, noisy, sparse, multilingual
Fill in the blanks:
The most widely used application of ranking is web search. In web search, a user enters a ____(1)____ and the search engine returns a list of results. The results are retrieved from an ____(2)____. The result page of the search engine shows a list of short descriptions of the documents. These short descriptions are called ____(3)____.
The most widely used application of ranking is web search. In web search, a user enters a (1) query and the search engine returns a list of results. The results are retrieved from an (2) index. The result page of the search engine shows a list of short descriptions of the documents. These short descriptions are called (3) snippets.
Fill in the blanks:
The results are ranked by their ____(4)____ as estimated by the search engine. An important part of this estimation is exact term matching. The most successful and most used scoring function for exact term matching is ____(5)____. It is still commonly used as an initial ranking model, both in commercial and academic contexts.
The results are ranked by their (4) relevance as estimated by the search engine. An important part of this estimation is exact term matching. The most successful and most used scoring function for exact term matching is (5) BM25. It is still commonly used as an initial ranking model, both in commercial and academic contexts.
Fill in the blanks:
One shortcoming of exact matching is the ____(6)____ mismatch problem, the problem that a relevant document uses different words than the query and is not retrieved. For that reason, exact matching models are commonly combined with “soft” matching or ____(7)____ matching models. These models are based on dense, continuous vector representations called ____(8)____.
One shortcoming of exact matching is the (6) vocabulary mismatch problem, the problem that a relevant document uses different words than the query and is not retrieved. For that reason, exact matching models are commonly combined with “soft” matching or (7) semantic matching models. These models are based on dense, continuous vector representations called (8) embeddings.
Fill in the blanks:
The common architecture for ranking is a two-stage approach: in the first stage, an exact matching model is used to retrieve an initial set of items, and in the second stage these items are re-ranked with a less efficient but more effective ranker. While first-stage rankers are generally ____(9)____ (with only a few free hyperparameters to tune), the second stage rankers are ____(10)____, trained on data with relevance labels.
The common architecture for ranking is a two-stage approach: in the first stage, an exact matching model is used to retrieve an initial set of items, and in the second stage these items are re-ranked with a less efficient but more effective ranker. While first-stage rankers are generally (9) unsupervised (with only a few free hyperparameters to tune), the second stage rankers are (10) supervised, trained on data with relevance labels.