W1 Intro Flashcards
What’s Information Retrieval?
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).
Key words:
- Collection of unstructured documents
- User
- Information need
- Relevance
- Query
What does an IR system generally do?
- Analyze queries & docs
- Retrieve docs: compute a relevance score for each (query, doc) pair
- Rank docs by score (see the sketch below)
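A minimal sketch of this score-then-rank loop, with naive term overlap standing in for a real scoring function such as BM25:
```python
# Toy analyze -> score -> rank loop; overlap counting is only a
# stand-in for a real relevance function.
def overlap_score(query_terms: set, doc_terms: set) -> int:
    # Relevance score of one (query, doc) pair: number of shared terms.
    return len(query_terms & doc_terms)

def rank(query: str, docs: list) -> list:
    query_terms = set(query.lower().split())                 # analyze query
    scored = [(overlap_score(query_terms, set(d.lower().split())), d)
              for d in docs]                                 # score each doc
    scored.sort(key=lambda pair: pair[0], reverse=True)      # rank by score
    return [doc for _, doc in scored]

print(rank("cat food", ["dog food brands", "cat food reviews", "car parts"]))
# -> ['cat food reviews', 'dog food brands', 'car parts']
```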
Comparison between search engines and recommender systems?
Search Engine:
* User has an information need
* User types a query
* System returns docs ranked by relevance to the query (based on match/popularity)
Recommender System:
* User has an interest and goes to a certain platform
* System returns items relevant to the user (based on profile/popularity)
4 principles of IR
Principles of…
- Relevance
- Ranking
- Text Processing
- User Interaction
What’s Two-stage Retrieval?
Query -> Initial retrieval from the full collection -> Re-ranking the top-n results with a trained re-ranker (sketched below)
1st stage:
retrieves from the large collection; unsupervised, term-based (sparse)
priority is RECALL
2nd stage:
re-ranks the top-n docs from the 1st stage; supervised, embedding-based (dense)
priority is PRECISION
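A minimal sketch of the architecture; the two scoring functions are passed in as parameters since the cards cover them separately (in practice stage 1 would be BM25 over an inverted index, stage 2 a trained model such as a cross-encoder):
```python
from typing import Callable

def two_stage_search(query: str,
                     collection: list,
                     sparse_score: Callable[[str, str], float],
                     rerank_score: Callable[[str, str], float],
                     n: int = 100,
                     k: int = 10) -> list:
    # Stage 1 (recall): cheap term-based scoring over the full
    # collection; keep the top-n candidates.
    candidates = sorted(collection,
                        key=lambda doc: sparse_score(query, doc),
                        reverse=True)[:n]
    # Stage 2 (precision): expensive supervised re-scoring of only
    # those n candidates; return the top-k.
    return sorted(candidates,
                  key=lambda doc: rerank_score(query, doc),
                  reverse=True)[:k]
```
Sorting the full collection is itself a simplification: a real first stage uses an inverted index so only docs sharing a term with the query are scored at all.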
1) What’s the most used term-based retrieval model?
2) What mechanism does it use?
3) What’s the problem of the model?
1) BM25
2) exact match, term weighting (tf-idf)
3) BM25 relies on exact matching, but searchers often use different terms to describe their information need than the authors of the relevant documents used (the vocabulary mismatch problem)
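A sketch of the BM25 scoring function itself (standard Okapi form with the common "+1 inside the log" IDF variant; exact IDF details vary slightly across implementations):
```python
import math

def bm25(query_terms: list, doc_terms: list,
         doc_freq: dict, num_docs: int, avg_doc_len: float,
         k1: float = 1.5, b: float = 0.75) -> float:
    # k1 and b are the model's only free hyperparameters.
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)       # term frequency in this doc
        if tf == 0:
            continue                     # exact match only: no credit for synonyms
        df = doc_freq.get(term, 0)       # number of docs containing the term
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        # tf saturation (k1) and doc-length normalization (b):
        norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm_tf
    return score
```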
How to solve exact term matching problem?
Use semantic matching.
Semantic matching models are embedding-based (low-dimensional, dense vector representations); see the sketch below.
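A minimal sketch of semantic matching, assuming some encoder (not shown) has already mapped the texts to dense vectors; the 3-d vectors here are made up for illustration:
```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query_vec = np.array([0.9, 0.1, 0.0])   # toy embedding of "automobile"
doc_vec   = np.array([0.8, 0.2, 0.1])   # toy embedding of "car"
# No term overlap between "automobile" and "car", yet the vectors are
# close, so the doc can still be retrieved:
print(cosine(query_vec, doc_vec))        # ~0.98
```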
Relevance factors: how do you measure relevance?
- Term overlap: return docs containing the query terms
- Doc importance: PageRank
- Result popularity: clicks by other users
- Diversification: different types of results
- Semantic similarity: docs with semantic representation close to the query
Comparison between Index Time and Query Time?
Index Time:
* Collect new docs
* Pre-process docs
* Create doc representations
* Store docs in the index
* Indexing can take time
Query Time:
* Process the user query
* Match the query against the index
* Retrieve potentially relevant docs
* Rank docs by relevance score
* Needs to be real-time (see the sketch below)
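A toy sketch of the split, with a dict of term -> doc ids standing in for a real inverted index:
```python
from collections import defaultdict

# Index time (offline, may be slow): pre-process docs and store a
# posting list of doc ids per term.
def build_index(docs: list) -> dict:
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc.lower().split():   # real systems normalize more
            index[term].add(doc_id)
    return index

# Query time (must be fast): touch only the posting lists of the
# query terms instead of scanning every document.
def search(index: dict, query: str) -> set:
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.union(*postings) if postings else set()

docs = ["Cat food reviews", "Dog food brands", "Car parts"]
index = build_index(docs)
print(search(index, "cat food"))   # {0, 1}: doc 0 matches both terms
```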
How to process documents in term-based retrieval model?
- Split docs & queries into terms
- Term normalization (see the sketch below):
lowercase, remove diacritics, remove stop words, stem/lemmatize, map similar words together
All word forms that have a place in the index are called terms
All terms together are the vocabulary of the index
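A toy sketch of this processing; the stop-word list and the suffix-stripping "stemmer" are crude stand-ins for real components such as the Porter stemmer:
```python
import unicodedata

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy list

def strip_diacritics(text: str) -> str:
    # Decompose accented characters and drop combining marks: café -> cafe
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def crude_stem(term: str) -> str:
    # Not a real stemmer: just strips two common suffixes.
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def normalize(text: str) -> list:
    terms = strip_diacritics(text.lower()).split()
    return [crude_stem(t) for t in terms if t not in STOP_WORDS]

print(normalize("The Cafés serving food"))   # ['cafe', 'serv', 'food']
```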
What’s the role of users in IR?
What are possible challenges when users are involved?
- User has information need
- User defines what’s relevant
- User interacts with the search engine
Challenges:
* The real user need is hidden
* User queries are short & ambiguous
* Natural language is unstructured, noisy, sparse, multilingual
Fill in the blanks:
The most widely used application of ranking is web search. In web search, a user enters a ____(1)____ and the search engine returns a list of results. The results are retrieved from an ____(2)____. The result page of the search engine shows a list of short descriptions of the documents. These short descriptions are called ____(3)____.
The most widely used application of ranking is web search. In web search, a user enters a (1) query and the search engine returns a list of results. The results are retrieved from an (2) index. The result page of the search engine shows a list of short descriptions of the documents. These short descriptions are called (3) snippets.
Fill in the blanks:
The results are ranked by their ____(4)____ as estimated by the search engine. An important part of this estimation is exact term matching. The most successful and most used scoring function for exact term matching is ____(5)____. It is still commonly used as an initial ranking model, both in commercial and academic contexts.
The results are ranked by their (4) relevance as estimated by the search engine. An important part of this estimation is exact term matching. The most successful and most used scoring function for exact term matching is (5) BM25. It is still commonly used as an initial ranking model, both in commercial and academic contexts.
Fill in the blanks:
One shortcoming of exact matching is the ____(6)____ mismatch problem, the problem that a relevant document uses different words than the query and is not retrieved. For that reason, exact matching models are commonly combined with “soft” matching or ____(7)____ matching models. These models are based on dense, continuous vector representations called ____(8)____.
One shortcoming of exact matching is the (6) vocabulary mismatch problem, the problem that a relevant document uses different words than the query and is not retrieved. For that reason, exact matching models are commonly combined with “soft” matching or (7) semantic matching models. These models are based on dense, continuous vector representations called (8) embeddings.
Fill in the blanks:
The common architecture for ranking is a two-stage approach: in the first stage, an exact matching model is used to retrieve an initial set of items, and in the second stage these items are re-ranked with a less efficient but more effective ranker. While first-stage rankers are generally ____(9)____ (with only a few free hyperparameters to tune), the second stage rankers are ____(10)____, trained on data with relevance labels.
The common architecture for ranking is a two-stage approach: in the first stage, an exact matching model is used to retrieve an initial set of items, and in the second stage these items are re-ranked with a less efficient but more effective ranker. While first-stage rankers are generally (9) unsupervised (with only a few free hyperparameters to tune), the second stage rankers are (10) supervised, trained on data with relevance labels.