C1 Flashcards

1
Q

information retrieval

A

finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections

2
Q

how do IR systems support the search process?

A
  • analyzing queries and documents
  • retrieving documents by computing a relevance score for each document given a query
  • ranking the documents by that relevance score (true relevance is typically binary)
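
A minimal sketch of this score-then-rank loop; the tokenization and the overlap-count score are illustrative assumptions, not a specific retrieval model:

```python
# Minimal sketch: score every document for a query, then rank by that score.
# The overlap-count scoring function is an illustrative assumption.

def tokenize(text):
    return text.lower().split()

def score(query, document):
    # Toy relevance score: number of query terms that occur in the document.
    doc_terms = set(tokenize(document))
    return sum(term in doc_terms for term in tokenize(query))

def rank(query, documents):
    # Rank all documents by their relevance score (highest first).
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)

docs = ["information retrieval finds documents",
        "recommender systems suggest items",
        "retrieval systems rank documents by relevance"]
print(rank("retrieval of documents", docs))
```
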
3
Q

how does a search engine work? (basic)

A
  • user has an information need
  • user types a query
  • system returns items sorted by relevance to that query, based on the match with the query and on popularity
4
Q

how do recommender systems work? (basic)

A
  • user has an interest
  • user goes to a service/app/website
  • system provides items that are relevant to the user based on user history/profile and popularity
5
Q

4 principles of IR

A

principles of:
- relevance
- ranking
- text processing
- user interaction

6
Q

principles of relevance

A
  • term overlap: return documents that contain the query terms (or related terms?)
  • document importance (PageRank)
  • result popularity: clicks by previous users on the document
  • diversification: different types of results for different interpretations of the query
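
The card lists the signals but not how they are combined; a hedged sketch assuming a simple weighted sum (the weights and signal values are made up, not from a real system):

```python
# Hedged sketch: combine several relevance signals with an assumed weighted sum.
# Weights and input values are illustrative only.

def combined_score(term_overlap, pagerank, clicks, weights=(1.0, 0.5, 0.1)):
    w_overlap, w_pagerank, w_clicks = weights
    return w_overlap * term_overlap + w_pagerank * pagerank + w_clicks * clicks

print(combined_score(term_overlap=3, pagerank=0.8, clicks=12))
```
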
7
Q

evaluation of relevance

A

needed:
- a set of queries
- a document collection
- relevance assessment: set of documents for each query that are labelled as relevant or non-relevant

the retrieval system returns a ranking of all documents

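A minimal sketch of how the assessments and the returned ranking could be put together; precision at a cutoff is one common measure, and the toy judgments and ranking here are assumed:

```python
# Hedged sketch: relevance assessments as a set of relevant doc ids per query,
# used to compute precision@k for a ranking returned by the system.

assessments = {"q1": {"d2", "d5"}}          # assumed toy judgments
ranking = {"q1": ["d5", "d1", "d2", "d7"]}  # assumed system output

def precision_at_k(relevant, ranked, k):
    top_k = ranked[:k]
    return sum(doc in relevant for doc in top_k) / k

print(precision_at_k(assessments["q1"], ranking["q1"], k=3))  # 2/3
```
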
8
Q

principles of ranking

A
  • estimate relevance for each document: we need a score
  • term weighting (basic notion)
  • PageRank
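
Term weighting is only named here; one standard instance is TF-IDF, sketched below over an assumed toy collection:

```python
# Hedged sketch of TF-IDF term weighting over a toy collection.
import math

docs = ["retrieval of documents", "ranking documents by score", "retrieval and ranking"]
tokenized = [d.lower().split() for d in docs]

def tf_idf(term, doc_index):
    tf = tokenized[doc_index].count(term)                 # term frequency in this doc
    df = sum(term in doc for doc in tokenized)            # document frequency
    idf = math.log(len(tokenized) / df) if df else 0.0    # inverse document frequency
    return tf * idf

print(tf_idf("retrieval", 0))
```
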
9
Q

machine learning for ranking

A

Idea: learn the relevance of a document based on human-labelled training data (relevance assessments)

10
Q

Why is machine learning for ranking different from machine learning for classification?

A

Relevance depends on the query, so we cannot train a global classifier over all relevant and irrelevant documents in a labelled dataset
=> need a machine learning paradigm that includes the query

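A hedged sketch of what "a paradigm that includes the query" can look like: training examples are grouped per query, and features are computed from the (query, document) pair rather than the document alone; the feature choice and toy labels are assumptions:

```python
# Hedged sketch: learning-to-rank training data is built per query, with
# features computed from the (query, document) pair, not the document alone.

labelled = [
    ("cheap flights", "doc about budget airlines", 1),   # assumed judgments
    ("cheap flights", "doc about hotel booking", 0),
    ("python tutorial", "doc about python basics", 1),
]

def features(query, document):
    q_terms, d_terms = set(query.split()), set(document.split())
    overlap = len(q_terms & d_terms)
    return [overlap, len(d_terms)]  # query-dependent + query-independent feature

training_set = [(features(q, d), label) for q, d, label in labelled]
print(training_set)
```
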
11
Q

two-stage retrieval first stage

A
  • from large collection
  • unsupervised
  • often term-based (sparse)
  • priority: recall
12
Q

two-stage retrieval second stage

A
  • re-ranking the top-n documents from the first stage
  • supervised
  • often based on embeddings (dense)
  • priority: precision at high ranks
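
A hedged sketch putting the two stages together; the term-overlap first stage and the "dense" re-ranker below are toy stand-ins for BM25 and a real embedding model:

```python
# Hedged sketch of a two-stage pipeline: a cheap, recall-oriented first stage
# over the whole collection, then a more expensive re-ranking of the top-n
# candidates. Both scoring functions are toy stand-ins.

def first_stage(query, collection, n=100):
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in collection]
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:n] if score > 0]

def second_stage(query, candidates):
    # Placeholder for an embedding-based re-ranker; here: fraction of query
    # terms covered, as an assumed stand-in for semantic similarity.
    q_terms = set(query.lower().split())
    def dense_score(doc):
        return len(q_terms & set(doc.lower().split())) / len(q_terms)
    return sorted(candidates, key=dense_score, reverse=True)

collection = ["neural ranking models", "term based retrieval", "ranking with embeddings"]
print(second_stage("neural ranking", first_stage("neural ranking", collection, n=2)))
```
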
13
Q

2 different relevance models for similarity

A
  • term overlap: find documents that contain query words
  • semantic similarity: find documents whose semantic representation is close to the query
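
A hedged sketch of the semantic-similarity view: queries and documents as vectors compared by cosine similarity (the tiny vectors are made up, not produced by a real encoder):

```python
# Hedged sketch: semantic matching compares vector representations rather than
# raw terms. The 3-dimensional vectors are made-up stand-ins for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vec = [0.9, 0.1, 0.3]
doc_vecs = {"doc_about_cars": [0.8, 0.2, 0.4], "doc_about_cooking": [0.1, 0.9, 0.2]}
print({doc: round(cosine(query_vec, vec), 3) for doc, vec in doc_vecs.items()})
```
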
14
Q

index time

A
  • collect (new) documents
  • pre-process the documents
  • create the document representations
  • store the documents in the index
  • indexing can take time
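
A minimal index-time sketch with an inverted index; the pre-processing is reduced to lowercasing and splitting for illustration:

```python
# Hedged sketch of index time: pre-process documents and store them in an
# inverted index (term -> set of document ids). Real pipelines do far more
# pre-processing than lowercasing and whitespace splitting.
from collections import defaultdict

def build_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

documents = {1: "Information retrieval systems", 2: "Ranking documents by relevance"}
index = build_index(documents)
print(index["retrieval"])  # {1}
```
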
15
Q

query time

A
  • process the user query
  • match the query to the index
  • retrieve the documents that are potentially relevant
  • rank the documents by relevance score
  • retrieval must be fast (< 1 sec)
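
A matching query-time sketch: look the query terms up in an (assumed) inverted index, score candidates by the number of matched terms, and rank them:

```python
# Hedged sketch of query time: process the query, look up candidate documents
# in the inverted index, score and rank them. The index literal and the
# matched-term-count scoring are illustrative assumptions.
from collections import Counter

index = {"retrieval": {1}, "ranking": {2}, "documents": {2}, "information": {1}}

def search(query, index):
    matches = Counter()
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            matches[doc_id] += 1          # score = number of matched query terms
    return [doc_id for doc_id, _ in matches.most_common()]

print(search("ranking documents", index))  # [2]
```
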
16
Q

term-based retrieval models

A
  • split documents and queries into terms
  • normalize terms to a common form (lowercase, remove diacritics and stop words, stemming/lemmatization, map similar terms together)
  • most used model is BM25
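
A hedged sketch of the normalization step from the second bullet; the tiny stop list and crude suffix-stripping stand in for a real stop list and stemmer, and BM25 itself is not shown:

```python
# Hedged sketch of term normalization: lowercase, strip diacritics, drop stop
# words, and crudely strip a plural "s" as a stand-in for real stemming.
import unicodedata

STOP_WORDS = {"the", "a", "of", "and", "in"}  # assumed tiny stop list

def normalize(text):
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    terms = [t for t in text.split() if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in terms]

print(normalize("The Rankings of Documents in Résumé search"))
```
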
17
Q

problem with term-based retrieval

A

a vocabulary mismatch between query and document
solution: semantic matching, based on embedding representations of texts

18
Q

the role of users in IR

A
  • user has an information need
  • user defines what is relevant
  • user interacts with search engine
19
Q

challenges with user interaction

A
  • real user is hidden
  • user queries are short and ambiguous (what does the query refer to? what should be the mode of the answer? what type of information is the user interested in?)
  • natural text is unstructured, noisy, redundant, infinite, sparse and multilingual