C5: neural IR Flashcards
masked language modelling
remove a word from a context and train a neural network to predict the masked word (probability distribution over all words in the vocabulary)
self-supervised: trained like a supervised model, but the labels are derived automatically from the text itself, so no manual annotation is needed
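A minimal sketch of masked prediction, assuming the Hugging Face `transformers` package; the checkpoint name and example sentence are illustrative, not from the course material:

```python
# Sketch of masked language modelling with a pre-trained model
# (assumes the Hugging Face `transformers` package; the checkpoint is illustrative).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model produces a probability distribution over the vocabulary for [MASK];
# the pipeline returns the top-scoring candidate words with their probabilities.
for prediction in fill_mask("The inverted [MASK] maps each term to its postings list."):
    print(prediction["token_str"], round(prediction["score"], 3))
```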
2 different term representations
term vector space:
- sparse and high-dimensional (|V|)
- observable
- can be used for exact matching
embeddings:
- dense and lower-dimensional (100s)
- latent
- can be used for inexact matching
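A toy illustration of the two representations, assuming a five-word vocabulary and a random 4-dimensional embedding table (both made up for the example; a real embedding table is learned and has hundreds of dimensions):

```python
import numpy as np

vocab = ["neural", "retrieval", "ranking", "bm25", "bert"]     # toy vocabulary

# Term vector space: sparse, |V|-dimensional, observable counts -> exact matching.
def term_vector(tokens):
    vec = np.zeros(len(vocab))
    for token in tokens:
        if token in vocab:
            vec[vocab.index(token)] += 1
    return vec

# Embeddings: dense, low-dimensional, latent -> inexact (semantic) matching.
# A real model learns this table; here it is random just to show the shape.
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 4))

def embed(tokens):
    rows = [embedding_table[vocab.index(t)] for t in tokens if t in vocab]
    return np.mean(rows, axis=0)

print(term_vector(["neural", "ranking"]))   # mostly zeros, one dimension per vocabulary term
print(embed(["neural", "ranking"]))         # dense 4-dimensional vector
```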
transformer architecture
- sequence-to-sequence (encoder-decoder)
- uses self-attention: computes strength of relation between pairs of input words (dot product)
- can model long-distance term dependencies because the complete input is processed at once
- quadratic complexity in the input length
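A stripped-down self-attention sketch (single head, no learned query/key/value projections), only to show the pairwise dot products and the quadratic cost:

```python
import numpy as np

def self_attention(X):
    """Toy self-attention over X with shape (seq_len, d).

    Every position attends to every other position, which is what lets the
    transformer model long-distance dependencies, at O(seq_len^2) cost.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                        # pairwise dot products (n x n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ X                                   # each output mixes all inputs

X = np.random.default_rng(0).normal(size=(5, 8))         # 5 tokens, 8-dimensional vectors
print(self_attention(X).shape)                           # (5, 8)
```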
BERT
neural network model for generating contextual embeddings
Bidirectional Encoder Representations from Transformers
pre-training and fine-tuning: transfer learning
machine learning for ranking
most straightforward:
1. learn a probabilistic classifier or regression model on query-document pairs with relevance labels
2. apply to unseen pairs and get a score for each query-document pair
3. rank documents per query by prediction score
pointwise learning
learning the relevance value per query-document pair, then sorting by the predicted values
loss function: mean squared error, i.e. the average of the squared differences between the true and predicted relevance scores
limitation: does not consider the relative ranking between items in the same list, only the absolute scores
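A minimal sketch of the pointwise loss in PyTorch; the scores and labels are made-up numbers standing in for model outputs and relevance assessments:

```python
import torch

predicted = torch.tensor([0.8, 0.3, 0.6])    # model scores for three (q, d) pairs
labels    = torch.tensor([1.0, 0.0, 1.0])    # relevance labels for the same pairs

# Mean squared error: average of squared differences between true and predicted scores.
pointwise_loss = torch.mean((predicted - labels) ** 2)
print(pointwise_loss.item())                 # (0.04 + 0.09 + 0.16) / 3 = 0.0967
```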
pairwise learning
consider pairs of relevant and nonrelevant documents for the same query and minimize the number of incorrect inversions in the ranking
loss function: pairwise hinge loss
L_hinge = max(0, 1 - (score(q,d_i) - score(q,d_j)))
and sum over all pairs (d_i, d_j) with d_i more relevant than d_j
limitation: every document pair is treated as equally important, but misrankings at higher rank positions are more severe
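A sketch of the pairwise hinge loss; the scores are illustrative, with d_i the more relevant document in each pair:

```python
import torch

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    # L_hinge = max(0, margin - (score(q, d_i) - score(q, d_j))),
    # summed over all pairs where d_i is more relevant than d_j.
    return torch.clamp(margin - (score_pos - score_neg), min=0).sum()

score_pos = torch.tensor([2.0, 0.4])   # scores of the relevant documents d_i
score_neg = torch.tensor([1.5, 0.9])   # scores of the non-relevant documents d_j
print(pairwise_hinge_loss(score_pos, score_neg).item())   # 0.5 + 1.5 = 2.0
```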
two-stage retrieval
use an embedding model to re-rank the top documents retrieved by a lexical IR model
step 1: lexical retrieval from the whole corpus with BM25 or a language model
step 2: re-ranking of the top-n retrieved documents with a supervised BERT model (see the sketch below)
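A schematic two-stage pipeline; `lexical_retrieve` and `neural_score` are hypothetical placeholders for the BM25/LM first stage and the fine-tuned BERT re-ranker:

```python
def two_stage_retrieval(query, lexical_retrieve, neural_score, n=1000, k=10):
    """Hypothetical pipeline: cheap lexical recall, then expensive neural re-ranking."""
    candidates = lexical_retrieve(query, top_n=n)                         # step 1: whole-corpus BM25/LM
    rescored = [(doc, neural_score(query, doc)) for doc in candidates]    # step 2: re-rank top n
    rescored.sort(key=lambda pair: pair[1], reverse=True)                 # rank by neural score
    return rescored[:k]
```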
monoBERT
two-input classification (cross-encoder): query and candidate document are fed jointly as one input
output: the representation of the [CLS] token
- used as input to a single-layer fully connected neural network
- followed by a softmax for the relevance classification
- to obtain the probability that candidate d is relevant to q
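A minimal monoBERT-style scoring sketch with Hugging Face `transformers`; `bert-base-uncased` stands in for a checkpoint that has actually been fine-tuned on relevance labels, so the probability printed here is meaningless until fine-tuning:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"    # stand-in for a fine-tuned monoBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

query = "neural ranking models"
doc = "BERT-based re-rankers score query-document pairs with a cross-encoder."

# Query and document form one two-segment input: [CLS] query [SEP] document [SEP].
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits              # classifier on top of the [CLS] representation
relevance_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(relevance_prob)
```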
BERT ranking and inference set-up
training:
1. start with pre-trained BERT model
2. fine-tune on query-document pairs with relevance assessments (see the fine-tuning sketch after this card)
inference:
1. retrieve 100-1000 documents with a lexical retrieval model
2. apply the trained monoBERT to all retrieved (q, d) pairs and output a relevance score
3. for each query, rank the documents by this score
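A sketch of a single fine-tuning step for training step 2 above; checkpoint, learning rate, example pair, and label are illustrative assumptions:

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

inputs = tokenizer("neural ranking models",                        # query
                   "BERT re-rankers score query-document pairs.",  # document
                   return_tensors="pt", truncation=True, max_length=512)
labels = torch.tensor([1])                          # relevance assessment: 1 = relevant

loss = model(**inputs, labels=labels).loss          # cross-entropy over the [CLS] classifier
loss.backward()                                     # one gradient step of fine-tuning
optimizer.step()
print(loss.item())
```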
BERT has a maximum input length of 512 tokens. How to deal with longer documents?
truncate the input texts, or split documents into passages (see the splitting sketch below) => challenges:
- training: relevance labels are given at the document level => what do we feed the model as passage-level training data?
- inference: we need to aggregate the scores per passage into a document score
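A sliding-window passage splitter sketch; the window and stride sizes are arbitrary example values chosen so that a passage plus the query plus special tokens stay under the 512-token limit:

```python
def split_into_passages(tokens, window=225, stride=200):
    """Split a tokenized document into overlapping passages."""
    passages, start = [], 0
    while True:
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride                              # overlap of window - stride tokens
    return passages

doc_tokens = [f"tok{i}" for i in range(1000)]        # stand-in for a tokenized long document
print(len(split_into_passages(doc_tokens)))          # 5 overlapping passages
```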
how can we aggregate passage scores?
BERT-MaxP: passage score aggregation
- training: treat all passages from a relevant document as relevant and all passages from a non-relevant document as not relevant
- inference: estimate the relevance of each passage, then take the maximum passage score (MaxP) as the document score
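A tiny MaxP aggregation sketch with made-up passage scores:

```python
passage_scores = {
    "doc1": [0.2, 0.9, 0.4],     # per-passage BERT scores for doc1
    "doc2": [0.5, 0.6],
}

# Document score = maximum passage score; then rank documents by it.
doc_scores = {doc: max(scores) for doc, scores in passage_scores.items()}
ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
print(ranking)                   # ['doc1', 'doc2']
```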
PARADE: passage representation aggregation
- training: the same as for BERT-MaxP (document-level labels propagated to the passages)
- inference: aggregate the representations of the passages rather than their individual scores, e.g. by averaging the [CLS] representation of each passage (see the sketch below)
OR use passage-level relevance labels or transformer architectures for long texts
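A PARADE-style sketch that averages per-passage [CLS] representations before scoring; the vectors and the linear scoring layer are random stand-ins for learned components:

```python
import numpy as np

rng = np.random.default_rng(0)
cls_vectors = rng.normal(size=(3, 768))          # [CLS] vectors of 3 passages (random stand-ins)

doc_representation = cls_vectors.mean(axis=0)    # aggregate representations, not scores

w = rng.normal(size=768)                         # stand-in for a learned scoring layer
doc_score = float(doc_representation @ w)        # single relevance score for the whole document
print(doc_score)
```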