C5: neural IR Flashcards

1
Q

masked language modelling

A

remove a word from its context and train a neural network to predict the masked word (as a probability distribution over all words in the vocabulary)

self-supervised: trained like a supervised model, but the labels come from the text itself, so we don't need to provide them ourselves
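
A minimal sketch of masked-word prediction using the Hugging Face transformers library (the model name and example sentence are illustrative choices):

```python
# Minimal fill-mask sketch; model name and sentence are illustrative.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model outputs a probability distribution over the vocabulary
# for the [MASK] position; we print the top candidates.
for pred in unmasker("The inverted [MASK] maps terms to documents."):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```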

2
Q

2 different term representations

A

term vector space:
- sparse and high-dimensional (|V|)
- observable
- can be used for exact matching

embeddings:
- dense and lower-dimensional (100s)
- latent
- can be used for inexact matching
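
A toy sketch of the contrast; the vocabulary, counts and embedding values are made up for illustration:

```python
import numpy as np

vocab = ["cat", "dog", "car", "road", "pet"]  # |V| = 5 here; real vocabularies are far larger

# Term vector space: sparse, |V|-dimensional, observable term counts.
# "cat" and "pet" occupy different dimensions, so only exact matches count.
doc = ["cat", "cat", "pet"]
term_vector = np.array([doc.count(t) for t in vocab])
print(term_vector)  # [2 0 0 0 1]

# Embeddings: dense, low-dimensional, latent. Related words end up
# close together, which enables inexact matching.
emb = {"cat": np.array([0.8, 0.1, 0.3]),
       "pet": np.array([0.7, 0.2, 0.4])}
print(emb["cat"] @ emb["pet"])  # high similarity despite different terms
```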

3
Q

transformer architecture

A

- sequence-to-sequence (encoder-decoder)
- uses self-attention: computes the strength of the relation between pairs of input words (dot product)
- can model long-distance term dependencies because the complete input is processed at once
- quadratic complexity in the input length
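
A minimal NumPy sketch of scaled dot-product self-attention (dimensions and random weights are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project inputs to queries, keys and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Dot products give the strength of relation between all word
    # pairs at once: an n x n matrix, hence quadratic complexity.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V  # contextualised output per input word

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                 # 5 input words, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```
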
4
Q

BERT

A

neural network model for generating contextual embeddings

Bidirectional Encoder Representations from Transformers

pre-training and fine-tuning: transfer learning
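
A minimal sketch of obtaining contextual embeddings from pre-trained BERT with the transformers library (the model name is an illustrative choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Unlike static word embeddings, the vector for "bank" here depends
# on its sentence context.
inputs = tokenizer("she sat on the bank of the river", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, number_of_tokens, 768)
```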

5
Q

machine learning for ranking

A

most straightforward:
1. learn a probabilistic classifier or regression model on query-document pairs with relevance labels
2. apply to unseen pairs and get a score for each query-document pair
3. rank documents per query by prediction score
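
A toy sketch of this pipeline with scikit-learn; the feature vectors (e.g. BM25 score, query length, document length) and relevance labels are hypothetical placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[12.1, 3, 250],   # one row per query-document pair
                    [0.4, 3, 90],
                    [7.8, 2, 400]])
y_train = np.array([2.0, 0.0, 1.0])   # graded relevance labels

model = LinearRegression().fit(X_train, y_train)

# Score unseen pairs for one query, then rank documents by score.
X_test = np.array([[5.0, 2, 120], [9.3, 2, 310]])
scores = model.predict(X_test)
print(np.argsort(-scores))            # document indices, best first
```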

6
Q

pointwise learning

A

learning the relevance value per query-document pair, then sorting by the predicted values

loss function: mean squared error (square the difference between each true and predicted score, then take the average over all pairs)

limitation: does not consider the relative ranking between items in the same list, only absolute scores
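
A toy sketch of this loss (the scores are made up):

```python
import numpy as np

true_scores = np.array([2.0, 0.0, 1.0])
pred_scores = np.array([1.5, 0.5, 1.0])

# Mean of the squared differences between true and predicted scores.
mse = np.mean((true_scores - pred_scores) ** 2)
print(mse)  # 0.1666...
```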

7
Q

pairwise learning

A

consider pairs of relevant and non-relevant documents for the same query and minimize the number of incorrect inversions in the ranking

loss function: pairwise hinge loss
L_hinge = max(0, 1 - (score(q,d_i) - score(q,d_j)))
and sum over all pairs (d_i, d_j) with d_i more relevant than d_j

limitations: every document pair is treated as equally important, but misrankings in higher positions are more severe
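
A toy sketch of this loss summed over pairs (scores and relevance order are made up):

```python
def pairwise_hinge(score_pos, score_neg):
    # Zero loss only once the more relevant document outscores the
    # less relevant one by a margin of at least 1.
    return max(0.0, 1.0 - (score_pos - score_neg))

# Model scores per document; pairs list (d_i, d_j) with d_i more
# relevant than d_j.
scores = {"d1": 2.1, "d2": 1.8, "d3": 0.3}
more_relevant = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]

loss = sum(pairwise_hinge(scores[di], scores[dj])
           for di, dj in more_relevant)
print(loss)  # 0.7 (only the d1/d2 pair violates the margin)
```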

8
Q

two-stage retrieval

A

use an embedding model to re-rank the top documents retrieved by a lexical IR model

step 1: lexical retrieval from the whole corpus with BM25 or LM
step 2: re-ranking of the top-n retrieved documents with a supervised BERT model
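
A sketch of the two stages using the rank_bm25 and sentence-transformers libraries; the corpus, query and checkpoint name are illustrative:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = ["bm25 is a lexical retrieval model",
          "bert produces contextual embeddings",
          "cross-encoders re-rank candidate documents"]
query = "neural re-ranking with bert"

# Stage 1: lexical retrieval over the whole corpus with BM25.
bm25 = BM25Okapi([doc.split() for doc in corpus])
top_n = bm25.get_top_n(query.split(), corpus, n=2)

# Stage 2: re-rank the top-n candidates with a supervised
# cross-encoder (a fine-tuned BERT-style model).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in top_n])
print([doc for _, doc in sorted(zip(scores, top_n), reverse=True)])
```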

9
Q

monoBERT

A

two-input classification (cross-encoder)

output is the representation of the [CLS] token
- used as input to single-layer fully connected neural network
- to obtain the probability that candidate d is relevant to q
- followed by softmax for the relevance classification
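
A minimal scoring sketch with the transformers library; the public checkpoint used here emits a single relevance logit, whereas a two-class head as described above would apply softmax over its two logits (the checkpoint choice is illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Query and candidate document are packed into one two-segment input;
# the classification head sits on top of the [CLS] representation.
inputs = tokenizer("what is bm25", "bm25 is a lexical retrieval model",
                   return_tensors="pt")
with torch.no_grad():
    print(model(**inputs).logits)  # higher = more relevant
```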

10
Q

BERT ranking: training and inference set-up

A

training:
1. start with pre-trained BERT model
2. fine-tune on query-document pairs with relevance assessments

inference:
1. retrieve 100-1000 documents with a lexical retrieval model
2. apply the trained monoBERT to all retrieved (q, d) pairs and output a score
3. for each query, rank the documents by this score
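
A toy fine-tuning sketch; the training triples and hyperparameters are hypothetical placeholders, and a real set-up would batch the data and iterate over many epochs:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"  # pre-trained starting point
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# (query, document, relevance label) triples; made up for illustration.
train_triples = [("what is bm25", "bm25 is a lexical model", 1),
                 ("what is bm25", "bert is a transformer", 0)]

model.train()
for query, doc, label in train_triples:
    batch = tokenizer(query, doc, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```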

11
Q

BERT has a maximum input length of 512 tokens. How to deal with longer documents?

A

truncate input texts, or split documents into passages => challenges:
- training: labels are given at the document level => unclear what to feed the model
- inference: we need to aggregate the per-passage scores into a document score
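
A sketch of splitting a document into overlapping passages that fit BERT's input limit (window and stride sizes are illustrative):

```python
def split_into_passages(tokens, window=400, stride=200):
    # Overlap (stride < window) avoids cutting relevant evidence
    # exactly at a passage boundary.
    passages = []
    for start in range(0, max(len(tokens) - window, 0) + 1, stride):
        passages.append(tokens[start:start + window])
    return passages

doc_tokens = ["tok"] * 1000
print([len(p) for p in split_into_passages(doc_tokens)])  # [400, 400, 400, 400]
```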

12
Q

how can we aggregate passage scores?

A

BERT-MaxP: passage score aggregation
- training: treat all passages from a relevant document as relevant and all passages from a non-relevant document as not relevant
- inference: estimate the relevance of each passage, then take the maximum passage score (MaxP) as the document score

PARADE: passage representation aggregation
- training: the same as for BERT-MaxP
- inference: aggregate the representations of passages rather than aggregating the scores of individual passages (averaging the [CLS] representation from each passage)

alternatively: use passage-level relevance labels, or transformer architectures designed for long texts
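
A toy sketch of MaxP score aggregation at inference time (the passage scores are made up):

```python
# Per-document lists of estimated passage relevance scores.
passage_scores = {"docA": [0.2, 0.9, 0.4],
                  "docB": [0.6, 0.5]}

# BERT-MaxP: the document score is the maximum passage score.
doc_scores = {doc: max(scores) for doc, scores in passage_scores.items()}
print(sorted(doc_scores, key=doc_scores.get, reverse=True))  # ['docA', 'docB']
```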
