C5: neural IR Flashcards
masked language modelling
remove a word from a context and train a neural network to predict the masked word (probability distribution over all words in the vocabulary)
self-supervised: trained like a supervised model, but the labels are derived automatically from the text itself, so no manual annotation is needed
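A minimal sketch of masked prediction, assuming the Hugging Face `transformers` package; the checkpoint name and example sentence are illustrative, not from the course material:

```python
# Sketch of masked language modelling with a pre-trained model
# (assumes the Hugging Face `transformers` package; the checkpoint is illustrative).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model produces a probability distribution over the vocabulary for [MASK];
# the pipeline returns the top-scoring candidate words with their probabilities.
for prediction in fill_mask("The inverted [MASK] maps each term to its postings list."):
    print(prediction["token_str"], round(prediction["score"], 3))
```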
2 different term representations
term vector space:
- sparse and high-dimensional (|V|)
- observable
- can be used for exact matching
embeddings:
- dense and lower-dimensional (100s)
- latent
- can be used for inexact matching
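A toy illustration of the two representations, assuming a five-word vocabulary and a random 4-dimensional embedding table (both made up for the example; a real embedding table is learned and has hundreds of dimensions):

```python
import numpy as np

vocab = ["neural", "retrieval", "ranking", "bm25", "bert"]     # toy vocabulary

# Term vector space: sparse, |V|-dimensional, observable counts -> exact matching.
def term_vector(tokens):
    vec = np.zeros(len(vocab))
    for token in tokens:
        if token in vocab:
            vec[vocab.index(token)] += 1
    return vec

# Embeddings: dense, low-dimensional, latent -> inexact (semantic) matching.
# A real model learns this table; here it is random just to show the shape.
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 4))

def embed(tokens):
    rows = [embedding_table[vocab.index(t)] for t in tokens if t in vocab]
    return np.mean(rows, axis=0)

print(term_vector(["neural", "ranking"]))   # mostly zeros, one dimension per vocabulary term
print(embed(["neural", "ranking"]))         # dense 4-dimensional vector
```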
transformer architecture
- sequence-to-sequence (encoder-decoder)
- uses self-attention: computes strength of relation between pairs of input words (dot product)
- can model long-distance term dependencies because the complete input is processed at once
- quadratic complexity in the input length
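A stripped-down self-attention sketch (single head, no learned query/key/value projections), only to show the pairwise dot products and the quadratic cost:

```python
import numpy as np

def self_attention(X):
    """Toy self-attention over X with shape (seq_len, d).

    Every position attends to every other position, which is what lets the
    transformer model long-distance dependencies, at O(seq_len^2) cost.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                        # pairwise dot products (n x n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ X                                   # each output mixes all inputs

X = np.random.default_rng(0).normal(size=(5, 8))         # 5 tokens, 8-dimensional vectors
print(self_attention(X).shape)                           # (5, 8)
```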
BERT
neural network model for generating contextual embeddings
Bidirectional Encoder Representations from Transformers
pre-training and fine-tuning: transfer learning
machine learning for ranking
most straightforward:
1. learn a probabilistic classifier or regression model on query-document pairs with relevance labels
2. apply to unseen pairs and get a score for each query-document pair
3. rank documents per query by prediction score
pointwise learning
learning the relevance value per query-document pair, then sorting by the predicted values
loss function: mean squared error, i.e. the average of the squared differences between the true and predicted relevance scores
limitation: does not consider the relative ranking between items in the same list, only the absolute scores
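A minimal sketch of the pointwise loss in PyTorch; the scores and labels are made-up numbers standing in for model outputs and relevance assessments:

```python
import torch

predicted = torch.tensor([0.8, 0.3, 0.6])    # model scores for three (q, d) pairs
labels    = torch.tensor([1.0, 0.0, 1.0])    # relevance labels for the same pairs

# Mean squared error: average of squared differences between true and predicted scores.
pointwise_loss = torch.mean((predicted - labels) ** 2)
print(pointwise_loss.item())                 # (0.04 + 0.09 + 0.16) / 3 = 0.0967
```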
pairwise learning
consider pairs of relevant and nonrelevant documents for the same query and minimize the number of incorrect inversions in the ranking
loss function: pairwise hinge loss
L_hinge = max(0, 1 - (score(q,d_i) - score(q,d_j)))
and sum over all pairs (d_i, d_j) with d_i more relevant than d_j
limitation: every document pair is treated as equally important, but misrankings at higher rank positions are more severe
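A sketch of the pairwise hinge loss; the scores are illustrative, with d_i the more relevant document in each pair:

```python
import torch

def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    # L_hinge = max(0, margin - (score(q, d_i) - score(q, d_j))),
    # summed over all pairs where d_i is more relevant than d_j.
    return torch.clamp(margin - (score_pos - score_neg), min=0).sum()

score_pos = torch.tensor([2.0, 0.4])   # scores of the relevant documents d_i
score_neg = torch.tensor([1.5, 0.9])   # scores of the non-relevant documents d_j
print(pairwise_hinge_loss(score_pos, score_neg).item())   # 0.5 + 1.5 = 2.0
```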
two-stage retrieval
use an embedding model to re-rank the top documents retrieved by a lexical IR model
step 1: lexical retrieval from the whole corpus with BM25 or a language model
step 2: re-ranking of the top-n retrieved documents with a supervised BERT model (see the sketch below)
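A schematic two-stage pipeline; `lexical_retrieve` and `neural_score` are hypothetical placeholders for the BM25/LM first stage and the fine-tuned BERT re-ranker:

```python
def two_stage_retrieval(query, lexical_retrieve, neural_score, n=1000, k=10):
    """Hypothetical pipeline: cheap lexical recall, then expensive neural re-ranking."""
    candidates = lexical_retrieve(query, top_n=n)                         # step 1: whole-corpus BM25/LM
    rescored = [(doc, neural_score(query, doc)) for doc in candidates]    # step 2: re-rank top n
    rescored.sort(key=lambda pair: pair[1], reverse=True)                 # rank by neural score
    return rescored[:k]
```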
monoBERT
two-input classification (cross-encoder): query and candidate document are fed jointly as one input
output: the representation of the [CLS] token
- used as input to a single-layer fully connected neural network
- followed by a softmax for the relevance classification
- to obtain the probability that candidate d is relevant to q
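A minimal monoBERT-style scoring sketch with Hugging Face `transformers`; `bert-base-uncased` stands in for a checkpoint that has actually been fine-tuned on relevance labels, so the probability printed here is meaningless until fine-tuning:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"    # stand-in for a fine-tuned monoBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

query = "neural ranking models"
doc = "BERT-based re-rankers score query-document pairs with a cross-encoder."

# Query and document form one two-segment input: [CLS] query [SEP] document [SEP].
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits              # classifier on top of the [CLS] representation
relevance_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(relevance_prob)
```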
BERT ranking and inference set-up
training:
1. start with pre-trained BERT model
2. fine-tune on query-document pairs with relevance assessments (see the fine-tuning sketch after this card)
inference:
1. retrieve 100-1000 documents with a lexical retrieval model
2. apply the trained monoBERT to all retrieved (q, d) pairs and output a relevance score
3. for each query, rank the documents by this score
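A sketch of a single fine-tuning step for training step 2 above; checkpoint, learning rate, example pair, and label are illustrative assumptions:

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

inputs = tokenizer("neural ranking models",                        # query
                   "BERT re-rankers score query-document pairs.",  # document
                   return_tensors="pt", truncation=True, max_length=512)
labels = torch.tensor([1])                          # relevance assessment: 1 = relevant

loss = model(**inputs, labels=labels).loss          # cross-entropy over the [CLS] classifier
loss.backward()                                     # one gradient step of fine-tuning
optimizer.step()
print(loss.item())
```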
BERT has a maximum input length of 512 tokens. How to deal with longer documents?
truncate the input texts, or split documents into passages (see the splitting sketch below) => challenges:
- training: relevance labels are given at the document level => what do we feed the model as passage-level training data?
- inference: we need to aggregate the scores per passage into a document score
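A sliding-window passage splitter sketch; the window and stride sizes are arbitrary example values chosen so that a passage plus the query plus special tokens stay under the 512-token limit:

```python
def split_into_passages(tokens, window=225, stride=200):
    """Split a tokenized document into overlapping passages."""
    passages, start = [], 0
    while True:
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride                              # overlap of window - stride tokens
    return passages

doc_tokens = [f"tok{i}" for i in range(1000)]        # stand-in for a tokenized long document
print(len(split_into_passages(doc_tokens)))          # 5 overlapping passages
```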
how can we aggregate passage scores?
BERT-MaxP: passage score aggregation
- training: treat all passages from a relevant document as relevant and all passages from a non-relevant document as not relevant
- inference: estimate the relevance of each passage, then take the maximum passage score (MaxP) as the document score
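A tiny MaxP aggregation sketch with made-up passage scores:

```python
passage_scores = {
    "doc1": [0.2, 0.9, 0.4],     # per-passage BERT scores for doc1
    "doc2": [0.5, 0.6],
}

# Document score = maximum passage score; then rank documents by it.
doc_scores = {doc: max(scores) for doc, scores in passage_scores.items()}
ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
print(ranking)                   # ['doc1', 'doc2']
```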
PARADE: passage representation aggregation
- training: the same as for BERT-MaxP (document-level labels propagated to the passages)
- inference: aggregate the representations of the passages rather than their individual scores, e.g. by averaging the [CLS] representation of each passage (see the sketch below)
OR use passage-level relevance labels or transformer architectures for long texts
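A PARADE-style sketch that averages per-passage [CLS] representations before scoring; the vectors and the linear scoring layer are random stand-ins for learned components:

```python
import numpy as np

rng = np.random.default_rng(0)
cls_vectors = rng.normal(size=(3, 768))          # [CLS] vectors of 3 passages (random stand-ins)

doc_representation = cls_vectors.mean(axis=0)    # aggregate representations, not scores

w = rng.normal(size=768)                         # stand-in for a learned scoring layer
doc_score = float(doc_representation @ w)        # single relevance score for the whole document
print(doc_score)
```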