C9: neural IR Flashcards
why would we want dense retrieval?
sometimes we need exact matching, but often we also want inexact matching of documents to queries: if we only use exact matching in the 1st stage we might miss relevant documents
what is dense retrieval?
neural first-stage retrieval, i.e. retrieval based on dense embeddings
- bi-encoder architecture: encoding the query and document independently, then computing the relevance
bi-encoder architecture: 3 steps
- generate a representation of the query that captures the information need
- generate a representation of the document that captures the information contained
- match the query and the document representations to estimate their mutual relevance
how do we measure the relevance between query and document?
use a function to compute the similarity between the query and the document representation vectors
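for example cosine similarity; a minimal numpy sketch with made-up toy vectors:

```python
import numpy as np

def cosine_similarity(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Cosine similarity between a query and a document embedding."""
    return float(np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

# hypothetical 4-dimensional embeddings, just for illustration
q = np.array([0.1, 0.8, 0.3, 0.2])
d = np.array([0.2, 0.7, 0.1, 0.4])
print(cosine_similarity(q, d))  # higher score = estimated more relevant
```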
what are the 4 differences between cross-encoders and bi-encoders?
- cross: one encoder for q and d together; bi: separate encoders for q and d
- cross: full interaction between words in q and d; bi: no interaction between words in q and d
- cross: higher quality ranker than bi
- cross: only possible in re-ranking; bi: highly efficient (also in 1st stage)
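a minimal sketch of the contrast using the sentence-transformers library (the checkpoint names are common examples, not prescribed here):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

q = "what is dense retrieval?"
d = "Dense retrieval encodes queries and documents as embedding vectors."

# bi-encoder: encode q and d independently, then compute similarity afterwards
bi = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint
score_bi = util.cos_sim(bi.encode(q), bi.encode(d))

# cross-encoder: q and d are fed jointly, so every q token attends to every d token
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint
score_cross = cross.predict([(q, d)])
```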
Sentence-BERT
a commonly used bi-encoder, originally designed for sentence similarity but also usable for (q, d) pairs
it is a pointwise model, because we only take one d into account per learning item. At inference we measure the similarity between q and each d and then sort the docs by this similarity
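a minimal pointwise inference sketch with sentence-transformers, using an example checkpoint and toy documents:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

query = "effects of caffeine on sleep"
docs = [
    "Caffeine intake late in the day can delay sleep onset.",
    "The history of coffee cultivation in Ethiopia.",
    "Melatonin regulates the sleep-wake cycle.",
]

q_emb = model.encode(query, convert_to_tensor=True)
d_embs = model.encode(docs, convert_to_tensor=True)

# score each document independently (pointwise), then sort by similarity
scores = util.cos_sim(q_emb, d_embs)[0]
ranking = sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True)
```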
what is the goal of training bi-encoders?
the similarity between the 2 vectors is maximized for docs relevant to q and minimized for non-relevant docs to q, given the similarity function
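one common way to instantiate this objective is a triplet-style margin loss over the embedding similarities; a minimal sketch assuming the embeddings are already computed (the margin value 0.5 is an arbitrary choice, not from the course):

```python
import torch
import torch.nn.functional as F

def bi_encoder_triplet_loss(q, d_pos, d_neg, margin=0.5):
    # push the similarity to the relevant doc above the similarity
    # to the non-relevant doc by at least `margin`
    s_pos = F.cosine_similarity(q, d_pos, dim=-1)
    s_neg = F.cosine_similarity(q, d_neg, dim=-1)
    return F.relu(margin - s_pos + s_neg).mean()

# toy batch of 2 queries, each with one positive and one negative doc
q = torch.randn(2, 128)
d_pos = torch.randn(2, 128)
d_neg = torch.randn(2, 128)
loss = bi_encoder_triplet_loss(q, d_pos, d_neg)
```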
why are bi-encoders less effective than cross-encoders?
cross-encoders can learn relevance signals from attention between the query and candidate texts at each transformer encoder layer
ColBERT
proposed as a model that has the effectiveness of cross-encoders and the efficiency of bi-encoders
- compatible with nearest neighbour search techniques
what is nearest-neighbour search?
finding which document embedding vectors are most similar to the query embedding vector
computing the exact similarity for every (q, d) pair is not scalable => approximate nearest-neighbour (ANN) search
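a sketch of ANN search with the faiss library; the corpus, dimensionality, and HNSW parameters are made-up examples:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384  # example embedding dimensionality
doc_embs = np.random.rand(10_000, dim).astype("float32")  # placeholder corpus embeddings
faiss.normalize_L2(doc_embs)  # normalize so inner product equals cosine similarity

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW ANN index
index.add(doc_embs)

q_emb = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(q_emb)
scores, doc_ids = index.search(q_emb, 10)  # approximate top-10 nearest documents
```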
similarity in ColBERT
the similarity between d and q is the sum of the maximum cosine similarities between each query term and the best-matching term in d: compute the similarity of each query term to every document term, take the maximum per query term, and sum these maxima (the MaxSim operator)
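a minimal MaxSim sketch in PyTorch over toy, L2-normalized term embeddings:

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_embs: torch.Tensor, d_embs: torch.Tensor) -> torch.Tensor:
    # q_embs: (|q|, dim), d_embs: (|d|, dim), both L2-normalized,
    # so the dot product equals cosine similarity
    sim = q_embs @ d_embs.T             # similarity of every q term to every d term
    return sim.max(dim=1).values.sum()  # best-matching d term per q term, summed

# toy term embeddings: 5 query terms, 40 document terms, 128 dims
q = F.normalize(torch.randn(5, 128), dim=-1)
d = F.normalize(torch.randn(40, 128), dim=-1)
score = maxsim_score(q, d)
```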
ColBERT: query time
- embed each query term
- 1st stage: the top-k texts from the corpus are retrieved for each query embedding
- 2nd stage: k candidate texts are scored using all query token representations according to the MaxSim operator and then ranked
ColBERT loss function
L(q, d+, d-) = -log( exp(s_{q,d+}) / (exp(s_{q,d+}) + exp(s_{q,d-})) )
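this is softmax cross-entropy over the scores of one relevant and one non-relevant document; a PyTorch sketch with made-up scores:

```python
import torch
import torch.nn.functional as F

def colbert_pairwise_loss(s_pos: torch.Tensor, s_neg: torch.Tensor) -> torch.Tensor:
    # cross-entropy over (s_pos, s_neg) with the positive as the target class,
    # i.e. -log( exp(s_pos) / (exp(s_pos) + exp(s_neg)) )
    logits = torch.stack([s_pos, s_neg], dim=-1)        # (batch, 2)
    labels = torch.zeros(len(s_pos), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, labels)

# made-up MaxSim scores for a batch of 2 (q, d+, d-) triples
s_pos = torch.tensor([12.3, 8.1])
s_neg = torch.tensor([7.9, 9.0])
loss = colbert_pairwise_loss(s_pos, s_neg)
```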
challenges of long documents
- memory burden of reading the whole document in the encoder
- a mixture of many topics, so query matches may be spread across the document
- neural model must aggregate the relevant matches from different parts
challenges of short documents
- fewer query matches
- but neural models are more robust to the vocabulary mismatch problem than term-based matching models