IR – Dense Retrieval Flashcards
Lecture 6
What is re-ranking?
Re-ranking is changing the ranking of the list of results (documents retrieved from the database for example).
We can calculate the score of a document for a query through an interface score(q, d) and retrieve the 1000 documents with the highest scores (ranking). This first stage is typically done with a BM25 ranker, and then a neural re-ranker re-scores those 1000 documents to produce the top 10. The neural approach is usually much slower than BM25, but more accurate.
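A minimal sketch of such a two-stage pipeline, assuming hypothetical bm25_score and neural_score functions that stand in for a real BM25 implementation and a trained re-ranker:

```python
def two_stage_rank(query, collection, bm25_score, neural_score, k1=1000, k2=10):
    # Stage 1: cheap lexical scoring over the whole collection (BM25).
    candidates = sorted(collection, key=lambda d: bm25_score(query, d), reverse=True)[:k1]
    # Stage 2: expensive neural re-scoring of the small candidate set only.
    reranked = sorted(candidates, key=lambda d: neural_score(query, d), reverse=True)
    return reranked[:k2]
```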
Explain dense retrieval with re-ranking
Compared to normal two-stage ranking (BM25 + neural model), the first stage is replaced with a BERT-style model that encodes the documents; nearest-neighbour vector search then finds the top 1000 documents.
The second stage (optional) can still be a neural re-ranking model. Whether it is used depends on execution-time restrictions, since it is usually slower.
How is the training of neural re-rankers done?
They are usually trained with triplets (query, 1 relevant document, 1 non-relevant document).
Then embeddings are created for all three, and the model should maximize the margin between the relevant and non-relevant document scores (this is the loss function).
Training is done in batches. Each batch holds as many triplets as fit in GPU memory (16-128). Different queries are mixed together in each batch, rather than a single query appearing with many different documents. Then we run a backward pass (backpropagation), compute the gradients, and update the weights.
To deal with inputs (texts) of different lengths, we need to pad the token sequences so they fit into a single matrix.
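A small sketch of padding a batch with PyTorch (the token ids here are made-up stand-ins for tokenizer output):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy token-id sequences of different lengths (stand-ins for tokenized queries/passages).
sequences = [torch.tensor([101, 2054, 2003, 102]),
             torch.tensor([101, 7592, 102]),
             torch.tensor([101, 2129, 2079, 2017, 2079, 102])]

# Pad to the longest sequence in the batch so everything fits one (batch, seq_len) matrix.
input_ids = pad_sequence(sequences, batch_first=True, padding_value=0)
attention_mask = (input_ids != 0).long()   # 1 for real tokens, 0 for padding
print(input_ids.shape)                     # torch.Size([3, 6])
```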
What are some loss functions?
Plain Margin Loss:
loss = max(0, S_nonrel - S_rel + 1)
S is the relevance score between a query and a relevant/non-relevant document. If the ranker says a non-relevant document is very close to the query, this loss is high (S_nonrel is high). Vice versa, if the network correctly scores the non-relevant document low and the relevant one high, the loss is 0 and nothing has to be changed.
RankNet:
loss = BinaryCrossEntropy(S_rel - S_nonrel)
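Both losses as a small PyTorch sketch, with made-up score tensors standing in for the ranker's outputs:

```python
import torch
import torch.nn.functional as F

s_rel = torch.tensor([7.2, 3.1, 5.0])      # scores for relevant documents
s_nonrel = torch.tensor([6.8, 4.0, 1.2])   # scores for non-relevant documents

# Plain margin loss: max(0, S_nonrel - S_rel + 1), averaged over the batch.
margin_loss = torch.clamp(s_nonrel - s_rel + 1.0, min=0.0).mean()

# RankNet: binary cross-entropy on the score difference, i.e.
# -log sigmoid(S_rel - S_nonrel) when the target says "rel should rank above nonrel".
ranknet_loss = F.binary_cross_entropy_with_logits(s_rel - s_nonrel,
                                                  torch.ones_like(s_rel))
```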
How to sample non-relevant passages/documents?
Most collections of documents only have labels for the relevant documents, but not for the non-relevant ones; labelling those would be too costly.
The simple approach is to run the BM25 algorithm to get the top 1000 documents per query and randomly select non-relevant candidates from them. We do want some noise in the data, but not completely irrelevant documents; we want some relevance, just not too much, in the non-relevant documents. If the query is about the universe and space, and we take some document about dogs and cats as the non-relevant one, the model won't learn anything because it is too obvious. On the other hand, if a non-relevant document is about Earth's atmosphere, it is much better because the topic is close but not relevant.
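A minimal sketch of this sampling, assuming a hypothetical bm25_top1000 ranked list of passage ids per query and a set of labeled relevant ids:

```python
import random

def sample_negatives(bm25_top1000, relevant_ids, n_negatives=1):
    # bm25_top1000: ranked passage ids from BM25 for one query (hypothetical input).
    # relevant_ids: set of passage ids labeled relevant for that query.
    # Everything else in the BM25 top 1000 is topically close but (presumably)
    # non-relevant, which gives harder, more informative negatives than random passages.
    pool = [pid for pid in bm25_top1000 if pid not in relevant_ids]
    return random.sample(pool, n_negatives)
```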
What is a normal dense retrieval lifecycle?
- Training: a very expensive process of training some BERT model (for example) to encode documents. Not needed if we use a pre-trained model.
- Indexing: all documents have to be indexed/encoded into embeddings/vectors and stored in a vector database for later use.
- Searching: when a query comes in, it has to be vectorised (with the same model as the documents), and nearest-neighbour search is performed to find the top X documents (the vectors closest to the query vector).
Explain BERTdot
BERTdot is a dense retrieval model based on the BERT architecture. Queries and passages are encoded independently with a BERT model; the passage embeddings are computed offline. When a query comes in, the [CLS] token (as with any BERT input) is placed at the beginning of the input sequence. This [CLS] + query is passed through BERT, and the output at the [CLS] position is the vector embedding of the query. By comparing it with all pre-computed passage embeddings (similarity search), we find the passages with the highest scores.
The similarity search can be a dot product, implemented as a simple matrix multiplication.
It is called BERTdot because you compute embeddings for the query and for the documents, and then compute the dot product between them to get the similarities.
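A sketch of BERTdot-style encoding with Hugging Face Transformers; "bert-base-uncased" is only a placeholder, in practice a checkpoint fine-tuned for retrieval would be loaded, and the passage embeddings would be computed once offline:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]   # hidden state at the [CLS] position

passage_embeddings = encode(["passage one ...", "passage two ..."])   # offline
query_embedding = encode(["example query"])                           # at query time
scores = query_embedding @ passage_embeddings.T                       # dot-product similarity
```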
Explain the purpose of [CLS] token
CLS token: the BERT model is trained in such a way that it aggregates its output into the [CLS] token, which represents a combination of all tokens; as the output of BERT, we just look at the embedding at the [CLS] position. Through BERT's layers, self-attention (where each token attends to every other token, including [CLS]) learns to encode the global context of the sequence into the [CLS] token's hidden state.
What are some of the indexing techniques?
- Flat Index (brute force): convert documents into embeddings and store them in a database. Then convert the query and compare it with every single embedding (a matrix multiplication, fast with enough GPU power).
- Inverted File Index: convert to embeddings, but compute a clustering beforehand. Take the query embedding and first compare it to the cluster centroids, then to the vectors within the selected clusters. A bit of memory overhead, but a large reduction in search space.
- Product Quantization: replace the original vectors with lower-dimensional vectors of integers. This makes the similarity computations much cheaper.
- Graph Indices: using hierarchical navigable small world (HNSW) graphs, you have multiple layers of graphs. The top layer has the smallest number of nodes (documents), so you navigate quickly and get as close to the target document as possible; then you move down the hierarchy and continue. The idea is like travelling between continents, then cities, then streets.
What is the Flat Index indexing technique?
It is a brute-force way of finding the closest neighbours. It calculates the distance (or similarity) between the query vector and every stored vector and retrieves the closest ones. The highest accuracy, but pretty slow.
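A brute-force search sketch in NumPy over random toy vectors (a real index would hold the stored document embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((100_000, 768)).astype(np.float32)   # toy "index"
query = rng.standard_normal(768).astype(np.float32)

# Flat / brute-force search: score the query against every stored vector...
scores = doc_vectors @ query
# ...and keep the k best (argpartition avoids fully sorting all 100k scores).
k = 10
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]   # order the k survivors by score
```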
What is Inverted File Index indexing technique?
We first partition the dataset into clusters. Each cluster has a centroid, and we store the centroid vector and a list of vectors assigned to this cluster.
At query time, we first only compute the similarity between the query and the centroids. Then we select the top n clusters and compute the similarity to all vectors inside these n clusters.
There is a reduction in search space, but there is overhead in clustering and in storing the additional inverted index and clusters.
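A toy IVF sketch with NumPy and scikit-learn's KMeans (a real system would use a library such as FAISS; sizes and parameters here are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((50_000, 128)).astype(np.float32)

# Indexing: cluster the vectors once and remember which vectors fall into which cluster.
n_clusters = 256
kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit(doc_vectors)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(n_clusters)}

def ivf_search(query, n_probe=8, k=10):
    # Compare the query only to the centroids and pick the n_probe closest clusters...
    centroid_scores = kmeans.cluster_centers_ @ query
    probe = np.argsort(-centroid_scores)[:n_probe]
    # ...then score the query against the vectors inside those clusters only.
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    scores = doc_vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]
```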
What is Product Quantization indexing technique?
The idea is to replace the original vector of float values with a lower-dimensional vector of integers.
Each document vector is split into m sub-vectors, and each sub-vector is encoded as a single value. This is done using k-means clustering: each of the m sub-vectors is assigned to a cluster, and the id of that cluster is stored in the lower-dimensional vector.
Instead of n * d floats (where d is the vector size), we store n * m small integers, plus the m codebooks of centroids.
At query time, the query is split into m sub-vectors in the same way, and the distance from each query sub-vector to every centroid is computed once (a lookup table). The query-document distance is then approximated as the sum of the looked-up distances between each query sub-vector and the centroid assigned to the document's corresponding sub-vector.
The results are approximate and depend on the number of splits and the clustering parameters. Much faster and more memory-efficient.
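A toy PQ sketch (NumPy + scikit-learn KMeans) showing the split into m sub-vectors, the integer codes, and the query-time lookup table; all sizes are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.standard_normal((20_000, 128)).astype(np.float32)
m, n_codes = 8, 256                       # m sub-vectors per doc, 256 centroids per sub-space
sub = docs.reshape(len(docs), m, -1)      # shape (n_docs, m, 128 // m)

# Indexing: one k-means per sub-space; each doc is stored as m small integer codes.
codebooks = [KMeans(n_clusters=n_codes, n_init="auto", random_state=0).fit(sub[:, i])
             for i in range(m)]
codes = np.stack([cb.predict(sub[:, i]) for i, cb in enumerate(codebooks)], axis=1)

def pq_search(query, k=10):
    q_sub = query.reshape(m, -1)
    # Lookup table: distance from each query sub-vector to every centroid of its sub-space.
    tables = np.stack([((cb.cluster_centers_ - q_sub[i]) ** 2).sum(axis=1)
                       for i, cb in enumerate(codebooks)])
    # Approximate doc distance = sum of table entries selected by the doc's stored codes.
    approx_dist = tables[np.arange(m), codes].sum(axis=1)
    return np.argsort(approx_dist)[:k]
```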
What is Graph Indices indexing technique?
We use a graph structure for this. Nodes are documents and edges connect similar documents. The entry point is pre-defined, and we keep moving to similar documents until no nearer node is found.
Now, to be more efficient, we can have multiple of these graphs (layers). The highest layer has the lowest number of nodes/documents, so it is much faster to traverse the space (if the target document is very far away, we don't have to visit every single document along the way, because there are fewer documents to see).
When we find the local minimum, we drop one layer down and keep searching for the local minimum there (it can be thought of as taking a plane, then a train, then a car, and then a bicycle to reach a certain street).
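A toy sketch of the layered greedy search (graph construction is omitted; layers is a list of adjacency dicts from the sparsest top layer to the full bottom layer, and vectors holds the node embeddings, all hypothetical inputs; real HNSW additionally keeps a small beam of candidates on the bottom layer):

```python
import numpy as np

def greedy_layer_search(entry, query, neighbors, vectors):
    # Greedy walk on one layer: keep moving to the neighbour closest to the query
    # until no neighbour is closer than the current node (a local minimum).
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for nb in neighbors.get(current, []):
            d = np.linalg.norm(vectors[nb] - query)
            if d < current_dist:
                current, current_dist, improved = nb, d, True
    return current

def hnsw_style_search(query, layers, vectors, entry_point):
    # layers[0] is the sparsest (top) layer, layers[-1] the full graph (bottom layer).
    node = entry_point
    for neighbors in layers:
        # Find the local minimum on this layer, then drop down and continue from it.
        node = greedy_layer_search(node, query, neighbors, vectors)
    return node
```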
What is the Knowledge Distillation?
The goal is to condense the knowledge; it doesn't give us better results, just a faster model.
We have a Teacher neural network that is very large and has high accuracy, but is very slow. We need something faster, so we train a Student neural network, smaller and more efficient, using the Teacher's outputs (or class distributions, or other signals) as the training signal.
If we use just the output of the Teacher as the Student's signal, we are independent of the Teacher's architecture; but if we also use activations, attention distributions, etc., then we are tied to the architecture. On the other hand, we then get more signal than just the final score.
What is DistilBERT?
It is a distilled version of the general-purpose BERT model. It has only 6 layers instead of 12 and shares the same vocabulary.
Even though it is smaller, it retains about 97% of the performance of the normal BERT model.
How to do the distillation in Information Retrieval?
First, we need to train the Teacher normally, using triplets (q, p+ and p-).
Then, the teacher scores are calculated for all passages and all queries (this is done once and stored somewhere).
In the end, we train the Student model using the Teacher's stored inference scores. The Student's loss is based on the difference between the Student's output and the Teacher's score.
What are some of the loss-es in the knowledge distillation?
Margin-MSE Loss:
We don't care about the exact scores: the student doesn't have to reproduce the teacher's scores, but the margin between the relevant and non-relevant document should be similar.
L(Q, P+, P-) = MSE(Ms(Q, P+) - Ms(Q, P-), Mt(Q, P+) - Mt(Q, P-))
This loss is model-agnostic: the model architecture doesn't matter, so teacher and student models can be very different, and we can precompute the teacher scores once (a sketch of both losses is shown below).
KL divergence (Kullback-Leibler): does not compare individual scores, but compares distributions. We can use a softmax to get a probability distribution over all candidate documents (given a query), and we want the student's distribution to be as close as possible to the teacher's distribution.
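A small PyTorch sketch of both distillation losses; the score tensors are made-up stand-ins, with the teacher scores assumed to be precomputed and loaded from disk:

```python
import torch
import torch.nn.functional as F

# Margin-MSE on (query, P+, P-) triplets: match the student's margin to the teacher's.
student_pos = torch.tensor([5.1, 2.0, 4.4], requires_grad=True)   # Ms(Q, P+)
student_neg = torch.tensor([3.9, 2.5, 1.0], requires_grad=True)   # Ms(Q, P-)
teacher_pos = torch.tensor([9.0, 4.2, 7.7])                       # Mt(Q, P+), precomputed
teacher_neg = torch.tensor([6.5, 5.0, 2.1])                       # Mt(Q, P-), precomputed
margin_mse = F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

# KL divergence over a list of candidate passages for one query:
# softmax turns the score lists into distributions, and the loss pushes the
# student distribution towards the teacher distribution.
teacher_scores = torch.tensor([8.0, 5.5, 2.0, 1.0])
student_scores = torch.tensor([4.0, 3.9, 0.5, 0.2], requires_grad=True)
kl_loss = F.kl_div(F.log_softmax(student_scores, dim=0),
                   F.softmax(teacher_scores, dim=0),
                   reduction="batchmean")
```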
Why dense retrieval models struggle on zero-shot?
We want good zero-shot performance for these models, so people can just plug them in and they work. This usually doesn't work because:
- Quirks: the training data might be very specific, or contain attributes specific to that dataset
- Pool bias: many collections are biased towards BM25. When giving documents to annotators, you usually filter with BM25 to get the top 1000 documents first, which introduces bias
- Generalization: models don't generalize to queries different from the ones seen during training (overfitting)