Using Cross-Encoders to calculate similarities among documents

Hello everyone! I have some questions about fine-tuning a Cross-Encoder for a passage/document ranking task.

In Bi-Encoders (like DPR) we can use the Negative Log-Likelihood (NLL) loss during training, where the similarities are calculated as the dot product between the question vector and each document vector.
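For reference, here is a minimal sketch of the loss I have in mind (random tensors stand in for the encoder outputs; in DPR the question and passage encoders are separate BERT models and in-batch negatives are also used):

```python
import torch
import torch.nn.functional as F

# q_vec: question embedding from the question encoder, shape (1, hidden)
# doc_vecs: embeddings of 1 positive + N negative documents, shape (N+1, hidden)
q_vec = torch.randn(1, 768, requires_grad=True)
doc_vecs = torch.randn(4, 768, requires_grad=True)

scores = q_vec @ doc_vecs.T                 # dot-product similarities, shape (1, N+1)
log_probs = F.log_softmax(scores, dim=-1)
loss = -log_probs[0, 0]                     # NLL of the positive document (index 0)
loss.backward()
```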

I wonder if we can apply a similar strategy to Cross-Encoders. In other words, we would concatenate each question-passage pair and do a forward pass through BERT to obtain a similarity value. By doing this for every pair in a training instance (each instance has a question, a positive document, and N negative documents) we would obtain N+1 similarities and could then apply the NLL loss.
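Conceptually, something like this (a rough sketch, not tested; I use a generic BERT sequence-classification head as the cross-encoder and assume the positive passage comes first):

```python
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cross_encoder = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1
)

question = "what is a cross-encoder?"
# 1 positive + N negatives, positive first
passages = ["a relevant passage", "an irrelevant passage", "another irrelevant passage"]

# each pair is encoded as [CLS] question [SEP] passage [SEP]
inputs = tokenizer([question] * len(passages), passages,
                   padding=True, truncation=True, return_tensors="pt")
scores = cross_encoder(**inputs).logits.squeeze(-1)   # one score per pair, shape (N+1,)

loss = -F.log_softmax(scores, dim=-1)[0]               # NLL of the positive pair
loss.backward()
```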

I have seen a similar approach: training SBERT Cross-Encoders. There they use a model like Hugging Face's BertForSequenceClassification, set num_labels = 1, and do a forward pass with a question-document pair.
With this setting, the model is doing regression: the logit is produced by a linear layer with output size = num_labels = 1. After that, they apply BCEWithLogitsLoss and backpropagate.
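If I understood the code correctly, their setting boils down to something like this (a simplified sketch, not the actual sentence-transformers implementation):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

questions = ["the question", "the question"]
passages = ["a relevant passage", "an irrelevant passage"]
labels = torch.tensor([1.0, 0.0])   # 1 = relevant, 0 = not relevant

inputs = tokenizer(questions, passages, padding=True, truncation=True, return_tensors="pt")
logits = model(**inputs).logits.squeeze(-1)            # raw scores, shape (batch,)

loss = torch.nn.BCEWithLogitsLoss()(logits, labels)    # sigmoid + binary cross-entropy
loss.backward()
```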

To apply the negative log-likelihood I need some sort of similarity value between 0 and 1 for each question-document pair, and using BertForSequenceClassification in regression mode seems to be a step in this direction.

  1. Should I 1) replace BCEWithLogitsLoss with a plain sigmoid applied to the logits returned by BertForSequenceClassification, mapping them to a [0,1] similarity, 2) do this for all documents, 3) compute the NLL, and 4) backpropagate? (A sketch of what I mean follows this list.)
  2. Is there another way to keep their setting and still compute the NLL across all documents and backpropagate on it? It seems that I would lose the similarity value if I applied BCEWithLogitsLoss.
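To make question 1 concrete, this is the kind of computation I have in mind (just a sketch of my own proposal, with the positive document at index 0; whether this is sensible is exactly what I am asking):

```python
import torch

# raw cross-encoder scores for 1 positive + N negative documents (positive at index 0)
logits = torch.tensor([2.3, -0.7, 0.1, -1.5], requires_grad=True)

sims = torch.sigmoid(logits)                  # map each score to a [0, 1] "similarity"
loss = -torch.log(sims[0] / sims.sum())       # NLL of the positive over these similarities
loss.backward()
```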

Cheers!


Hi, just want to note that, besides the bi-encoder modules you mentioned, DPR also has a "reader" module which concatenates the question and the passage together and then does cross-attention, as you described. Details are in Section 6 of the paper.

Code example for DPRReader (you can see the concatenated input): https://huggingface.co/transformers/model_doc/dpr.html#dprreader
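Roughly, the usage looks like this (a sketch based on the docs example linked above; please check the link for the exact, up-to-date version):

```python
from transformers import DPRReader, DPRReaderTokenizer

tokenizer = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
model = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base")

# question, passage title and passage text are concatenated into one input sequence
encoded = tokenizer(
    questions=["What is love?"],
    titles=["Haddaway"],
    texts=["'What Is Love' is a song recorded by the artist Haddaway."],
    return_tensors="pt",
)
outputs = model(**encoded)

relevance_logits = outputs.relevance_logits   # passage-level relevance score
start_logits = outputs.start_logits           # answer-span start scores
end_logits = outputs.end_logits               # answer-span end scores
```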

Hi Jung, thank you for your reply.

Indeed, DPR uses cross-attention in the reader; however, the use case here is a bit different: at the end of the cross-encoder I need a similarity value of some sort, which is not what the reader produces.

I have some updates. After more investigation, I found that the authors of Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring give a high-level overview of how they obtain the similarity value with the Cross-Encoder in Section 4.3. Furthermore, there are more insights in the ParlAI transformers repo, where they say: transformer/crossencoder: A retrieval-based agent that jointly encodes a context and candidate sequence in a single BERT-based Transformer, with a final linear layer used to compute a score. A candidate is chosen via the highest-scoring encoding.

It seems that I can use this approach: obtain (unnormalized) scores for all the pairs and then apply a loss function like NLL / cross-entropy over them.
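In PyTorch terms, I think it comes down to something like this (a sketch, assuming the positive candidate is at index 0):

```python
import torch
import torch.nn.functional as F

# raw, unnormalized cross-encoder scores for 1 positive + N negative candidates
scores = torch.randn(1, 4, requires_grad=True)   # shape (batch, N+1), positive at index 0
target = torch.zeros(1, dtype=torch.long)        # index of the positive candidate

# cross_entropy = log_softmax + NLL, so no sigmoid or manual normalization is needed
loss = F.cross_entropy(scores, target)
loss.backward()
```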


Hey Andre, thanks for the referenced papers; they all look interesting!