Using Cross-Encoders to calculate similarities among documents

Hello everyone! I have some questions about fine-tuning a Cross-Encoder for a passage/document ranking task.

In Bi-Encoders (like DPR) we can train with a Negative Log-Likelihood (NLL) loss, where the similarities are computed as the dot product between the question vector and each document vector.
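
For reference, here is a minimal sketch of what I mean by the bi-encoder NLL, assuming PyTorch. The random vectors are stand-ins for real encoder outputs, and the names (`question_vec`, `doc_vecs`, `positive_index`) are just for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for encoder outputs: one question vector and N+1 document vectors.
dim, num_negatives = 8, 4
question_vec = torch.randn(dim)
doc_vecs = torch.randn(num_negatives + 1, dim)  # row 0 plays the positive document
positive_index = 0

# Dot-product similarity between the question and every document: shape (N+1,)
scores = doc_vecs @ question_vec

# NLL of the positive document under a softmax over all N+1 scores
nll = -F.log_softmax(scores, dim=0)[positive_index]
print(nll.item())
```

Note that the dot products are unbounded; the softmax inside the loss does the normalization.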

I wonder if we can apply a similar strategy to Cross-Encoders. In other words, we would concatenate a question-passage pair and do a forward pass through BERT to obtain a similarity value. Doing this for every pair in a training instance (each instance has a question, a positive document, and N negative documents) would give N+1 similarities, to which we could then apply the NLL loss.

I have seen a similar approach: training SBERT Cross-Encoders. Here they use a model like Hugging Face BertForSequenceClassification, set num_labels = 1, and do a forward pass with a question-document pair.
With this setting, the model is doing regression: the logits are calculated by a linear layer with output size = num_labels = 1. After that, they apply BCEWithLogitsLoss and perform backpropagation.
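
A rough sketch of that setting, as I understand it, assuming PyTorch and the transformers library. To keep it runnable without downloading weights, it uses a tiny randomly initialised config and made-up token ids instead of a real tokenizer, so the numbers themselves are meaningless:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Tiny random model (assumption: sizes chosen only to make the sketch fast).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=1)
model = BertForSequenceClassification(config)

# Three made-up, already tokenised question+document pairs (length 10 each).
input_ids = torch.randint(0, 100, (3, 10))
attention_mask = torch.ones_like(input_ids)
# Segment ids: first half "question", second half "document".
token_type_ids = torch.cat([torch.zeros(3, 5, dtype=torch.long),
                            torch.ones(3, 5, dtype=torch.long)], dim=1)

# One logit per pair, produced by the final linear layer: shape (3, 1)
logits = model(input_ids=input_ids, attention_mask=attention_mask,
               token_type_ids=token_type_ids).logits

# Pointwise training signal: 1.0 for the relevant pair, 0.0 for the others.
labels = torch.tensor([1.0, 0.0, 0.0])
loss = torch.nn.BCEWithLogitsLoss()(logits.squeeze(-1), labels)
loss.backward()

# The raw logits are still available; a sigmoid maps them into (0, 1).
probs = torch.sigmoid(logits.squeeze(-1)).detach()
```

BCEWithLogitsLoss applies the sigmoid internally, so the loss is computed here outside the model rather than by passing `labels` to the forward call.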

To apply the negative log-likelihood I need some sort of similarity value between 0 and 1 for a question and a document; using BertForSequenceClassification in regression mode seems to be a step in this direction.

  1. Should I 1) replace BCEWithLogitsLoss with just a sigmoid function that maps the logits returned by BertForSequenceClassification to a [0,1] similarity, 2) do this for all documents, 3) compute NLL, and 4) backpropagate?
  2. Is there another way to keep their setting and then compute NLL across all documents and backpropagate on NLL? It seems that I will lose the similarity value if I apply the BCEWithLogitsLoss.



Hi, just want to note that, besides the bi-encoder modules you mentioned, DPR also has a “reader” module that concatenates the “question” and “passage” and then does cross-attention like you said. Details are in Section 6 of the paper.

Code example for DPRReader (you can see the concat input):

Hi Jung, thank you for your reply.

Indeed, DPR uses cross-attention in the reader; however, the use case here is a bit different. At the end of the cross-encoder, I need a similarity value of some sort, which is not the case with the Reader.

I have some updates. After more investigation, I found that the authors of "Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring" give a high-level overview of how they obtain the similarity value with the Cross-Encoder in Section 4.3. Furthermore, there are more insights in the ParlAI transformers repo, where they say: "transformer/crossencoder: A retrieval-based agent that jointly encodes a context and candidate sequence in a single BERT-based Transformer, with a final linear layer used to compute a score. A candidate is chosen via the highest-scoring encoding."

It seems that I can use this approach: obtain (unnormalized) scores for all the pairs and then use a loss function like NLL or cross-entropy.
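
To make this concrete, here is a small sketch of how I picture the loss, assuming PyTorch. The helper `listwise_nll` and the example scores are hypothetical; in practice each row would hold the N+1 cross-encoder outputs for one question:

```python
import torch
import torch.nn.functional as F

def listwise_nll(scores: torch.Tensor, positive_index: torch.Tensor) -> torch.Tensor:
    """NLL of the positive document under a softmax over each row of scores.

    scores: (batch, N+1) unnormalized cross-encoder scores.
    positive_index: (batch,) column of the positive document in each row.
    """
    # cross_entropy = log_softmax over the candidates + NLL, averaged over the batch
    return F.cross_entropy(scores, positive_index)

# Made-up scores for 2 questions, each with 1 positive and 2 negatives.
scores = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.3, 1.2]], requires_grad=True)
positive_index = torch.tensor([0, 2])

loss = listwise_nll(scores, positive_index)
loss.backward()  # gradients flow back to every score in the list
```

With this formulation no sigmoid is needed before the loss, since the softmax normalizes the scores across the candidates of each question.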


Hey Andre, thanks for the referenced papers, they all look interesting!!