Hello everyone! I have some questions about fine-tuning a Cross-Encoder for a passage/document ranking task.
In Bi-Encoders (like DPR) we can use a Negative Log-Likelihood (NLL) loss during training, where the similarities are computed as the dot product between the question vector and each document vector.
I wonder if we can apply a similar strategy to Cross-Encoders. In other words, we would concatenate a question-passage pair and do a forward pass through BERT. This gives us one similarity value, and by doing this for every pair in a training instance (each instance has a question, a positive document, and N negative documents) we would obtain N+1 similarities, over which we then apply the NLL.
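To be explicit, this is the loss I mean (roughly the NLL from the DPR paper); here $s(q, p)$ is my own notation for the similarity, i.e. the dot product of the two vectors in the bi-encoder case, or the single score the cross-encoder would output in the setting above:

$$
L\big(q, p^{+}, p^{-}_{1}, \dots, p^{-}_{N}\big) = -\log \frac{e^{\,s(q,\, p^{+})}}{e^{\,s(q,\, p^{+})} + \sum_{j=1}^{N} e^{\,s(q,\, p^{-}_{j})}}
$$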
I have seen a similar approach: training SBERT Cross-Encoders. Here they use a model like Hugging Face's BertForSequenceClassification, set num_labels = 1, and do a forward pass with a question-document pair.
With this setting the model is doing regression: the logit is computed by a linear layer with output size = num_labels = 1. After that, they apply BCEWithLogitsLoss and backpropagate.
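If I understand that setting correctly, it looks roughly like the following minimal sketch (the model name, texts and label here are just placeholders of mine, not their actual training code):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

# One (question, passage) pair with a binary relevance label: 1.0 = relevant, 0.0 = not relevant.
question = "what is a cross-encoder?"
passage = "A cross-encoder feeds the concatenated question and passage through BERT."
label = torch.tensor([1.0])

# The pair is encoded jointly, so BERT attends over question and passage together.
inputs = tokenizer(question, passage, truncation=True, return_tensors="pt")
logits = model(**inputs).logits.squeeze(-1)   # shape: (1,), a single unnormalized score

# Regression-style training: BCEWithLogitsLoss on the raw logit against the 0/1 label.
loss = torch.nn.BCEWithLogitsLoss()(logits, label)
loss.backward()
```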
To apply the negative log-likelihood I need some sort of similarity value between 0 and 1 for each question-document pair, and using BertForSequenceClassification in regression mode seems to be a step in this direction.
- Should I 1) replace BCEWithLogitsLoss with just a sigmoid applied to the logits returned by BertForSequenceClassification to map them to a [0, 1] similarity, 2) do this for all documents, 3) compute the NLL, and 4) backpropagate? (A rough sketch of both options follows after these questions.)
- Is there another way to keep their setting and still compute the NLL across all documents and backpropagate on it? It seems that I would lose the similarity values if I apply BCEWithLogitsLoss.
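To make the two options concrete, here is a minimal sketch of what I mean, assuming I already have the raw logits of one training instance (one logit per question-passage pair, positive passage at index 0; the logit values and variable names are just made up by me):

```python
import torch
import torch.nn.functional as F

# Raw logits from BertForSequenceClassification (num_labels=1) for 1 positive + N negative passages,
# obtained from N+1 forward passes (or one batched forward pass). Index 0 = positive passage.
logits = torch.tensor([2.3, -0.7, 0.1, -1.5], requires_grad=True)

# Option 1 (first question): sigmoid the logits to get [0, 1] similarities,
# then normalize them across the N+1 passages and take the NLL of the positive one.
sims = torch.sigmoid(logits)
probs = sims / sims.sum()            # or F.softmax(sims, dim=-1)
loss_option_1 = -torch.log(probs[0])

# Option 2 (second question): skip the sigmoid and treat the raw logits as the similarities,
# i.e. softmax + NLL over the N+1 scores, exactly like the bi-encoder NLL.
loss_option_2 = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))

# Backpropagate on whichever loss is used.
loss_option_2.backward()
```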