I was reading the code for RAG (Retrieval-Augmented Generation) in the transformers GitHub repo.
I wanted to understand how gradients are backpropagated all the way to the query encoder, and I wrote up an answer for it.
But then I wondered: how is the loss for the retrieval model (the query encoder) calculated by simply taking a softmax over the doc_scores? Here
I get that they are adding the softmax over doc_scores to the seq_logits loss in order to backpropagate the gradients. But I am unable to understand the intuition behind taking a softmax over doc_scores, and why that probability distribution is added to the seq_logits loss. The two seem like different things.
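For reference, here is a minimal PyTorch sketch of the marginalization step as I understand it (hypothetical tensor shapes, not the exact library code): the softmax over doc_scores gives log p(z|x), which is added to the per-document token log-probs log p(y|x,z) and then marginalized over the retrieved docs with logsumexp, so the final NLL depends on doc_scores and gradients flow back into the query encoder.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 2 questions, 3 retrieved docs each, target length 4, vocab 10.
batch_size, n_docs, seq_len, vocab_size = 2, 3, 4, 10

# seq_logits: generator logits for each (question, doc) pair.
# doc_scores: retrieval scores (query embedding . doc embedding); in RAG these
# come from the query encoder, here just a leaf tensor for illustration.
seq_logits = torch.randn(batch_size * n_docs, seq_len, vocab_size, requires_grad=True)
doc_scores = torch.randn(batch_size, n_docs, requires_grad=True)

# log p(y_t | x, z) for each retrieved doc z
seq_logprobs = F.log_softmax(seq_logits, dim=-1).view(batch_size, n_docs, seq_len, vocab_size)

# log p(z | x): the softmax over doc_scores turns raw retrieval scores
# into a distribution over the retrieved docs
doc_logprobs = F.log_softmax(doc_scores, dim=1)

# log p(z | x) + log p(y_t | x, z), then marginalize over docs
log_prob_sum = seq_logprobs + doc_logprobs.unsqueeze(-1).unsqueeze(-1)
marginalized = torch.logsumexp(log_prob_sum, dim=1)  # (batch, seq_len, vocab)

# NLL of the target tokens now depends on doc_scores, so the loss
# backpropagates through doc_scores into the query encoder.
target = torch.randint(0, vocab_size, (batch_size, seq_len))
loss = F.nll_loss(marginalized.permute(0, 2, 1), target)
loss.backward()
print(doc_scores.grad is not None)  # True
```

So the doc_scores softmax is not a separate loss that gets added to the seq_logits loss; it is the doc-posterior term in the marginalized likelihood, which is what I'm trying to confirm.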