Hi,
I’m currently trying to calculate gradients for several VLMs. My text data consists of a question and an answer. For PaliGemma, I pass the question to the processor via the text argument and the answer via the suffix argument, like this:
model_inputs = processor(text=question, images=image, suffix=answer,... )
With this setup, the question tokens are correctly masked out when the cross-entropy loss is computed in the PaliGemma forward function. So the loss only trains the model to produce the correct answer and does not include a next-token prediction term for the question tokens.
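For completeness, this is roughly what my PaliGemma setup looks like (a sketch; the checkpoint name and preprocessing details are placeholders):

import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

model_inputs = processor(text=question, images=image, suffix=answer, return_tensors="pt")
outputs = model(**model_inputs)  # the question tokens do not contribute to outputs.loss
outputs.loss.backward()          # gradients come from the answer-only loss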
For LLaVA, I build a conversation, apply the chat template, and then tokenize the result:
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": question},
{"type": "image"},
],
},
{
"role": "assistant",
"content": [
{"type": "text", "text": answer},
],
},
]
llava_text = processor.apply_chat_template(conversation)
model_inputs = processor(text=llava_text, images=image,...)
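To get a loss I then pass a copy of the input IDs as labels (a rough sketch of my setup, details omitted):

model_inputs["labels"] = model_inputs["input_ids"].clone()  # every position is supervised
outputs = model(**model_inputs)  # LlavaForConditionalGeneration computes the cross-entropy loss internally
outputs.loss.backward()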
Now when debugging the loss calculation inside modeling_llava.py, I noticed that the label positions corresponding to the question are not masked out. So the loss also forces the model to reproduce the question, which is not what I want for my use case.
Is there a standard way to mask out the question part here and calculate the loss only over the answer?
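The only workaround I can think of is to tokenize the prompt part separately and mask its positions with -100 by hand, roughly like this (a sketch; it assumes the tokenized prompt is an exact prefix of the full tokenized sequence, which I have not verified for every checkpoint):

# Apply the template to the user turn only, ending with the assistant prefix.
prompt_only = processor.apply_chat_template(conversation[:1], add_generation_prompt=True)
prompt_len = processor(text=prompt_only, images=image, return_tensors="pt")["input_ids"].shape[1]

labels = model_inputs["input_ids"].clone()
labels[:, :prompt_len] = -100  # ignore the question (and image) tokens in the loss
model_inputs["labels"] = labels

But I would prefer a built-in option similar to PaliGemma's suffix argument, if one exists.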
Thank you for your help!