I am trying to fine-tune the Kosmos-2 model on the DocLayNet dataset. It has images, bounding boxes, and their labels. I want to create a token classifier. I have converted the dataset into the string format Kosmos-2 requires. I am getting the error below:
```
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.
```
I found that you mentioned on a GitHub issue that this problem is related to labels.
Do you think the loss function below will resolve my issue?
```python
import torch.nn.functional as F
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        logits = outputs.logits
        # Logits at position t predict the token at position t + 1
        predicted_logits = logits[:, :-1].contiguous()
        # Flatten logits and labels for loss computation
        logits_flat = predicted_logits.view(-1, predicted_logits.size(-1))
        # Shift input_ids left by one to align them with the predictions
        labels_flat = inputs["input_ids"][:, 1:].contiguous().view(-1)
        # Cross-entropy between predicted distributions and actual next tokens
        loss = F.cross_entropy(logits_flat, labels_flat)
        return (loss, outputs) if return_outputs else loss
```
I would really appreciate your response. Please give me some direction, thank you.
A data collator is not yet defined in transformers for vision2seq models like Kosmos2ForConditionalGeneration. That's why labels are empty during training.
When I checked the logits_flat and labels_flat values, they were very different: labels_flat was shifted right by one and its length was reduced by one. I added tokenizer.pad_token_id for testing.
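The shift you observed is expected: in next-token prediction, the logits at position t predict the token at position t + 1, so both tensors lose one step. A minimal sketch with random tensors (the shapes and the use of `ignore_index=-100` for masked positions are illustrative, not Kosmos-2 specifics):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 5, 11
logits = torch.randn(batch, seq_len, vocab)          # what the model returns
input_ids = torch.randint(0, vocab, (batch, seq_len))

# Drop the last logit and the first token so position t lines up with t + 1
predicted_logits = logits[:, :-1].contiguous()       # (batch, seq_len - 1, vocab)
labels = input_ids[:, 1:].contiguous()               # (batch, seq_len - 1)

loss = F.cross_entropy(
    predicted_logits.view(-1, vocab),
    labels.view(-1),
    ignore_index=-100,  # positions set to -100 (e.g. padding) are skipped
)
print(predicted_logits.shape, labels.shape)
```

Both tensors end up one step shorter than the input, which is exactly the length reduction you saw.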
Hello everyone, I opened an issue on GitHub about this problem, and I tried to implement supervised training in the Kosmos2Processor by passing the labels and tokenizing them.
It seems that the Trainer from Transformers is doing its job and effectively training the language model, but I found problems with the images. It looks like the vision encoder is having some trouble: maybe it's not being trained, or maybe it's not being saved; I can't figure out what is happening.
I'll leave a link here to a notebook where I show what I am doing; maybe someone can take a look and help me out. In the future, it could be helpful to implement this strategy inside the Transformers library so others can fine-tune a Kosmos model on their own data.
@cdh Your work involves many changes to the original code to fit your task. I will need more time to take a look, and I'm not sure I can really come up with any insights. In the meantime, let's see if @Mit1208 can arrive at a working example.
To be honest, I wasn't too clear on that. Can you give me a code snippet to create labels? If not, point me to some Python code I can refer to, and I will take it from there.
This model works on next-token prediction, so I thought of using a data collator, but there is none that supports multimodal models as of now.
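In the absence of a built-in multimodal collator, a hand-written `collate_fn` can stack the per-example processor outputs into a batch. The field names below follow the error message earlier in the thread; the tensor shapes are made up for illustration:

```python
import torch

def multimodal_collate_fn(features):
    """Stack per-example tensors into a batch for Kosmos-2-style inputs."""
    batch = {}
    for key in ("pixel_values", "input_ids", "attention_mask",
                "image_embeds_position_mask"):
        batch[key] = torch.stack([torch.as_tensor(f[key]) for f in features])
    return batch

# Toy example: two identical fake features with tiny shapes
feat = {
    "pixel_values": torch.zeros(3, 4, 4),
    "input_ids": torch.tensor([1, 2, 3]),
    "attention_mask": torch.tensor([1, 1, 1]),
    "image_embeds_position_mask": torch.tensor([0, 1, 0]),
}
batch = multimodal_collate_fn([feat, feat])
print(batch["input_ids"].shape)  # torch.Size([2, 3])
```

This assumes every example in the batch has already been padded to the same length; a real collator would also handle padding.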
I really appreciate your help and it’s always very useful.
And I think if my code works, then @cdh's code will work too, because we are both trying to achieve almost the same thing in different ways.
But using a CustomTrainer class is not necessary; you can just prepare labels in your dataset, so they will be received by the model under the hood.
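A minimal sketch of that idea (the helper name `add_labels` is mine, and using `1` as the pad id is just for the toy example): copy `input_ids` into a `labels` field and mask padding with `-100`, the index that PyTorch's cross-entropy ignores by default, so the model computes the loss itself and the stock Trainer works unchanged.

```python
import torch

def add_labels(batch, pad_token_id):
    """Copy input_ids to labels, masking padding with -100 so the
    model's cross-entropy loss skips those positions."""
    labels = batch["input_ids"].clone()
    labels[labels == pad_token_id] = -100
    batch["labels"] = labels
    return batch

# Toy batch: 1 stands in for the real tokenizer.pad_token_id
batch = {"input_ids": torch.tensor([[5, 6, 7, 1, 1]])}
batch = add_labels(batch, pad_token_id=1)
print(batch["labels"])  # pad positions become -100
```

With `labels` present in the inputs, the model returns a `loss` key and the original `ValueError` goes away.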