Kosmos-2 Fine-tuning

Hi @ydshieh,

I am trying to fine-tune the Kosmos-2 model on the DocLayNet dataset, which has images, bounding boxes, and their labels. I want to create a token classifier. I have converted the dataset into the string format Kosmos-2 requires, but I am getting the error below:

ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.

I found that you mentioned in a GitHub issue that this problem is related to labels.

Do you think the loss function below will resolve my issue?

import torch.nn.functional as F
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        logits = outputs.logits

        # Logits at position t predict the token at position t + 1,
        # so drop the last position from the logits
        predicted_logits = logits[:, :-1].contiguous()

        # Flatten logits and labels for loss computation; the targets
        # drop the first token (assuming input_ids is the input sequence)
        logits_flat = predicted_logits.view(-1, predicted_logits.size(-1))
        labels_flat = inputs["input_ids"][:, 1:].contiguous().view(-1)

        # Calculate the cross-entropy loss
        loss = F.cross_entropy(logits_flat, labels_flat)

        return (loss, outputs) if return_outputs else loss
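As a sanity check on the shift above, here is a minimal framework-free sketch (plain Python lists instead of tensors; everything here is illustrative, not part of the Kosmos-2 API) showing how the logit at position t lines up with the token at position t + 1:

```python
# Illustrative only: mimic the shift from compute_loss above with
# plain Python lists instead of tensors.
def shift_for_next_token(input_ids):
    """Return (positions whose logits are scored, their target ids)."""
    scored_positions = input_ids[:-1]  # corresponds to logits[:, :-1]
    targets = input_ids[1:]            # corresponds to input_ids[:, 1:]
    return scored_positions, targets

ids = [101, 7, 42, 9, 102]
scored, targets = shift_for_next_token(ids)
# scored  -> [101, 7, 42, 9]
# targets -> [7, 42, 9, 102]
```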

I really appreciate your response. Please give me some direction, thank you.

I found two things:

  1. A data collator for vision-to-text models such as Kosmos2ForConditionalGeneration is not yet defined in transformers. That's why labels are empty during training.

  2. When I compared the values of logits_flat and labels_flat, they were very different: labels_flat was shifted right by one and its length was reduced by one. I added tokenizer.pad_token_id for testing.

I am not sure whether this is correct. Any thoughts?

Hello everyone, I opened an issue on GitHub about this problem, and I tried to implement supervised training in the Kosmos2Processor by passing the labels and tokenizing them.

It seems that the Trainer from Transformers is doing its job and effectively training the language model, but I ran into problems with the images. It looks like the vision encoder is having some trouble: maybe it's not being trained, or maybe it's not being saved; I can't figure out what is happening.

I leave here a link to a Notebook where I show what I am doing, maybe someone can take a look and help me out. In the future, it could be helpful to implement this strategy inside the Transformers library so others can further fine-tune a Kosmos model on their own data.

Hi, I will take a look (only) next week, sorry for this long delay


Based on Finetune BLIP on customer dataset #20893 - #2 by dxlong2000, I set labels to input_ids in the model call:

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # pass input_ids as labels so the model computes the loss itself
        outputs = model(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            attention_mask=inputs["attention_mask"],
            image_embeds_position_mask=inputs["image_embeds_position_mask"],
            labels=inputs["input_ids"],
        )
        # Ensure that 'eval_loss' is present in the metrics dictionary
        metrics = {"eval_loss": outputs.loss.item()}
        return (outputs.loss, outputs) if return_outputs else outputs.loss

Hi @ydshieh,

I hope you are well. Any quick suggestions, if you have had a chance to look at the issue?


Hi @Mit1208 It would be helpful if I could get access to a code snippet (or a notebook) that has everything, so I can run it directly and see the issue.

I see @cdh has something on the GitHub issue, but the following is mentioned:

I cannot add any example of the data and thus the code will not run correctly

If the data is sensitive and can't be shared, you can still make a fake dummy dataset (even just four or a few dozen examples).

I am more than happy to take a look asap once I get such an access :pray:

(and sorry for such long delay)

Thanks @cdh, I got the dataset

Hi @ydshieh,

It’s okay. Thanks for your time.

Here is my code in a Colab notebook: Google Colab


with a GPU (T4), I always get a GPU OOM
with a CPU, I get a CPU RAM OOM

@Mit1208 could you change the configuration(s) a bit to use fewer resources, so we can get to the relevant issue :pray:

(you can make a copy of the original notebook and make the changes - if you don’t want to touch the original code)


Hi @ydshieh,

I reduced the dataset as much as I could; I have just 4 records, for testing only.

I also removed all unused code. I hope this will be helpful. Thanks.


@Mit1208 I see you defined

class CustomTrainer(Trainer)

and inside it you have

labels = inputs["input_ids"]

Comment 1: why not just prepare the labels (add labels to inputs) right after the line

inputs = processor(...)

Comment 2: as I mentioned, we can't just do

labels = inputs["input_ids"]

as we have to change the padding positions to -100.

I don't think this is done in your notebook.
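A minimal sketch of that -100 replacement (plain Python for clarity; on a tensor you would typically clone input_ids and do `labels[labels == pad_token_id] = -100`, with pad_token_id taken from processor.tokenizer.pad_token_id; the concrete id below is just an assumption for illustration):

```python
# Illustrative only: cross-entropy in PyTorch ignores targets equal to
# -100, so padding positions must be masked out of the labels.
def make_labels(input_ids, pad_token_id):
    return [-100 if tok == pad_token_id else tok for tok in input_ids]

# with a hypothetical pad_token_id of 1:
make_labels([64003, 7, 42, 1, 1], pad_token_id=1)
# -> [64003, 7, 42, -100, -100]
```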

Let me know if there is something unclear above :pray:

Happy to show you how to do it if you need help

@cdh Your work involves many changes to the original code to fit your task. I will need more time to take a look, and I am not sure I can really come up with any insights. In the meantime, let's see if @Mit1208 can arrive at a working example.

@ydshieh, I added the labels in the model call itself because I found a discussion around it, e.g. Finetune BLIP on customer dataset #20893 - #2 by dxlong2000.

To be honest, I wasn't too clear on that. Can you give me a code snippet for creating the labels? If not, point me to some Python code I can refer to, and I will take it from there.

This model works on next-token prediction, so I thought of using a data collator, but none supports multimodal models as of now.
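For reference, a collator along these lines could be sketched by hand (plain Python lists here for clarity; a real implementation would return torch tensors and also stack pixel_values and image_embeds_position_mask, and pad_token_id should come from processor.tokenizer.pad_token_id):

```python
# Illustrative sketch of a hand-rolled multimodal collator: pad
# input_ids and attention_mask to the batch max length, and build
# labels with the padded positions set to -100 so the loss ignores
# them.
def collate(batch, pad_token_id):
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        ids = ex["input_ids"]
        n_pad = max_len - len(ids)
        out["input_ids"].append(ids + [pad_token_id] * n_pad)
        out["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        out["labels"].append(ids + [-100] * n_pad)
    return out
```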

I really appreciate your help and it’s always very useful.

And I think that if my code works, @cdh's code will work too, because we are both trying to achieve almost the same thing in different ways.

I will provide something :slight_smile:

I added labels in model itself

Not really, if I understand correctly. You just do

        outputs = model(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            attention_mask=inputs["attention_mask"],
            image_embeds_position_mask=inputs["image_embeds_position_mask"],
            labels=inputs["input_ids"],
        )

but inside a class CustomTrainer. This is not necessary: you can just prepare the labels in your dataset, so they will be received by the model under the hood.

It’s fine, let me show you



We just need the line

inputs["labels"] = inputs["input_ids"]

as in

(note, I haven’t addressed yet the padding and -100 in the labels)

from datasets import Dataset

inputs = processor(
    images=train_df['image'].to_list(),
    text=train_df['text'].to_list(),
    bboxes=train_df['float_val'].to_list(),
    padding=True,
    truncation=True,
    return_tensors="pt",
).to(device)
inputs["labels"] = inputs["input_ids"]

dataset = Dataset.from_dict(inputs)
train_test_split = dataset.train_test_split(test_size=0.3)

Oh, I see: you created a new labels column, and with default_data_collator the loss will be computed against the next token. Correct me if my understanding is wrong.

I think this is a good start, I will also try to figure out padding and -100.

Thanks @ydshieh

Well, I didn't add default_data_collator: it's in your notebook already (I don't know if it's necessary, however :grin: )
