Kosmos-2 Fine-tuning

I will add the code below to handle padding. Let me know if this is the wrong way to do it.

labels = inputs['input_ids'].clone()
labels[inputs['attention_mask'] == 0] = -100
inputs['labels'] = labels

I think the padding token id is 1 instead of 0. You can see that in Kosmos2TextConfig, as well as by checking the example you have in the notebook.

Other than this, I think that’s it!


I was using attention_mask, so it was working with 0, but for input_ids it’s 1. I checked the config file; the padding token id is indeed 1.

The new code would look like this:

labels = inputs['input_ids'].clone()
labels[inputs['input_ids'] == 1] = -100
inputs['labels'] = labels
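To make the masking concrete, here is a minimal, self-contained sketch of the snippet above (the token ids are toy values I made up; the only assumption carried over from the thread is that Kosmos-2’s pad token id is 1, per Kosmos2TextConfig):

```python
import torch

# Assumed from Kosmos2TextConfig, as discussed above.
PAD_TOKEN_ID = 1

# Toy batch: one sequence padded with 1s at the end.
input_ids = torch.tensor([[0, 64003, 5, 1, 1]])

labels = input_ids.clone()
# -100 is the index ignored by PyTorch's cross-entropy loss,
# so padding positions contribute nothing to training.
labels[input_ids == PAD_TOKEN_ID] = -100

print(labels.tolist())  # [[0, 64003, 5, -100, -100]]
```

Note that this masks every position whose id equals the pad id, which is fine here because no real content token shares id 1 with the pad token.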

@Mit1208 Did you successfully enable FP16 in your training notebook?

@yuerlong fp16=True in the training arguments was giving me an error, so I removed that argument.

FYI - Issue on Kosmos-2 model training on new dataset

@Mit1208 I just added a line before line 1152 of modeling_kosmos2.py to temporarily enable fp16:

inputs_embeds = inputs_embeds.to(image_embeds.dtype)


Hello everyone, @Mit1208 and @ydshieh

I am still eager to see if we can adapt Kosmos2 for my task :smiley:

Did you get any working examples yet? I see the Colab Notebook you shared, but I don’t know if it’s updated. Any news?


hi @cdh

I think I have everything up and running fine, thanks to @ydshieh. I couldn’t train the model only because of GPU resources, but I can share a Colab that has everything. I just need to adjust a few things. I will try to share it in a few hours.

Hi @cdh

Here is my final code, which shows Kosmos-2 fine-tuning. Because of the GPU limitation, I updated the layer settings; you can remove the config parameter while loading the model and you are good to go.

My code:

Finetuning code

Hi, thanks a lot for the impressive work. I am also trying to fine-tune KOSMOS-2, and I checked your Colab notebook. One question: does the code train a KOSMOS-2 from scratch on a customized dataset, or fine-tune the model with LoRA or other fine-tuning methods?

Thanks a lot

Nice, thank you very much.

Could you please briefly explain what was the problem and the solution? It’s hard to go through all your code without the knowledge you have now. In this way, I could understand how to apply that solution to my problem. Even pointers to the code saying “here I changed this into this” or “here I added this line”, etc.

Thanks again!

For anyone other than @Mit1208 (as he knows what’s going on now), the following 2 replies are the 2 necessary changes.

@cdh Unfortunately, I won’t have the bandwidth to dive into the notebook you provided, especially since it contains a lot of custom code and customization.

Regarding the question about labels, see the above 2 links.

For general training with your custom dataset/model, I would recommend:

  • Try a simple dataset (with just a few text/image pairs), train the model on it (probably starting from the pretrained one, though you can of course also try from scratch), and see if the loss decreases and the model gives the desired generation (on the trained examples)

  • Always try to look at the examples (before processing, and after being processed by the Kosmos2 processor), and make sure you understand the output of the processor (which is the input to the model)

  • Once you are familiar with the above, think about what would/should be adjusted for your custom dataset and model
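The first bullet, the "overfit a tiny dataset" sanity check, can be sketched generically; the model and data below are stand-ins (a plain linear layer on random tensors), not Kosmos-2 itself, but the check is the same: run a handful of steps on a few examples and verify the loss actually shrinks.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in "tiny dataset": 4 examples with 16 features each.
x = torch.randn(4, 16)
y = torch.randn(4, 1)

# Stand-in "model"; in the real workflow this would be Kosmos-2.
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    losses.append(loss.item())

# If the training loop is wired correctly, the loss on the tiny
# set should clearly decrease; if it doesn't, debug before scaling up.
assert losses[-1] < losses[0]
```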

Thank you, I already tried to implement this behavior in my code with this notation:

labels_ids = [[-100] * indexes[i] + labels_id[indexes[i]:] for i, labels_id in enumerate(labels_ids_tmp)]
prompt_ids = [prompt_id[:indexes[i]] + [1] * (len(input_ids[0])-indexes[i]) for i, prompt_id in enumerate(prompt_ids_tmp)]

As far as I know, this should be equivalent to your solution, am I right? The difference is that I divided my text into a prompt (the input) and a label (the output).
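For readers following along, here is a small self-contained sketch of what the first list comprehension does per example (the function name and the toy ids are mine, not from the notebook): tokens before the split index belong to the prompt and are masked with -100, so the loss is computed only on the completion.

```python
def mask_prompt(label_ids, index):
    """Replace the first `index` tokens (the prompt) with -100,
    leaving the completion tokens to be scored by the loss."""
    return [-100] * index + label_ids[index:]

# Toy example: first 3 ids are the prompt, the rest the completion.
label_ids = [0, 10, 11, 12, 13, 2]
print(mask_prompt(label_ids, 3))  # [-100, -100, -100, 12, 13, 2]
```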

Could you remind me what issue you have? Is it still the

ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values,image_embeds,projection_attentions,vision_model_output. For reference, the inputs it received are pixel_values,input_ids,attention_mask,image_embeds_position_mask.
  0%|          | 0/10 [00:03<?, ?it/s] 

that you opened
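As a side note on that ValueError: the keys listed in the message show the batch contained no labels key, and without labels the model has nothing to compute a loss against. A simplified, assumed sketch of the check (not the Trainer’s actual source):

```python
# Simplified illustration (assumption, not the real Trainer code):
# training can only proceed if the batch carries a "labels" key
# or the model itself returns a loss.
def can_compute_loss(inputs):
    return "labels" in inputs

bad_batch = {"pixel_values": 0, "input_ids": 0,
             "attention_mask": 0, "image_embeds_position_mask": 0}
good_batch = dict(bad_batch, labels=0)

assert not can_compute_loss(bad_batch)   # -> the ValueError above
assert can_compute_loss(good_batch)      # -> loss is computed
```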


@cdh, I will add comments and reasoning to my code so it will be easier to follow (give me some time).
I made the changes @ydshieh mentioned in the code, so it should work for you.

You’re right, sorry for not reminding you of my issue.

I managed to add the -100 ids to the input with the above code, but when I trained the model the output was not coherent. For a given <image, prompt> pair I get a response A; if I change the image while keeping the same prompt, the response is the same identical A as before, with a completely wrong bounding box.
I am working with images from an automation house, and I want the model to produce the bounding boxes (among other things) of objects in the image. If I change the image, the model should produce different bounding boxes, but it turns out it produces the same response A, where the bounding box points at the wall and there’s nothing useful there.

My intuition is that at some point during training the model started to disregard the images and focused only on the text. So I was wondering whether my way of adding the -100 ids is really the same as yours.

I will try your method and see if there’s any improvement, but it will take a while to train.

Thanks again.