Example for Fine Tuning CLIP or BLIP2 for VQA

I wanted to fine-tune CLIP and BLIP2 for a VQA task on a custom dataset, but I was unsure how to do it. Are there any examples of fine-tuning CLIP and BLIP2 for VQA?

I have implemented a repo here. I hope this can help.

This uses BLIP rather than BLIP-2, no? Any pointers on BLIP-2? The architecture is slightly different.

Hello @swtb, did you find anything on fine-tuning for VQA or BLIP-2, or have you implemented anything? Any help would be really appreciated.

Yes, though I moved on from it. You can find a lot of clues in the model's inputs and in the outputs of its forward pass.
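
To make the "look at the inputs and outputs" suggestion concrete, here is a minimal sketch of a helper that lists the names, shapes, and dtypes in a batch or in a model output. The `describe` function is a hypothetical helper, not part of any library; it only assumes that tensors expose `.shape` and that containers behave like dicts, lists, or tuples, which should cover what a Hugging Face processor returns and what a forward pass returns.

```python
def describe(obj, name="batch", indent=0):
    """Return lines describing the names and shapes found inside obj.

    Hypothetical helper: relies only on duck typing, so it works with
    anything that has a `.shape` (tensors, arrays) nested inside
    dict-like, list, or tuple containers.
    """
    pad = "  " * indent
    if hasattr(obj, "shape"):
        # Tensor-like leaf: report its shape and dtype.
        dtype = getattr(obj, "dtype", "?")
        return [f"{pad}{name}: shape={tuple(obj.shape)} dtype={dtype}"]
    if hasattr(obj, "items"):
        # Dict-like container: recurse into each entry.
        lines = [f"{pad}{name}:"]
        for key, value in obj.items():
            lines.extend(describe(value, str(key), indent + 1))
        return lines
    if isinstance(obj, (list, tuple)):
        # Sequence container: recurse into each element by position.
        lines = [f"{pad}{name}:"]
        for i, value in enumerate(obj):
            lines.extend(describe(value, f"[{i}]", indent + 1))
        return lines
    # Anything else (None, numbers, strings): report the type only.
    return [f"{pad}{name}: {type(obj).__name__}"]
```

In a training script you could call `print("\n".join(describe(inputs)))` on the processor output and again on the result of the forward pass. Seeing which keys are present (for example whether a loss comes back when you pass labels) tells you what the model expects for training.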

I’ll be at my PC later and will attach a code snippet from my training loop.

Tutorials for fine-tuning BLIP-2 are linked here: Transformers-Tutorials/BLIP-2 at master · NielsRogge/Transformers-Tutorials · GitHub. These include notebooks for both full fine-tuning (updating all parameters) and PEFT (parameter-efficient fine-tuning, using LoRA).
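
For a rough sense of why the LoRA route is cheaper than full fine-tuning, here is a back-of-the-envelope sketch. It is not taken from the notebooks. It assumes the standard LoRA setup, where a frozen weight matrix of shape (d_out, d_in) gets a trainable low-rank update B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in). The layer size and rank below are illustrative values only, not BLIP-2's actual configuration.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in); the original
    weight matrix stays frozen.
    """
    return rank * (d_out + d_in)


# Illustrative numbers only (assumed, not read from any model config).
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters per layer")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters per layer")
print(f"LoRA trains {100 * lora / full:.2f}% of the full count")
```

The saving grows with the layer size and shrinks as the rank goes up, so the rank is the main knob trading memory against capacity.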

