Example for Fine-Tuning CLIP or BLIP-2 for VQA

Hi,
I want to fine-tune CLIP and BLIP-2 for a VQA task on a custom dataset, but I'm unsure how to do it. Are there any examples of fine-tuning CLIP or BLIP-2 for VQA?

Thank you


I have implemented a repo here. I hope this can help.

This uses BLIP rather than BLIP-2, no? Any pointers on BLIP-2? The architecture is slightly different.

Hello @swtb, did you find anything for fine-tuning BLIP-2 for VQA, or have you implemented anything yourself? Any help would be really appreciated.

Yes, though I've since moved on from it. You can find a lot of clues from the model's inputs and the outputs of its forward pass.

I’ll be at my PC later and will attach a code snippet from my training loop.
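
For anyone looking for a starting point in the meantime, here is a minimal sketch of what such a training loop could look like. It is not the snippet mentioned above: the train_examples iterable and the "Question: … Answer: …" prompt format are assumptions, and it relies on Blip2ForConditionalGeneration returning a language-modelling loss when labels is passed. It also fine-tunes all parameters in full precision, so in practice you would likely combine it with the 8-bit + LoRA setup shown later in this thread.

import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    for image, question, answer in train_examples:  # hypothetical iterable over your VQA dataset
        # Encode the image together with question + answer so the language-model
        # loss is also computed over the answer tokens.
        encoding = processor(
            images=image,
            text=f"Question: {question} Answer: {answer}",
            return_tensors="pt",
        ).to(device)

        labels = encoding["input_ids"].clone()
        outputs = model(
            pixel_values=encoding["pixel_values"],
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"],
            labels=labels,  # the model shifts the labels internally
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()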

Tutorials for fine-tuning BLIP-2 are linked here: Transformers-Tutorials/BLIP-2 at master · NielsRogge/Transformers-Tutorials · GitHub. These include notebooks for both full fine-tuning (updating all parameters) and PEFT (parameter-efficient fine-tuning using LoRA).


Hi, have you implemented LoRA fine-tuning of BLIP-2 for VQA tasks?

I’d recommend checking out the repo linked above, and then you just need to wrap the BLIP-2 model using PEFT:

from transformers import Blip2ForConditionalGeneration
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# Load the base model in 8-bit to keep memory usage manageable
# (requires bitsandbytes and accelerate to be installed).
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    load_in_8bit=True,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules="...",  # model-specific, see note below
    init_lora_weights="gaussian",
)

# Prepare the quantized model for training, then wrap it with the LoRA adapters.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Then proceed as usual. The target_modules are model-specific.
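
If you're unsure what to pass as target_modules, one way to pick them is to inspect the layer names of the base model (right after from_pretrained, before the get_peft_model call) and choose the attention projections; for the OPT-based BLIP-2 checkpoints these typically have names like q_proj and v_proj. A minimal sketch, continuing from the snippet above:

# Print the full architecture to see the submodule names that LoRA can target.
print(model)

# Or list just the unique trailing module names:
module_names = {name.split(".")[-1] for name, _ in model.named_modules() if name}
print(sorted(module_names))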
