Example for Fine-Tuning CLIP or BLIP-2 for VQA

Hi,
I wanted to fine-tune CLIP and BLIP-2 for a VQA task on a custom dataset, but I was unsure how to do it. Are there any examples of fine-tuning CLIP and BLIP-2 for VQA?

Thank you

I have implemented a repo here. I hope this can help.

This uses BLIP rather than BLIP-2, no? Any pointers on BLIP-2? The architecture is slightly different.

Hello @swtb, did you find anything for fine-tuning BLIP-2 for VQA, or have you implemented anything yourself? Any help would be really appreciated.

Yes, though I moved on from it. You can find a lot of clues from the inputs of the model and the outputs of its forward pass.

I’ll be at my PC later; I’ll attach a code snippet from my training loop.

Tutorials for fine-tuning BLIP-2 are linked here: Transformers-Tutorials/BLIP-2 at master · NielsRogge/Transformers-Tutorials · GitHub. These include notebooks for both full fine-tuning (updating all parameters) and PEFT (parameter-efficient fine-tuning using LoRA).

Hi, have you implemented using LoRA to fine-tune BLIP-2 on VQA tasks?

I’d recommend checking out the repo linked above, and then you just need to wrap the BLIP-2 model using PEFT:

from transformers import Blip2ForConditionalGeneration
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# Load BLIP-2 in 8-bit to cut memory usage (requires bitsandbytes)
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True)

# LoRA configuration; target_modules must match module names inside the model
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules="...",
    init_lora_weights="gaussian",
)

# Prepare the quantized model for training, then wrap it with the LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Then proceed as usual. The target_modules are model-specific.
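If you’re unsure what to pass for target_modules, one way to find candidates is to print the module names of the base model before wrapping it with PEFT. This is just a sketch; the names you get depend entirely on your checkpoint, and the q_proj/v_proj example at the end is an assumption to adjust, not a verified recipe:

# Collect the leaf names of all linear-like layers (covers nn.Linear and the
# 8-bit bitsandbytes variants, whose class names also contain "Linear").
linear_layer_names = set()
for name, module in model.named_modules():
    if "Linear" in type(module).__name__:
        linear_layer_names.add(name.split(".")[-1])
print(sorted(linear_layer_names))

# PEFT matches target_modules against these name suffixes, so you could then try e.g.
# lora_config = LoraConfig(..., target_modules=["q_proj", "v_proj"])  # assumption: verify against the printed names

A common starting point is the attention projections of the language model, but always check the printed output for your specific model.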

Can anyone please share code for fine-tuning BLIP-2 for VQA with PEFT?

@NirmalML: were you able to find suitable code?

Hello @sashika, no, not yet. I have some code, but I’m not sure if it is correct.

I need it urgently, so it would be really kind and helpful if anyone could help me with this.

@not-lain He seems to be in a hurry. Can you think of anyone who might know this code?

code for Finetuning BLIP-2 for VQA?? (PEFT)

Could you please share your code?

I ended up using BLIP, not BLIP-2. My complete script for BLIP is here:

If you’re looking for a basic snippet:

# Encode the image + question together; the tokenized answer becomes the labels
encoding = self.processor(images=image, text=question, padding="max_length", truncation=True, return_tensors="pt")
labels = self.processor(text=answer, return_tensors="pt").input_ids
encoding["labels"] = labels

# Move everything to the GPU and run a forward pass; the model returns the loss
batch = {k: v.cuda() for k, v in encoding.items()}
outputs = model(**batch)
loss = outputs.loss
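For context, here is a minimal sketch of how a snippet like that can sit inside a training loop, assuming BlipProcessor and BlipForQuestionAnswering from transformers and your own iterable of (image, question, answer) triples; the names and hyperparameters are illustrative, not taken from the script above:

import torch
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").cuda()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    for image, question, answer in train_data:  # your own dataset / dataloader
        # Encode image + question, and tokenize the answer as labels
        encoding = processor(images=image, text=question,
                             padding="max_length", truncation=True, return_tensors="pt")
        encoding["labels"] = processor(text=answer, return_tensors="pt").input_ids
        batch = {k: v.cuda() for k, v in encoding.items()}

        # Forward pass returns the loss when labels are provided
        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice you would batch examples with a collator that pads the labels as well; the single-example loop above just keeps the shapes simple.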

If you are interested in the loss used by the model by default, it is on line 899 of transformers/src/transformers/models/blip/modeling_blip_text.py at main · huggingface/transformers · GitHub.

putting the small snippet here:

if labels is not None:
    # we are doing next-token prediction; shift prediction scores and input ids by one
    shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous().to(shifted_prediction_scores.device)
    loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=self.label_smoothing)
    lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
    if reduction == "none":
        lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1)

The above basically computes the loss over all predictions, i.e. every token after the first one, not just the answer. Empirically that is fine, and it even helps if you want to ask specific kinds of questions, since the model gets tuned on the question as well.
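If you did want the loss to cover only the answer tokens, a common trick in setups where the labels contain the question too (e.g. BLIP-2 style, where labels are often a copy of input_ids) is to set the question positions to -100, which CrossEntropyLoss ignores by default. A rough sketch; prompt_length is a hypothetical variable, not something from the code above:

# Mask out the question/prompt portion so only answer tokens contribute to the loss
labels = input_ids.clone()
labels[:, :prompt_length] = -100                            # -100 is ignored by CrossEntropyLoss
labels[labels == processor.tokenizer.pad_token_id] = -100   # also ignore padding

outputs = model(input_ids=input_ids, pixel_values=pixel_values,
                attention_mask=attention_mask, labels=labels)
loss = outputs.loss  # computed only over the unmasked (answer) tokens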
