Hi,
I wanted to fine tune CLIP and BLIP2 for a VQA task on custom dataset, but I was unsure how to do it. Are there any examples for fine tuning CLIP and BLIP2 for VQA?
Thank you
I have implemented a repo here. I hope this can help.
This uses BLIP rather than BLIP-2, no? Any pointers on BLIP-2? The architecture is slightly different.
Hello @swtb , did you find anything for fine-tuning BLIP-2 for VQA, or have you implemented anything? Any help would be really appreciated.
Yes, though I moved on from it. You can find a lot of clues from the inputs of the model and the outputs of its forward pass.
I'll be at my PC later and will attach a code snippet from my training loop.
Tutorials for fine-tuning BLIP-2 are linked here: Transformers-Tutorials/BLIP-2 at master · NielsRogge/Transformers-Tutorials · GitHub. These include notebooks for both full fine-tuning (updating all parameters) and PEFT (parameter-efficient fine-tuning using LoRA).
Hi, have you used LoRA to fine-tune BLIP-2 on VQA tasks?
I’d recommend checking out the repo linked above, and then you just need to wrap the BLIP-2 model using PEFT:
from transformers import Blip2ForConditionalGeneration
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# load BLIP-2 in 8-bit to keep memory usage manageable
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True)

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules="...",  # model-specific, see below
    init_lora_weights="gaussian",
)

# prepare the quantized model for training, then wrap it with the LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Then proceed as usual. The target_modules are model-specific.
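If you're unsure which names to pass as target_modules, one quick way is to print the linear-layer names of the loaded model and pick the projections you want to adapt. A minimal sketch, assuming the model loaded above; the "q_proj"/"v_proj" names in the comment are typical for the OPT language model inside BLIP-2, but verify them against your own printout:

# list the leaf names of all linear-like modules (covers nn.Linear and 8-bit variants)
linear_names = sorted({
    name.split(".")[-1]
    for name, module in model.named_modules()
    if "Linear" in type(module).__name__
})
print(linear_names)

# then, for example (assumption, check your own printout):
# lora_config = LoraConfig(..., target_modules=["q_proj", "v_proj"])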
Can anyone please share the code for fine-tuning BLIP-2 for VQA (PEFT)?
@NirmalML : were you able to find suitable code?
Hello @sashika, no, not yet. I have some code but I'm not sure if it is correct.
I need it urgently, so it would be really kind and helpful if anyone could help me with this.
@not-lain He seems to be in a hurry. Can you think of anyone who might know this code?
Could you please share your code?
I ended up using BLIP rather than BLIP-2, so my complete script for BLIP is here:
If you're looking for a basic snippet:
# inside the dataset's __getitem__: encode the image + question pair
encoding = self.processor(images=image, text=question, padding="max_length", truncation=True, return_tensors="pt")
# the tokenized answer becomes the labels for the language-modeling loss
labels = self.processor(text=answer, return_tensors="pt").input_ids
encoding["labels"] = labels

# in the training loop: move the batch to the GPU and run a forward pass
batch = {k: v.cuda() for k, v in encoding.items()}
outputs = model(**batch)
loss = outputs.loss
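For context, here is a minimal sketch of how the snippet above could sit inside a full training loop. The BlipForQuestionAnswering checkpoint name, the optimizer, and the hyperparameters are assumptions for illustration, not something from this thread:

import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForQuestionAnswering

# assumed checkpoint; the processor is what the dataset uses as self.processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").cuda()

# train_dataset builds `encoding` as above (squeeze the extra batch dim in
# __getitem__ or use a custom collate_fn so tensors stack to (B, ...))
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.cuda() for k, v in batch.items()}
        outputs = model(**batch)   # returns the loss because labels are present
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()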
If you are interested in the loss the model uses by default, it is at line 899 of transformers/src/transformers/models/blip/modeling_blip_text.py at main · huggingface/transformers · GitHub.
putting the small snippet here:
if labels is not None:
    # we are doing next-token prediction; shift prediction scores and input ids by one
    shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous().to(shifted_prediction_scores.device)
    loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=self.label_smoothing)
    lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
    if reduction == "none":
        lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1)
The above computes the loss over all predictions, i.e. every token after the first one, not just the answer.
Empirically that is fine, and it even helps if you want to ask specific kinds of questions, since the model is better tuned on the question too.
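To make the shift in that loss concrete, here is a toy, self-contained illustration (random logits and made-up token ids, purely for demonstration) of how the logits at position i are scored against the token at position i + 1:

import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10
prediction_scores = torch.randn(1, 4, vocab_size)  # (batch, seq_len, vocab) logits
labels = torch.tensor([[2, 5, 7, 1]])              # (batch, seq_len) token ids

# drop the last position's logits and the first token id, so position i
# is compared against the token at position i + 1
shifted_scores = prediction_scores[:, :-1, :].contiguous()
shifted_labels = labels[:, 1:].contiguous()

loss = CrossEntropyLoss()(shifted_scores.view(-1, vocab_size), shifted_labels.view(-1))
print(loss)  # averaged over the 3 predicted positions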