Fine-tuned BLIP model is somehow 10x slower during inference

Hello,

I’m trying to fine-tune a BLIP model on a custom dataset. I am using the following code for fine-tuning (the parts that load the config are omitted for clarity):

    processor = BlipProcessor.from_pretrained(args.blip_path)
    model = BlipForConditionalGeneration.from_pretrained(args.blip_path).train()

    dataset = (
        wds.WebDataset(args.dataset_path)
        .shuffle(500)
        .decode("pil")
        .to_tuple("jpg", "string")
        .map(Processor(processor))
    )

    training_args = TrainingArguments(output_dir=args.ckpt_dir,
                                      logging_dir=args.log_dir,
                                      **config)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
    )

    trainer.train()
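
Here `Processor(processor)` is a small callable that turns each (jpg, string) tuple from the WebDataset into inputs for the model. A simplified sketch of what such a wrapper looks like (not the exact class I use):

    class Processor:
        def __init__(self, processor):
            self.processor = processor  # the BlipProcessor loaded above

        def __call__(self, sample):
            image, caption = sample  # (jpg, string) tuple from the WebDataset
            inputs = self.processor(images=image, text=caption,
                                    padding="max_length", return_tensors="pt")
            # Drop the batch dimension added by the processor and use the
            # caption tokens as labels for the captioning loss.
            inputs = {k: v.squeeze(0) for k, v in inputs.items()}
            inputs["labels"] = inputs["input_ids"].clone()
            return inputs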

The initial model is from https://huggingface.co/Salesforce/blip-image-captioning-large

The PyTorch version on the training server is '2.0.1+cu118' and the transformers version is '4.29.2'.
The model was trained on a single H100 GPU from Lambda Labs.

The problem manifested itself when I tried to run the checkpoint on my own machine. My laptop has an RTX 3060 mobile GPU, with PyTorch ‘2.0.1’ and transformers ‘4.29.2’.

Since the checkpoint saved by the Trainer did not contain tokenizer.json and the other .json configuration files, I copied them over from the original model. Then I tried to run some images through the model.
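
For what it’s worth, this manual copying can be avoided by saving the processor next to the checkpoints in the training script, since args.ckpt_dir is the same output_dir passed to TrainingArguments:

    # Write tokenizer.json and the other processor .json files into the checkpoint directory
    processor.save_pretrained(args.ckpt_dir)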

Here is my inference code (the code for loading data and processing results is omitted for clarity):

    self.processor = BlipProcessor.from_pretrained(model_path)
    self.model = BlipForConditionalGeneration.from_pretrained(model_path).eval()

    # Move the model to the GPU and cast it to float16
    if device != "cpu":
        self.model = self.model.to(device, torch.float16)

    for img_batch in images:
        x = self.processor(images=img_batch, text=[prompt] * len(img_batch), return_tensors="pt")

        # Move the batch to the same device and dtype as the model
        if self.device != "cpu":
            x = x.to(self.device, torch.float16)

        generated_ids = self.model.generate(**x, max_new_tokens=max_new_tokens)
        txts = self.processor.batch_decode(generated_ids, skip_special_tokens=True)

The batch size is 16. When I run my inference code with the checkpoint from Salesforce/blip-image-captioning-large, I get 1.33 s/it on average. However, when I run the same code with my fine-tuned checkpoint, I get 10.66 s/it on average, which is roughly 10x worse. Both models take the same amount of VRAM, and in both cases the GPU is running at close to 100%.

What is going on, and how do I fix it? I’m quite new to deep learning and this is a big mystery to me. Both checkpoints have the same size, so it’s probably not parameter count. Both models run in float16, so it’s probably not that either. Maybe the model from Salesforce was somehow optimized? I found some fine-tuning code (https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb#scrollTo=6cCVhsmJxxjH) that didn’t use the Trainer from transformers, and it didn’t apply any further optimization to the trained model.
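
One way to check whether the two checkpoints are actually producing different outputs (rather than running at different speeds) is to log the length of the generated sequences; a minimal sketch, reusing the variables from the inference loop above:

    generated_ids = self.model.generate(**x, max_new_tokens=max_new_tokens)
    # generated_ids has shape (batch_size, sequence_length); generation time
    # grows with sequence_length, so a checkpoint that produces much longer
    # outputs will also be much slower per batch.
    print("generated tokens per sample:", generated_ids.shape[1])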

I realized what is going on. It turns out the problem lies in the fact that I was testing an early checkpoint. The model wasn’t very good at that point and was adding random repetitions to the end of each caption. As a result, the generated sentences were much longer, so they took much longer to generate. Today I tested a later checkpoint and performance improved. I also tweaked the repetition penalty and temperature, and now the problem has disappeared entirely.
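
For reference, both knobs can be passed straight to generate(); the values below are illustrative, not the exact ones I settled on:

    generated_ids = self.model.generate(
        **x,
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.5,  # discourage the repeated endings (illustrative value)
        do_sample=True,          # temperature only has an effect when sampling
        temperature=0.7,         # illustrative value
    )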