Fine-tuned BLIP model is somehow 10x slower during inference

Hello,

I’m trying to fine-tune a BLIP model on a custom dataset. I am using the following code for fine-tuning (the parts that load the config are omitted for clarity):

    processor = BlipProcessor.from_pretrained(args.blip_path)
    model = BlipForConditionalGeneration.from_pretrained(args.blip_path).train()

    dataset = (
        wds.WebDataset(args.dataset_path)
        .shuffle(500)
        .decode("pil")
        .to_tuple("jpg", "string")
        .map(Processor(processor))
    )

    training_args = TrainingArguments(output_dir=args.ckpt_dir,
                                      logging_dir=args.log_dir,
                                      **config)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
    )

    trainer.train()
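
Here `Processor(processor)` is a small callable that turns each (jpg, string) tuple from the WebDataset into inputs for the model. A simplified sketch of what such a wrapper looks like (not the exact class I use):

    class Processor:
        def __init__(self, processor):
            self.processor = processor  # the BlipProcessor loaded above

        def __call__(self, sample):
            image, caption = sample  # (jpg, string) tuple from the WebDataset
            inputs = self.processor(images=image, text=caption,
                                    padding="max_length", return_tensors="pt")
            # Drop the batch dimension added by the processor and use the
            # caption tokens as labels for the captioning loss.
            inputs = {k: v.squeeze(0) for k, v in inputs.items()}
            inputs["labels"] = inputs["input_ids"].clone()
            return inputs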

The initial model is from https://huggingface.co/Salesforce/blip-image-captioning-large

The PyTorch version on the training server is '2.0.1+cu118' and the transformers version is '4.29.2'.
The model was trained on a single H100 GPU from Lambda Labs.

The problem manifested itself when I tried to run the checkpoint on my own machine. My laptop has an RTX 3060 mobile GPU, with PyTorch ‘2.0.1’ and transformers ‘4.29.2’.

Since the checkpoint saved by the Trainer did not contain tokenizer.json and the other .json configuration files, I copied them over from the original model. Then I tried to run some images through the model.
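
For what it’s worth, this manual copying can be avoided by saving the processor next to the checkpoints in the training script, since args.ckpt_dir is the same output_dir passed to TrainingArguments:

    # Write tokenizer.json and the other processor .json files into the checkpoint directory
    processor.save_pretrained(args.ckpt_dir)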

Here is my inference code (the code for loading data and processing results is omitted for clarity):

    self.processor = BlipProcessor.from_pretrained(model_path)
    self.model = BlipForConditionalGeneration.from_pretrained(model_path).eval()

    # Move the model to the GPU and cast it to float16
    if device != "cpu":
        self.model = self.model.to(device, torch.float16)

    for img_batch in images:
        x = self.processor(images=img_batch, text=[prompt] * len(img_batch), return_tensors="pt")

        # Move the batch to the same device and dtype as the model
        if self.device != "cpu":
            x = x.to(self.device, torch.float16)

        generated_ids = self.model.generate(**x, max_new_tokens=max_new_tokens)
        txts = self.processor.batch_decode(generated_ids, skip_special_tokens=True)

The batch size is 16. When I run my inference code with the checkpoint from Salesforce/blip-image-captioning-large, I get 1.33 s/it on average. However, when I run the same code with my fine-tuned checkpoint, I get 10.66 s/it on average, which is roughly 10x worse. Both models take the same amount of VRAM, and in both cases the GPU is running at close to 100%.

What is going on, and how do I fix it? I’m quite new to deep learning and this is a big mystery to me. Both checkpoints have the same size, so it’s probably not parameter count. Both models run in float16, so it’s probably not that either. Maybe the model from Salesforce was somehow optimized? I found some fine-tuning code (https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb#scrollTo=6cCVhsmJxxjH) that didn’t use the Trainer from transformers, and it didn’t apply any further optimization to the trained model.
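
One way to check whether the two checkpoints are actually producing different outputs (rather than running at different speeds) is to log the length of the generated sequences; a minimal sketch, reusing the variables from the inference loop above:

    generated_ids = self.model.generate(**x, max_new_tokens=max_new_tokens)
    # generated_ids has shape (batch_size, sequence_length); generation time
    # grows with sequence_length, so a checkpoint that produces much longer
    # outputs will also be much slower per batch.
    print("generated tokens per sample:", generated_ids.shape[1])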

I realized what is going on. It turns out the problem lies in the fact that I was testing an early checkpoint. The model wasn’t very good at that point and was adding random repetitions to the end of each caption. As a result, the generated sentences were much longer, so they took much longer to generate. Today I tested a later checkpoint and performance improved. I also tweaked the repetition penalty and temperature, and now the problem has disappeared entirely.
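
For reference, both knobs can be passed straight to generate(); the values below are illustrative, not the exact ones I settled on:

    generated_ids = self.model.generate(
        **x,
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.5,  # discourage the repeated endings (illustrative value)
        do_sample=True,          # temperature only has an effect when sampling
        temperature=0.7,         # illustrative value
    )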