BLIP2 generation outputs depend on batch size

My question is probably related to a few others that have been asked on here (mainly this one), but those questions haven't been answered, and, assuming I'm not totally off base, the implications are somewhat concerning.

I've been fine-tuning a Blip2ForConditionalGeneration model recently on the VQAv2 dataset and noticed inconsistencies in the conditional outputs depending on the size of the batch you feed to the model. I described the issue in detail here; the main idea is that the autoregressive logits from the language-modelling objective for a given sample change depending on whether you pass the sample through the model individually or batch it with others. As a result, the model.generate() call is also influenced by the number of samples you batch together, even when a deterministic generation scheme like greedy decoding is used.
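For intuition on how batching can change a sample's logits, here is a toy single-head attention sketch (plain PyTorch, not BLIP-2's actual attention; all names and values are made up for illustration). Appending padded positions to the keys and values leaves the output unchanged only when they are properly masked out; if the pad positions leak into the softmax, the same sample produces different activations:

```python
import torch

torch.manual_seed(0)
d = 8
q = torch.randn(1, d)   # one query position
k = torch.randn(3, d)   # three real key/value positions
v = torch.randn(3, d)

def attn(q, k, v, mask=None):
    # Scaled dot-product attention; mask marks the real (non-pad) positions.
    scores = q @ k.T / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

out = attn(q, k, v)

# Simulate batching with two extra pad positions appended.
k_pad = torch.cat([k, torch.zeros(2, d)])
v_pad = torch.cat([v, torch.zeros(2, d)])
mask = torch.tensor([True, True, True, False, False])

out_masked = attn(q, k_pad, v_pad, mask)  # pads masked out: identical result
out_leaky = attn(q, k_pad, v_pad)         # pads attended to: result changes

assert torch.allclose(out, out_masked, atol=1e-6)
assert not torch.allclose(out, out_leaky, atol=1e-6)
```

In a real decoder-only LM, a wrong padding side also shifts absolute position embeddings and the token generation continues from, which is another route to batch-size-dependent outputs.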

As an example, if I generate answers to the same VQA samples but vary the batch size, the results change drastically:

from transformers import Blip2ForConditionalGeneration, Blip2Processor

base_model_id = "Salesforce/blip2-opt-2.7b"
model = Blip2ForConditionalGeneration.from_pretrained(base_model_id, device_map="auto")
processor = Blip2Processor.from_pretrained(base_model_id)

dataset = #

# salesforce (LAVIS) generation parameters
lavis_params = {
    "min_length": 1,
    "max_length": 10,
    "num_beams": 5,
    "length_penalty": -1,
}

BATCH_SIZE = #[1,2,4,16]

inputs = processor(images=dataset['image'][0:BATCH_SIZE],
                   text=dataset['question'][0:BATCH_SIZE],  # 'question' key assumed
                   padding=True,
                   return_tensors='pt',
                   return_attention_mask=True)

outputs = model.generate(**inputs, **lavis_params)
outputs = processor.tokenizer.batch_decode(outputs, skip_special_tokens=True)

Using the above code and generating the outputs for the same 4 samples (either by looping in the case of BATCH_SIZE=1 or 2, or by using a single sufficiently large batch with BATCH_SIZE=4 or 16), you get inconsistent answers to the same questions:

- Question: Where is he looking? Short answer:
- Question: What are the people in the background doing? Short answer:
- Question: What is he on top of? Short answer:
- Question: What website copyrighted the picture? Short answer:

# printing outputs depending on batch size ... 

>> ['The sky', 'nothing', 'skateboard', 'none']

>> ['', 'nothing', 'skateboard', '']



I was wondering if anyone has an explanation for what is going on here, whether this is a well-established problem that's being addressed by the HF team, or (option 3) whether I'm just missing something.

I suspect the problem is caused by padding the question inputs. Have you tried setting tokenizer.padding_side = 'left'? The prompted questions should be left-padded, since the language model is the decoder-only OPT model. The problem vanishes when batch_size == 1 because the input needs no padding. When batch_size > 1, however, the tokenized inputs may not work correctly because they are right-padded by default.
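To make the padding-side point concrete, here is a minimal sketch (pure Python, no model; the PAD value and the helper are made up for illustration) of what right vs left padding does to a batch of decoder-only prompts. Generation continues from each row's final position, so that position must hold a real token, not padding:

```python
PAD = 0  # hypothetical pad token id

def pad_batch(seqs, side):
    """Pad token-id sequences to equal length on the given side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        pads = [PAD] * (width - len(s))
        out.append(pads + s if side == "left" else s + pads)
    return out

prompts = [[5, 6, 7], [5, 6]]  # two tokenized prompts of unequal length

right = pad_batch(prompts, "right")  # [[5, 6, 7], [5, 6, PAD]]
left = pad_batch(prompts, "left")    # [[5, 6, 7], [PAD, 5, 6]]

# With right padding, the shorter prompt ends in PAD, so a decoder-only
# model would "continue" from a pad token; with left padding every row
# ends in a real token.
assert right[1][-1] == PAD
assert all(row[-1] != PAD for row in left)
```

With the real processor, the equivalent one-liner before tokenizing the batch would be `processor.tokenizer.padding_side = "left"`.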