My question is probably related to a few others that have been asked on here (mainly this one), but those questions haven’t been answered, and assuming I’m not totally off base, the implications are somewhat concerning.
I’ve been fine-tuning a Blip2ForConditionalGeneration model recently on the VQAv2 dataset and noticed inconsistencies in the conditional outputs depending on the size of the batch you feed to the model. I described the issue in detail here; the main idea is that the autoregressive logits from the language-modelling objective for a given sample change depending on whether you pass the sample through the model individually or batch it with others. As a result, the model.generate() call is also influenced by the number of samples you batch together, even when a deterministic generation scheme like greedy decoding is used.
As an example, if I generate answers to the same VQA samples but vary the batch size, the results change drastically:
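For context on why exact, bitwise agreement can’t really be expected across batch sizes in the first place: floating-point addition isn’t associative, so a batched kernel that reduces in a different order than the unbatched one can shift logits by a few ULPs, and greedy argmax will flip on a near-tie. A toy illustration (plain Python/NumPy, nothing BLIP-2-specific, and it doesn’t explain the drastic empty-string outputs below, only why tiny logit shifts can change a deterministic decode):

```python
import numpy as np

# Float addition is not associative: summing in a different order
# (as a batched GEMM reduction may do) gives a slightly different result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: 0.6000000000000001 vs 0.6

# A ~1e-7 shift is enough to flip greedy argmax on near-tied token scores.
logits = np.array([1.0000001, 1.0000002])    # near-tied scores
perturbed = logits + np.array([2e-7, 0.0])   # tiny batching-induced shift
print(np.argmax(logits), np.argmax(perturbed))  # 1 0
```

That said, tiny numeric drift would normally change the occasional token, not collapse answers to empty strings, which is why the results below still look like a genuine bug to me.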
from transformers import Blip2ForConditionalGeneration, Blip2Processor

base_model_id = "Salesforce/blip2-opt-2.7b"
model = Blip2ForConditionalGeneration.from_pretrained(base_model_id, device_map="auto")
processor = Blip2Processor.from_pretrained(base_model_id)
dataset = # https://huggingface.co/datasets/HuggingFaceM4/VQAv2

# Salesforce (LAVIS) generation parameters
lavis_params = {
    "min_length": 1,
    "max_length": 10,
    "num_beams": 5,
    "length_penalty": -1,
}

BATCH_SIZE = # one of [1, 2, 4, 16]

inputs = processor(dataset['image'][0:BATCH_SIZE],
                   dataset['question'][0:BATCH_SIZE],
                   padding=True,
                   return_tensors='pt',
                   return_attention_mask=True)
outputs = model.generate(**inputs, **lavis_params)
outputs = processor.tokenizer.batch_decode(outputs, skip_special_tokens=True)
Using the above code and generating outputs for the same 4 samples (either by looping in the case of BATCH_SIZE=1 or 2, or by using a sufficiently large batch with BATCH_SIZE=4 or 16), you get inconsistent answers to the same questions:
"""
questions:
- Question: Where is he looking? Short answer:
- Question: What are the people in the background doing? Short answer:
- Question: What is he on top of? Short answer:
- Question: What website copyrighted the picture? Short answer:
"""
# printing outputs depending on batch size ...
# BATCH_SIZE=1
>> ['The sky', 'nothing', 'skateboard', 'none']
# BATCH_SIZE=2
>> ['', 'nothing', 'skateboard', '']
# BATCH_SIZE=4
>> ['', 'nothing', '', '']
# BATCH_SIZE=16
>> ['', 'nothing', '', '']
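For what it’s worth, here is the sanity check I would expect to pass: run one sample through the model alone, then right-pad it and run it inside a batch with the pad positions masked; the hidden states at the real positions should agree up to float tolerance. On a toy PyTorch encoder where the padding mask is handled correctly this does hold, which makes me suspect the BLIP-2 behaviour above is a padding/position-handling issue rather than pure numerics. The toy below is my own sketch, not BLIP-2 code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the LM: a tiny transformer encoder.
d_model = 16
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

short = torch.randn(1, 3, d_model)  # a "sample" of length 3
other = torch.randn(1, 5, d_model)  # a longer sample that forces padding

with torch.no_grad():
    out_single = encoder(short)  # the sample on its own

# Batch the short sample with the longer one: right-pad it to length 5
# and mark the pad positions in the key-padding mask (True = ignore).
padded = torch.cat([short, torch.zeros(1, 2, d_model)], dim=1)
batch = torch.cat([padded, other], dim=0)
pad_mask = torch.tensor([[False, False, False, True, True],
                         [False, False, False, False, False]])
with torch.no_grad():
    out_batch = encoder(batch, src_key_padding_mask=pad_mask)

# With masking handled correctly, the real positions agree to float tolerance.
print(torch.allclose(out_single[0], out_batch[0, :3], atol=1e-4))
```

Running the same comparison on the BLIP-2 language-model logits (single sample vs. the same sample padded inside a batch) is how I noticed the mismatch in the first place.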
I was wondering if anyone has an explanation for what is going on here, or whether this is a well-established problem that’s being addressed by the HF team (or, option 3, I’m just stupid).