"Attempting to unscale FP16 gradients" error when using optimizer in mixed precision training with Accelerate

Hi, I am trying to fine-tune a Blip2-OPT2.7b model with Accelerate to avoid the dreadful ‘CUDA out of memory’ errors. However, I keep getting the following exception when the optimizer steps:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-44511e1fed33> in <cell line: 5>()
     26             # if accelerator.sync_gradients:
     27                 # accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
---> 28             myoptimizer.step()
     29             myoptimizer.zero_grad()

3 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py in _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    227                         continue
    228                     if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 229                         raise ValueError("Attempting to unscale FP16 gradients.")
    230                     if param.grad.is_sparse:
    231                         # is_coalesced() == False means the sparse grad has values with duplicate indices.

ValueError: Attempting to unscale FP16 gradients.

Here is how I load the model:

quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_threshold=200.0)

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("ybelkada/blip2-opt-2.7b-fp16-sharded", device_map="auto", 
                                                      quantization_config=quantization_config)

My optimizer:

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995))

And my training arguments:

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
...
)

Here is how I am using Accelerate to prepare the objects:

if training_args.gradient_checkpointing:
    model.gradient_checkpointing_enable()

accelerator = Accelerator(mixed_precision='fp16',
                          gradient_accumulation_steps=training_args.gradient_accumulation_steps
                          )

mymodel, myoptimizer, mydataloader = accelerator.prepare(model, optimizer, train_dataloader) 
mymodel.train()

The training loop:

max_grad_norm = 1.0
with torch.cuda.amp.autocast():
    for idx, batch in enumerate(mydataloader, start=0):
        with accelerator.accumulate(mymodel):
            print(f"Batch: {idx}")
            
            input_ids = batch.pop("input_ids")#.to(device)
            pixel_values = batch.pop("pixel_values")#.to(device, torch.float16)

            outputs = mymodel(input_ids=input_ids,
                            pixel_values=pixel_values,
                            labels=input_ids)

            loss = outputs.loss
            print("Loss: ", loss.item())
            if math.isnan(loss.item()):
                break
            accelerator.backward(loss)

            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
            myoptimizer.step()
            myoptimizer.zero_grad()
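
For reference, this is how I understand the accumulation settings to play out in the loop above. It is only a rough sketch in plain Python, assuming gradients are synchronized on every `gradient_accumulation_steps`-th batch and on the final batch of the dataloader; `sync_steps` and the dataloader length of 10 are made up for illustration and are not part of Accelerate.

```python
def sync_steps(num_batches, accumulation_steps):
    """Return the batch indices at which gradients would be synchronized.

    Assumption: a sync happens on every `accumulation_steps`-th batch and
    on the final batch of the dataloader.
    """
    synced = []
    for idx in range(num_batches):
        is_last = idx == num_batches - 1
        if (idx + 1) % accumulation_steps == 0 or is_last:
            synced.append(idx)
    return synced


per_device_batch_size = 2   # from training_args above
accumulation_steps = 4      # from training_args above

# Samples contributing to one optimizer update on a single device.
effective_batch_size = per_device_batch_size * accumulation_steps

# 10 batches is a hypothetical dataloader length, for illustration only.
steps = sync_steps(num_batches=10, accumulation_steps=accumulation_steps)
print(effective_batch_size, steps)
```

Under these assumptions, the clipping branch guarded by `accelerator.sync_gradients` would only run on those batches, while `myoptimizer.step()` is still called on every batch and relies on the prepared optimizer to skip the intermediate ones.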

I’ve poked around and found that the error is raised inside the AcceleratedOptimizer: its step function throws the ValueError when it tries to unscale the gradients. I’ve also tried using a GradScaler with the AcceleratedOptimizer myself, but the same error is thrown. For reference, this is what the prepared optimizer prints as:

AcceleratedOptimizer (
Parameter Group 0
    betas: (0.9, 0.995)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)
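
Since the check in the traceback fires when `param.grad.dtype == torch.float16`, one thing worth inspecting is which dtypes the trainable parameters actually have. Below is a minimal sketch of such a check. `count_param_dtypes` and the stand-in parameter names are hypothetical and only there so the snippet runs without loading the model; with the real model I would pass `mymodel.named_parameters()` instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_param_dtypes(named_params):
    """Count parameters by (dtype, requires_grad).

    `named_params` is any iterable of (name, param) pairs where each param
    exposes `.dtype` and `.requires_grad`, e.g. `model.named_parameters()`.
    """
    counts = Counter()
    for _name, param in named_params:
        counts[(str(param.dtype), bool(param.requires_grad))] += 1
    return counts


# Hypothetical stand-in parameters (names and dtypes are invented);
# replace with mymodel.named_parameters() to inspect the real model.
fake_params = [
    ("vision.layernorm.weight", SimpleNamespace(dtype="torch.float16", requires_grad=True)),
    ("lm.linear.weight", SimpleNamespace(dtype="torch.int8", requires_grad=False)),
    ("qformer.dense.bias", SimpleNamespace(dtype="torch.float16", requires_grad=True)),
]

dtype_counts = count_param_dtypes(fake_params)
for (dtype, trainable), n in sorted(dtype_counts.items()):
    print(f"{dtype:>15}  trainable={trainable}  count={n}")
```

If any trainable parameters show up as `torch.float16`, their gradients would be float16 as well, which is exactly the condition the scaler rejects in the traceback above.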

I’ve also tried leaving the optimizer out of accelerator.prepare and stepping the bitsandbytes optimizer directly on gradient-sync steps, still clipping with accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm), but then the loss becomes NaN:

            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
                bnboptimizer.step()
                bnboptimizer.zero_grad()

Am I missing something, or doing one of the steps wrong?