"Attempting to unscale FP16 gradients" error when using optimizer in mixed precision training with Accelerate

sadiatasneem · February 8, 2024, 12:26pm

Hi, I am trying to fine-tune a BLIP-2 OPT-2.7b model with Accelerate to avoid the dreadful ‘CUDA out of memory’ errors. However, I keep getting the following exception when the optimizer tries to step:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-44511e1fed33> in <cell line: 5>()
     26             # if accelerator.sync_gradients:
     27                 # accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
---> 28             myoptimizer.step()
     29             myoptimizer.zero_grad()

3 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py in _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    227                         continue
    228                     if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 229                         raise ValueError("Attempting to unscale FP16 gradients.")
    230                     if param.grad.is_sparse:
    231                         # is_coalesced() == False means the sparse grad has values with duplicate indices.

ValueError: Attempting to unscale FP16 gradients.

Here is how I load the model:

quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_threshold=200.0)

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("ybelkada/blip2-opt-2.7b-fp16-sharded", device_map="auto", 
                                                      quantization_config=quantization_config)

My optimizer:

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995))

And the training arguments:

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
...
)

Here is how I am using Accelerate to prepare the objects:

if training_args.gradient_checkpointing:
    model.gradient_checkpointing_enable()

accelerator = Accelerator(mixed_precision='fp16',
                          gradient_accumulation_steps=training_args.gradient_accumulation_steps
                          )

mymodel, myoptimizer, mydataloader = accelerator.prepare(model, optimizer, train_dataloader) 
mymodel.train()

The training loop:

max_grad_norm = 1.0
with torch.cuda.amp.autocast():
    for idx, batch in enumerate(mydataloader, start=0):
        with accelerator.accumulate(mymodel):
            print(f"Batch: {idx}")
            
            input_ids = batch.pop("input_ids")#.to(device)
            pixel_values = batch.pop("pixel_values")#.to(device, torch.float16)

            outputs = mymodel(input_ids=input_ids,
                            pixel_values=pixel_values,
                            labels=input_ids)

            loss = outputs.loss
            print("Loss: ", loss.item())
            if (math.isnan(loss.item())):
              break
            accelerator.backward(loss)

            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
            myoptimizer.step()
            myoptimizer.zero_grad()

I’ve poked around and found that the issue is in the AcceleratedOptimizer: it raises the ValueError when its step function tries to unscale the gradients. I’ve also tried using a GradScaler with the AcceleratedOptimizer, but the same error is thrown.

AcceleratedOptimizer (
Parameter Group 0
    betas: (0.9, 0.995)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)
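
For reference, the traceback above shows the scaler raising as soon as it meets a gradient whose dtype is torch.float16. A gradient normally has the same dtype as its parameter, so one way to narrow this down is to count the trainable parameters by dtype before the first step. This is only a minimal sketch: the helper name trainable_dtype_counts is hypothetical, and it assumes the model exposes .parameters() the way a torch.nn.Module does.

```python
from collections import Counter


def trainable_dtype_counts(model):
    """Count the parameters that require gradients, grouped by dtype.

    `model` is assumed to expose `.parameters()` like a torch.nn.Module.
    The helper name is hypothetical; it is not part of Accelerate or PyTorch.
    """
    counts = Counter()
    for param in model.parameters():
        # Frozen parameters never receive a gradient, so the scaler ignores them.
        if param.requires_grad:
            counts[str(param.dtype)] += 1
    return dict(counts)
```

Calling trainable_dtype_counts(mymodel) right after accelerator.prepare and finding a non-zero count for torch.float16 would suggest that those parameters produce the FP16 gradients the scaler refuses to unscale.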

I’ve also tried leaving the optimizer out of accelerator.prepare and stepping the bitsandbytes optimizer directly on gradient-sync steps, after calling accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm), but that makes the loss NaN:

            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
                bnboptimizer.step()
                bnboptimizer.zero_grad()

Am I missing something or doing any step wrong?

marcsun13 · April 15, 2024, 1:52pm

Hi @sadiatasneem, you can’t train a bitsandbytes-quantized (bnb) model out of the box. You need to use a method such as PEFT to do that.
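
As a rough illustration of the PEFT route, the sketch below wraps an 8-bit model with LoRA adapters so that only the adapter weights are trained. It is a sketch under assumptions, not a verified recipe for this exact setup: the LoRA hyperparameters and the target_modules names are illustrative guesses for the OPT attention layers, and the helper name wrap_with_lora is hypothetical. The peft functions used (prepare_model_for_kbit_training, LoraConfig, get_peft_model) exist in recent peft releases.

```python
# Illustrative LoRA settings; the values are assumptions, not tuned recommendations.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    # Assumption: these names match the attention projections of the OPT decoder.
    "target_modules": ["q_proj", "k_proj"],
}


def wrap_with_lora(model, lora_kwargs=LORA_KWARGS):
    """Freeze a quantized base model and attach trainable LoRA adapters.

    Hypothetical helper: `model` is the 8-bit model returned by from_pretrained.
    """
    # Imported lazily so the settings above can be inspected without peft installed.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Freezes the base weights and upcasts the remaining half-precision
    # parameters to float32 (behavior of recent peft versions).
    model = prepare_model_for_kbit_training(model)
    peft_model = get_peft_model(model, LoraConfig(**lora_kwargs))
    peft_model.print_trainable_parameters()
    return peft_model
```

With this approach the optimizer would be built from the wrapped model's parameters (for example, model = wrap_with_lora(model) before creating the optimizer and calling accelerator.prepare). The adapter weights should then be the only trainable parameters, and they are created in float32 by default, which should avoid the FP16 gradient check.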
