Hi, I am trying to fine-tune a BLIP2-OPT-2.7b model with Accelerate to escape the dreadful ‘CUDA out of memory’ errors. However, I keep getting the following exception when the optimizer tries to step:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-44511e1fed33> in <cell line: 5>()
26 # if accelerator.sync_gradients:
27 # accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
---> 28 myoptimizer.step()
29 myoptimizer.zero_grad()
3 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py in _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
227 continue
228 if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 229 raise ValueError("Attempting to unscale FP16 gradients.")
230 if param.grad.is_sparse:
231 # is_coalesced() == False means the sparse grad has values with duplicate indices.
ValueError: Attempting to unscale FP16 gradients.
Here is how I load the model:
from transformers import AutoProcessor, BitsAndBytesConfig, Blip2ForConditionalGeneration

quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_threshold=200.0)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("ybelkada/blip2-opt-2.7b-fp16-sharded",
                                                      device_map="auto",
                                                      quantization_config=quantization_config)
my optimizer:
import bitsandbytes as bnb

optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995))
and training arguments:
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    ...
)
Here is how I am using Accelerate to prepare the objects:
from accelerate import Accelerator

if training_args.gradient_checkpointing:
    model.gradient_checkpointing_enable()

accelerator = Accelerator(mixed_precision='fp16',
                          gradient_accumulation_steps=training_args.gradient_accumulation_steps)

mymodel, myoptimizer, mydataloader = accelerator.prepare(model, optimizer, train_dataloader)
mymodel.train()
The training loop:
import math
import torch

max_grad_norm = 1.0

with torch.cuda.amp.autocast():
    for idx, batch in enumerate(mydataloader, start=0):
        with accelerator.accumulate(mymodel):
            print(f"Batch: {idx}")
            input_ids = batch.pop("input_ids")        # .to(device)
            pixel_values = batch.pop("pixel_values")  # .to(device, torch.float16)
            outputs = mymodel(input_ids=input_ids,
                              pixel_values=pixel_values,
                              labels=input_ids)
            loss = outputs.loss
            print("Loss: ", loss.item())
            if math.isnan(loss.item()):
                break
            accelerator.backward(loss)
            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
            myoptimizer.step()
            myoptimizer.zero_grad()
I’ve poked around and found that the issue is in the AcceleratedOptimizer: the ValueError is raised when its step function tries to unscale the gradients. I’ve also tried using a GradScaler with the AcceleratedOptimizer, but the same error is thrown. This is what the prepared optimizer looks like:
AcceleratedOptimizer (
Parameter Group 0
betas: (0.9, 0.995)
eps: 1e-08
lr: 0.001
weight_decay: 0
)
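In case it helps with debugging, here is a small check (just a sketch, not part of my training code) that I can run right after accelerator.backward(loss) to see which parameters end up with fp16 gradients, since that is the dtype the unscale step complains about:

# Sketch: list trainable parameters whose gradients are fp16, i.e. the
# dtype the GradScaler refuses to unscale. Run after accelerator.backward(loss).
for name, param in mymodel.named_parameters():
    if param.requires_grad and param.grad is not None and param.grad.dtype == torch.float16:
        print(name, "param dtype:", param.dtype, "grad dtype:", param.grad.dtype)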
I’ve also tried not preparing the optimizer with Accelerate, and instead calling accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm) on sync steps while stepping the bitsandbytes optimizer directly, but that makes the loss NaN:
if accelerator.sync_gradients:
    accelerator.clip_grad_norm_(mymodel.parameters(), max_grad_norm)
bnboptimizer.step()
bnboptimizer.zero_grad()
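If it’s relevant, I could also add a check like the one below (again only a sketch) right before the clipping and step, to see whether the gradients already contain non-finite values at that point or whether the NaN only appears after the bitsandbytes step:

# Sketch: count parameters with non-finite (NaN/Inf) gradients before
# clipping and stepping, to narrow down where the NaN loss comes from.
bad = 0
for name, param in mymodel.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        bad += 1
        print("non-finite grad in:", name)
print("parameters with non-finite grads:", bad)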
Am I missing something, or am I doing one of these steps wrong?