I am trying to fully fine-tune the model with audio + text + image as input. The script only works when the fine-tuning mode is LoRA (~800k trainable parameters); in every other mode I get several different kinds of errors.
Note: due to computational constraints, I also added a case called "hybrid", where I unfreeze the embeddings on top of LoRA (~1.5B trainable parameters).
The problem only appears on the cluster when launching with torchrun inside Docker. I could accept that it might be a multi-GPU issue, but I rewrote everything to run on a single GPU and it still fails.
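For reference, the hybrid case does roughly the following (a minimal sketch, not my exact code; the LoRA rank and target_modules below are placeholders):

from peft import LoraConfig, get_peft_model

def build_hybrid_model(model):
    # sketch of the "hybrid" mode: LoRA adapters plus unfrozen multimodal embeddings
    # (r, lora_alpha and target_modules are illustrative placeholders, not my real settings)
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    # unfreeze the embedding layers on top of the adapters
    for name, param in model.named_parameters():
        if "embed_tokens_extend" in name:
            param.requires_grad = True
    return model

The relevant part of the training script: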
def configure_model_for_training(model, args):
    """Configure model for training based on finetuning mode and modalities"""
    torch.cuda.empty_cache()
    gc.collect()

    if args.finetuning_mode == 'full':
        print("Enabling full finetuning")
        # collect the audio-embedding parameter ids once; testing membership with
        # `param in ...parameters()` would trigger elementwise tensor `==` comparisons
        audio_param_ids = {id(p) for p in model.model.embed_tokens_extend.audio_embed.parameters()}
        for param in model.parameters():
            if id(param) not in audio_param_ids:
                param.requires_grad = True
    elif args.finetuning_mode == 'lora':
        print("LoRA")
    else:
        print("Test")

    # modality-specific embedding layers
    if args.use_audio and hasattr(model, 'model') and hasattr(model.model, 'embed_tokens_extend'):
        if hasattr(model.model.embed_tokens_extend, 'audio_embed'):
            for param in model.model.embed_tokens_extend.audio_embed.parameters():
                param.requires_grad = True
            print("Enabled gradients for audio embedding layer")

    if args.use_image and hasattr(model, 'model') and hasattr(model.model, 'embed_tokens_extend'):
        if hasattr(model.model.embed_tokens_extend, 'image_embed'):
            for param in model.model.embed_tokens_extend.image_embed.parameters():
                param.requires_grad = True
            print("Enabled gradients for image embedding layer")

    print("\nTrainable parameters:")
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f" - {name}")
model = create_model(
    model_path,
    use_flash_attention=args.use_flash_attention,
)
model.train()
configure_model_for_training(model, args)
run_name = f"{train_dataset.name}_{args.finetuning_mode}_{args.model_name_or_path}_{args.number_of_train_users}_{args.num_train_epochs}-{args.max_steps}_{args.number_of_eval_users}_{args.batch_size}_{args.learning_rate}_{args.wd}"
output_dir = Path(args.output_dir) / run_name
output_dir.mkdir(parents=True, exist_ok=True)
training_args = TrainingArguments(
    num_train_epochs=args.num_train_epochs,
    max_steps=args.max_steps,
    per_device_train_batch_size=args.batch_size_per_gpu,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim='adamw_torch',
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-7,
    learning_rate=args.learning_rate,
    weight_decay=args.wd,
    max_grad_norm=1.0,
    lr_scheduler_type='linear',
    warmup_steps=50,
    logging_steps=10,
    output_dir=str(output_dir),
    save_strategy='no',
    save_total_limit=5,
    save_only_model=True,
    bf16=bf16,
    fp16=fp16,
    remove_unused_columns=False,
    # the string 'none' disables reporting; passing None makes Trainer fall back to all installed integrations
    report_to='wandb' if not args.no_wandb else 'none',
    run_name=run_name,
    deepspeed=None,
    disable_tqdm=not args.tqdm,
    dataloader_num_workers=4,
    ddp_find_unused_parameters=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=multi_modal_qa_collate_fn,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=callbacks,
)
# Training
logger.info(f"Starting {train_dataset.name} training with {args.finetuning_mode} finetuning")
trainer.train()
I get several kinds of errors (they only occur in the test and full modes):

- With the code above as-is, I get (a sketch for isolating this one follows right after the error):

[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 1329 with name model.embed_tokens_extend.audio_embed.encoder.encoders.23._checkpoint_wrapped_module.layer_norm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
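The message points at _set_static_graph(). One way to check whether a static graph actually helps, independently of the Trainer, would be to wrap the configured model in DDP directly and push a single batch through it, roughly like this sketch (single process, matching torchrun --nproc_per_node=1; the batch construction and the .loss access assume the collator returns a dict with labels, which I have not verified outside the Trainer):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process process group; torchrun sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
dist.init_process_group(backend="nccl")

model = create_model(model_path, use_flash_attention=args.use_flash_attention)
configure_model_for_training(model, args)
model.cuda().train()

# static_graph=True is the constructor-level equivalent of _set_static_graph()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()], static_graph=True)

# one forward/backward step with a single collated sample
batch = multi_modal_qa_collate_fn([train_dataset[0]])
batch = {k: v.cuda() if torch.is_tensor(v) else v for k, v in batch.items()}
loss = ddp_model(**batch).loss
loss.backward()

dist.destroy_process_group()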
- If I set gradient_checkpointing=False, then:

AttributeError: 'SiglipEncoder' object has no attribute '_gradient_checkpointing_func'. Did you mean: 'gradient_checkpointing'?
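My reading of that one: the vision encoder is left with its gradient_checkpointing flag set while the checkpointing function was never installed, so its forward falls over. A guess at a workaround (not verified) would be to clear the flag on every submodule after disabling checkpointing:

# disable HF gradient checkpointing and clear any stale per-module flag
# (guesswork based on the SiglipEncoder error, not a confirmed fix)
model.gradient_checkpointing_disable()
for module in model.modules():
    if getattr(module, "gradient_checkpointing", False):
        module.gradient_checkpointing = False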
- If I set ddp_find_unused_parameters=False, I get (a sketch for mapping these indices back to parameter names follows below):

Parameter indices which did not receive grad for rank 0: 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 1331 1332 1333 1334 1340 1342 1344 1346 1348 1350 1352 1354 1356 1358 1360 1362 1364 1366 1368 1370 1372 1374 1376 1378 1380 1382 1384 1386 1388 1390 1392 1394 1396 1398 1400 1402 1404 1406 1408 1410 1412 1414 1416 1418 1420 1422 1424 1426 1428 1430 1432 1434 1436 1438 1440 1442 1444 1446 1448 1450 1452 1454 1456 1458 1460 1462 1464 1466 1468 1470 1472 ...
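Those indices can be mapped back to parameter names to see which modules DDP thinks never produced gradients; as far as I can tell, DDP only indexes the parameters it manages (those with requires_grad=True), in module order, so roughly:

# map DDP's reported parameter indices back to names
# (assumes DDP indexes the requires_grad parameters in named_parameters() order)
ddp_param_names = [n for n, p in model.named_parameters() if p.requires_grad]
for idx in (421, 422, 423, 1331, 1332):  # a few of the indices from the error
    print(idx, ddp_param_names[idx])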
I launch the script with: torchrun --nproc_per_node=1 $@
setup:
- Python 3.12 / Python 3.10 (I have Docker images with both, based on the original NVIDIA images)
- transformers 4.46.3
- torch 2.6 / torch 2.3
- datasets==3.5.0
- peft==0.15.2
- accelerate==1.6.0
- both Docker images work fine for LoRA
Note: I also tried newer versions of transformers and accelerate.
Now, I am asking myself the same question I always ask myself before going to bed: what did I do wrong this time?