I am trying to fully fine-tune the model with audio + text + image as input. The script only works when the fine-tuning mode is LoRA (~800k trainable parameters); in every other mode I get several different kinds of errors.
Note: due to computational constraints, I also added a case called "hybrid", where I unfreeze the embeddings on top of LoRA (~1.5B trainable parameters).
The problem only appears on the cluster when launching with torchrun inside Docker. I could accept that it might be a multi-GPU issue, but I rewrote everything to run on a single GPU and it still fails.
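For reference, the hybrid case does roughly the following (a minimal sketch, not my exact code; the LoRA rank and target_modules below are placeholders):

from peft import LoraConfig, get_peft_model

def build_hybrid_model(model):
    # sketch of the "hybrid" mode: LoRA adapters plus unfrozen multimodal embeddings
    # (r, lora_alpha and target_modules are illustrative placeholders, not my real settings)
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    # unfreeze the embedding layers on top of the adapters
    for name, param in model.named_parameters():
        if "embed_tokens_extend" in name:
            param.requires_grad = True
    return model

The relevant part of the training script: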
def configure_model_for_training(model, args):
    """Configure model for training based on finetuning mode and modalities"""
    torch.cuda.empty_cache()
    gc.collect()

    if args.finetuning_mode == 'full':
        print("Enabling full finetuning")
        # collect the audio-embedding parameter ids once; testing membership with
        # `param in ...parameters()` would trigger elementwise tensor `==` comparisons
        audio_param_ids = {id(p) for p in model.model.embed_tokens_extend.audio_embed.parameters()}
        for param in model.parameters():
            if id(param) not in audio_param_ids:
                param.requires_grad = True
    elif args.finetuning_mode == 'lora':
        print("LoRA")
    else:
        print("Test")

    # modality-specific embedding layers
    if args.use_audio and hasattr(model, 'model') and hasattr(model.model, 'embed_tokens_extend'):
        if hasattr(model.model.embed_tokens_extend, 'audio_embed'):
            for param in model.model.embed_tokens_extend.audio_embed.parameters():
                param.requires_grad = True
            print("Enabled gradients for audio embedding layer")

    if args.use_image and hasattr(model, 'model') and hasattr(model.model, 'embed_tokens_extend'):
        if hasattr(model.model.embed_tokens_extend, 'image_embed'):
            for param in model.model.embed_tokens_extend.image_embed.parameters():
                param.requires_grad = True
            print("Enabled gradients for image embedding layer")

    print("\nTrainable parameters:")
    for name, param in model.named_parameters():
        if param.requires_grad:
            print(f" - {name}")
model = create_model(
    model_path,
    use_flash_attention=args.use_flash_attention,
)
model.train()
configure_model_for_training(model, args)
run_name = f"{train_dataset.name}_{args.finetuning_mode}_{args.model_name_or_path}_{args.number_of_train_users}_{args.num_train_epochs}-{args.max_steps}_{args.number_of_eval_users}_{args.batch_size}_{args.learning_rate}_{args.wd}"
output_dir = Path(args.output_dir) / run_name
output_dir.mkdir(parents=True, exist_ok=True)
training_args = TrainingArguments(
    num_train_epochs=args.num_train_epochs,
    max_steps=args.max_steps,
    per_device_train_batch_size=args.batch_size_per_gpu,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim='adamw_torch',
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-7,
    learning_rate=args.learning_rate,
    weight_decay=args.wd,
    max_grad_norm=1.0,
    lr_scheduler_type='linear',
    warmup_steps=50,
    logging_steps=10,
    output_dir=str(output_dir),
    save_strategy='no',
    save_total_limit=5,
    save_only_model=True,
    bf16=bf16,
    fp16=fp16,
    remove_unused_columns=False,
    # the string 'none' disables reporting; passing None makes Trainer fall back to all installed integrations
    report_to='wandb' if not args.no_wandb else 'none',
    run_name=run_name,
    deepspeed=None,
    disable_tqdm=not args.tqdm,
    dataloader_num_workers=4,
    ddp_find_unused_parameters=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=multi_modal_qa_collate_fn,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=callbacks,
)
# Training
logger.info(f"Starting {train_dataset.name} training with {args.finetuning_mode} finetuning")
trainer.train()
I get several kinds of errors (they only occur in the test and full modes):

- With the code above as-is, I get (a sketch for isolating this one follows right after the error):

[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 1329 with name model.embed_tokens_extend.audio_embed.encoder.encoders.23._checkpoint_wrapped_module.layer_norm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
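The message points at _set_static_graph(). One way to check whether a static graph actually helps, independently of the Trainer, would be to wrap the configured model in DDP directly and push a single batch through it, roughly like this sketch (single process, matching torchrun --nproc_per_node=1; the batch construction and the .loss access assume the collator returns a dict with labels, which I have not verified outside the Trainer):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process process group; torchrun sets MASTER_ADDR/PORT, RANK, WORLD_SIZE
dist.init_process_group(backend="nccl")

model = create_model(model_path, use_flash_attention=args.use_flash_attention)
configure_model_for_training(model, args)
model.cuda().train()

# static_graph=True is the constructor-level equivalent of _set_static_graph()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()], static_graph=True)

# one forward/backward step with a single collated sample
batch = multi_modal_qa_collate_fn([train_dataset[0]])
batch = {k: v.cuda() if torch.is_tensor(v) else v for k, v in batch.items()}
loss = ddp_model(**batch).loss
loss.backward()

dist.destroy_process_group()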
- If I set gradient_checkpointing=False, then:

AttributeError: 'SiglipEncoder' object has no attribute '_gradient_checkpointing_func'. Did you mean: 'gradient_checkpointing'?
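My reading of that one: the vision encoder is left with its gradient_checkpointing flag set while the checkpointing function was never installed, so its forward falls over. A guess at a workaround (not verified) would be to clear the flag on every submodule after disabling checkpointing:

# disable HF gradient checkpointing and clear any stale per-module flag
# (guesswork based on the SiglipEncoder error, not a confirmed fix)
model.gradient_checkpointing_disable()
for module in model.modules():
    if getattr(module, "gradient_checkpointing", False):
        module.gradient_checkpointing = False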
- If I set ddp_find_unused_parameters=False, I get (a sketch for mapping these indices back to parameter names follows below):

Parameter indices which did not receive grad for rank 0: 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 1331 1332 1333 1334 1340 1342 1344 1346 1348 1350 1352 1354 1356 1358 1360 1362 1364 1366 1368 1370 1372 1374 1376 1378 1380 1382 1384 1386 1388 1390 1392 1394 1396 1398 1400 1402 1404 1406 1408 1410 1412 1414 1416 1418 1420 1422 1424 1426 1428 1430 1432 1434 1436 1438 1440 1442 1444 1446 1448 1450 1452 1454 1456 1458 1460 1462 1464 1466 1468 1470 1472 ...
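Those indices can be mapped back to parameter names to see which modules DDP thinks never produced gradients; as far as I can tell, DDP only indexes the parameters it manages (those with requires_grad=True), in module order, so roughly:

# map DDP's reported parameter indices back to names
# (assumes DDP indexes the requires_grad parameters in named_parameters() order)
ddp_param_names = [n for n, p in model.named_parameters() if p.requires_grad]
for idx in (421, 422, 423, 1331, 1332):  # a few of the indices from the error
    print(idx, ddp_param_names[idx])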
I launch the script with: torchrun --nproc_per_node=1 $@
setup:
- Python 3.12 / Python 3.10 (I have Docker images with both, based on the original NVIDIA images)
- transformers 4.46.3
- torch 2.6 / torch 2.3
- datasets==3.5.0
- peft==0.15.2
- accelerate==1.6.0
- both Docker images work fine for LoRA
Note: I also tried newer versions of transformers and accelerate.
Now, I am asking myself the same question I always ask myself before going to bed: what did I do wrong this time?