Synchronizing State, Trainer and Accelerate

I'm using Trainer, and it appears that if I load any class from accelerate, the Trainer doesn't perform its accelerate magic behind the scenes, and I get an error like this:

[rank1]:   File "/opt/code/repos/MyProject/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5779, in caching_allocator_warmup
[rank1]:     re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
[rank1]:                                                      ^^^^^^^^^^^^^^
[rank1]: TypeError: 'NoneType' object is not iterable

I have two use cases where I’d like slightly more control:

  1. My script creates a checkpoint directory named with a timestamp, and because each process computes its own name there is a synchronization issue: two checkpoint directories get created, one for each GPU.

  2. I load two models, and the second load always fails with this error. It appears that once the Trainer/TrainingArguments go out of scope, the accelerate process state is torn down and doesn't get reinitialized.

How can I take more control of the process? Is there a way to manually manage accelerate alongside the Trainer and TrainingArguments objects? And what about synchronization primitives: something that allows a function to run on the main process before the other processes continue (a sketch of the kind of thing I mean is below)? I tried the decorators, but they cause the Trainer code to crash with the same error.
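
For reference, this is roughly the kind of synchronization I have in mind for the timestamped directory, sketched with accelerate's PartialState plus torch.distributed. make_run_dir is just an illustrative helper of my own, not an existing API:

# Rough sketch of the "run on the main process first" primitive I'm after,
# using accelerate's PartialState and torch.distributed collectives.
import os
from datetime import datetime

import torch.distributed as dist
from accelerate import PartialState


def make_run_dir(base="checkpoints"):
    state = PartialState()  # attaches to the already-initialized process state
    name = [None]
    if state.is_main_process:
        # Only rank 0 picks the timestamp, so exactly one directory is created.
        name[0] = datetime.now().strftime("%Y%m%d-%H%M%S")
        os.makedirs(os.path.join(base, name[0]), exist_ok=True)
    if dist.is_available() and dist.is_initialized():
        # Share the chosen name with the other ranks (blocking collective).
        dist.broadcast_object_list(name, src=0)
    state.wait_for_everyone()  # every rank waits until the directory exists
    return os.path.join(base, name[0])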


I have worked around this issue by modifying caching_allocator_warmup so that the tp_plan regex is only built when the model actually has a tensor-parallel plan: in addition to checking _torch_distributed_available and torch.distributed.is_initialized(), the condition also checks that model._tp_plan is valid:
if _torch_distributed_available and torch.distributed.is_initialized() and hasattr(model, '_tp_plan') and model._tp_plan is not None.

This prevents the failure, and DDP works correctly across multiple Trainer invocations.

I don’t know the implications of this _tp_plan modification, but my AI pair programmer suggests that when using accelerate launch with DDP, model._tp_plan should indeed be None. (The pair programmer was not helpful in fixing this properly; no impactful suggestions.) If I understood it better I would open an issue and submit a pull request. For now, I will just monkeypatch it.
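
For anyone hitting the same thing, this is roughly what my monkeypatch looks like. It assumes the condition quoted above, i.e. a module-level _torch_distributed_available flag that caching_allocator_warmup reads; transformers internals change between versions, so treat it as a sketch rather than a drop-in fix:

# Sketch of the monkeypatch, assuming caching_allocator_warmup guards the
# tp_plan regex with the module-level _torch_distributed_available flag.
import transformers.modeling_utils as modeling_utils

_original_warmup = modeling_utils.caching_allocator_warmup


def _patched_warmup(model, *args, **kwargs):
    if getattr(model, "_tp_plan", None) is not None:
        return _original_warmup(model, *args, **kwargs)
    # No tensor-parallel plan (plain DDP via accelerate launch): temporarily
    # force the non-distributed branch so the regex stays None instead of
    # iterating over a None _tp_plan.
    saved = modeling_utils._torch_distributed_available
    modeling_utils._torch_distributed_available = False
    try:
        return _original_warmup(model, *args, **kwargs)
    finally:
        modeling_utils._torch_distributed_available = saved


modeling_utils.caching_allocator_warmup = _patched_warmup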


Also noting that the few issues I’ve found related to the iteration over a None _tp_plan treat it as the model’s fault, addressable through proper _post_init usage. That seems like a brittle solution and one that won’t scale across all the sources of custom models.

