Synchronizing State, Trainer and Accelerate

I'm using Trainer, and it appears that if I load any class from accelerate, the Trainer doesn't perform its accelerate magic behind the scenes, and I get an error like this:

[rank1]:   File "/opt/code/repos/MyProject/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5779, in caching_allocator_warmup
[rank1]:     re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
[rank1]:                                                      ^^^^^^^^^^^^^^
[rank1]: TypeError: 'NoneType' object is not iterable

I have two use cases where I’d like slightly more control:

  1. My script creates a checkpoint directory named with a timestamp, and because each process computes its own name there is a synchronization issue: two checkpoint directories get created, one for each GPU.

  2. I load two models, and the second load always fails with this error. It appears that once the Trainer/TrainingArguments go out of scope, the accelerate process state is torn down and doesn't get reinitialized.

How can I take more control of the process? Is there a way to manually manage accelerate alongside the Trainer and TrainingArguments objects? And what about synchronization primitives: something that allows a function to run on the main process before the other processes continue (a sketch of the kind of thing I mean is below)? I tried the decorators, but they cause the Trainer code to crash with the same error.
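
For reference, this is roughly the kind of synchronization I have in mind for the timestamped directory, sketched with accelerate's PartialState plus torch.distributed. make_run_dir is just an illustrative helper of my own, not an existing API:

# Rough sketch of the "run on the main process first" primitive I'm after,
# using accelerate's PartialState and torch.distributed collectives.
import os
from datetime import datetime

import torch.distributed as dist
from accelerate import PartialState


def make_run_dir(base="checkpoints"):
    state = PartialState()  # attaches to the already-initialized process state
    name = [None]
    if state.is_main_process:
        # Only rank 0 picks the timestamp, so exactly one directory is created.
        name[0] = datetime.now().strftime("%Y%m%d-%H%M%S")
        os.makedirs(os.path.join(base, name[0]), exist_ok=True)
    if dist.is_available() and dist.is_initialized():
        # Share the chosen name with the other ranks (blocking collective).
        dist.broadcast_object_list(name, src=0)
    state.wait_for_everyone()  # every rank waits until the directory exists
    return os.path.join(base, name[0])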


I have worked around this issue by modifying caching_allocator_warmup so that the tp_plan regex is only built when the model actually has a tensor-parallel plan: in addition to checking _torch_distributed_available and torch.distributed.is_initialized(), the condition also checks that model._tp_plan is valid:
if _torch_distributed_available and torch.distributed.is_initialized() and hasattr(model, '_tp_plan') and model._tp_plan is not None.

This prevents the failure, and DDP works correctly across multiple Trainer invocations.

I don’t know the implications of this _tp_plan modification, but my AI pair programmer suggests that when using accelerate launch with DDP, model._tp_plan should indeed be None. (The pair programmer was not helpful in fixing this properly; no impactful suggestions.) If I understood it better I would open an issue and submit a pull request. For now, I will just monkeypatch it.
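
For anyone hitting the same thing, this is roughly what my monkeypatch looks like. It assumes the condition quoted above, i.e. a module-level _torch_distributed_available flag that caching_allocator_warmup reads; transformers internals change between versions, so treat it as a sketch rather than a drop-in fix:

# Sketch of the monkeypatch, assuming caching_allocator_warmup guards the
# tp_plan regex with the module-level _torch_distributed_available flag.
import transformers.modeling_utils as modeling_utils

_original_warmup = modeling_utils.caching_allocator_warmup


def _patched_warmup(model, *args, **kwargs):
    if getattr(model, "_tp_plan", None) is not None:
        return _original_warmup(model, *args, **kwargs)
    # No tensor-parallel plan (plain DDP via accelerate launch): temporarily
    # force the non-distributed branch so the regex stays None instead of
    # iterating over a None _tp_plan.
    saved = modeling_utils._torch_distributed_available
    modeling_utils._torch_distributed_available = False
    try:
        return _original_warmup(model, *args, **kwargs)
    finally:
        modeling_utils._torch_distributed_available = saved


modeling_utils.caching_allocator_warmup = _patched_warmup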


Also noting that the few issues I’ve found related to the iteration over a None _tp_plan treat it as the model’s fault, addressable through proper _post_init usage. That seems like a brittle solution and one that won’t scale across all the sources of custom models.

