I'm using Trainer, and it appears that if I instantiate any class from accelerate, the Trainer no longer performs its accelerate setup behind the scenes, and I get an error like this:
```
[rank1]: File "/opt/code/repos/MyProject/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5779, in caching_allocator_warmup
[rank1]:     re.compile("|".join([re.escape(plan) for plan in model._tp_plan]))
[rank1]:                                                      ^^^^^^^^^^^^^^
[rank1]: TypeError: 'NoneType' object is not iterable
```
I have two use cases where I’d like slightly more control:
- My script creates an output directory named with a timestamp, and there is a synchronization issue (presumably each rank computes its own timestamp) that creates two checkpoint directories, one for each GPU.
- I load two models, and the second load always fails with the error above. It appears that once the Trainer/TrainingArguments go out of scope, the accelerate process state is torn down and never reinitialized.
How can I take more control of this process? Is there a way to manage accelerate manually while still using the Trainer and TrainingArguments objects? And what about synchronization primitives: something that lets a function run on the main process before the other ranks proceed? I tried accelerate's decorators, but they make the Trainer crash with the same error.