Error when instantiating dataloaders outside training_function

Currently testing out HuggingFace Accelerate on TPUs on Colab based on the example notebook.

My question: do you have to instantiate the dataloaders inside the training function? I didn't, and I get this error:

UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:5: local variable 'train_dataloader' referenced before assignment
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/", line 274, in __call__
  File "<ipython-input-20-001611a4ea01>", line 33, in training_function
    model, optimizer, train_dataloader, val_dataloader, test_dataloader
UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:4: local variable 'train_dataloader' referenced before assignment

It’s weird, because accelerator.prepare is called within the training_function. train_dataloader is a global variable; is that the problem?

cc @sgugger

Update: apparently you do have to, because it only works when the dataloaders are instantiated inside the training_function.

Yes, we were just investigating that with @lewtun. Weirdly, it works if you name the prepared dataloader differently (train_dl, for instance) but not if it has the same name as the global variable.
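That matches Python's scoping rules: any name assigned anywhere in a function body is treated as local for the whole function, so reading the global of the same name before that assignment raises UnboundLocalError. A minimal sketch, with toy strings standing in for the real dataloaders:

```python
train_dataloader = "global dataloader"  # module-level, like in the notebook

def broken():
    # The assignment further down makes `train_dataloader` local to the
    # whole function, so this read fails before the assignment ever runs.
    try:
        prepared = train_dataloader + " (prepared)"
    except UnboundLocalError as e:
        return "caught: " + type(e).__name__
    train_dataloader = prepared  # rebinding the SAME name triggers the issue
    return train_dataloader

def fixed():
    # Binding the result to a new name (train_dl) leaves the global readable.
    train_dl = train_dataloader + " (prepared)"
    return train_dl

print(broken())  # caught: UnboundLocalError
print(fixed())   # global dataloader (prepared)
```

So the error isn't specific to Accelerate or TPUs; any `train_dataloader = accelerator.prepare(..., train_dataloader, ...)` line inside the function shadows the global for the whole function body.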


Hmm, yes, that’s weird. Also, how can you access the model after you’ve trained it? The model variable is only defined within the training function.

Update: I tried adding model.save_pretrained(directory_name) within the function, so that you can access the model after training, but this gives me an error:

Configuration saved in /content/config.json
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-28-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
----> 3 notebook_launcher(training_function)

3 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/ in join(self, timeout)
    134           ,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 7 terminated with signal SIGKILL

This is probably because the model is still replicated on 8 TPU cores?

You need to specifically use the save function PyTorch XLA gives you to avoid this error, so use accelerator.save (which dispatches to it on TPU). See the doc on saving for more context, but basically:

# Wait for all the processes to arrive
accelerator.wait_for_everyone()
# Save the real model, not the DDP wrapper
unwrapped_model = accelerator.unwrap_model(model)
# Use accelerator.save to save
accelerator.save(unwrapped_model.state_dict(), filename)

See also this script for another example.
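Once the state dict has been saved that way, it can be loaded back in the main process after notebook_launcher returns. A minimal sketch, with a hypothetical TinyModel class standing in for the actual model (in real code you would instantiate your own model class and use the path you saved to):

```python
import torch
import torch.nn as nn

# Hypothetical placeholder standing in for the real model class.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

# What the training processes would have done via accelerator.save(...):
model = TinyModel()
torch.save(model.state_dict(), "model.pt")

# Back in the main process: rebuild the model and load the saved weights.
restored = TinyModel()
restored.load_state_dict(torch.load("model.pt", map_location="cpu"))
```

The key point is that the saved file is a plain state dict of CPU/XLA tensors, so it can be reloaded into a fresh model instance on any device.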