Error when instantiating dataloaders outside training_function

Currently testing out HuggingFace Accelerate on TPUs on Colab based on the example notebook.

My question: do you have to instantiate the dataloaders inside the training function? I didn't, and I get this error:

UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:5: local variable 'train_dataloader' referenced before assignment
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/", line 274, in __call__
  File "<ipython-input-20-001611a4ea01>", line 33, in training_function
    model, optimizer, train_dataloader, val_dataloader, test_dataloader
UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:4: local variable 'train_dataloader' referenced before assignment

It’s weird, because accelerator.prepare is called within the training_function. train_dataloader is a global variable; is that the problem?

cc @sgugger

Update: apparently you do have to, because it only works when the dataloaders are instantiated inside the training_function.

Yes, we were just investigating that with @lewtun. Weirdly, it works if you name the prepared dataloader differently (train_dl, for instance) but not if it has the same name as the global variable.
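That matches Python's scoping rules: any name assigned anywhere in a function body is treated as local for the whole function, so reading the global of the same name before that assignment raises UnboundLocalError. A minimal sketch, with toy strings standing in for the real dataloaders:

```python
train_dataloader = "global dataloader"  # module-level, like in the notebook

def broken():
    # The assignment further down makes `train_dataloader` local to the
    # whole function, so this read fails before the assignment ever runs.
    try:
        prepared = train_dataloader + " (prepared)"
    except UnboundLocalError as e:
        return "caught: " + type(e).__name__
    train_dataloader = prepared  # rebinding the SAME name triggers the issue
    return train_dataloader

def fixed():
    # Binding the result to a new name (train_dl) leaves the global readable.
    train_dl = train_dataloader + " (prepared)"
    return train_dl

print(broken())  # caught: UnboundLocalError
print(fixed())   # global dataloader (prepared)
```

So the error isn't specific to Accelerate or TPUs; any `train_dataloader = accelerator.prepare(..., train_dataloader, ...)` line inside the function shadows the global for the whole function body.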


Hmm, yes, that’s weird. Also, how can you access the model after you’ve trained it? The model variable is only defined within the training function.

Update: I tried adding model.save_pretrained(directory_name) within the function, so that you can access the model after training, but this gives me an error:

Configuration saved in /content/config.json
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-28-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
----> 3 notebook_launcher(training_function)

3 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/ in join(self, timeout)
    134           ,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 7 terminated with signal SIGKILL

This is probably because the model is still replicated on 8 TPU cores?

You need to specifically use the save function PyTorch XLA gives you to avoid this error, so use accelerator.save (which dispatches to it on TPU). See the doc on saving for more context, but basically:

# Wait for all the processes to arrive
accelerator.wait_for_everyone()
# Save the real model, not the DDP wrapper
unwrapped_model = accelerator.unwrap_model(model)
# Use accelerator.save to save
accelerator.save(unwrapped_model.state_dict(), filename)

See also this script for another example.
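Once the state dict has been saved that way, it can be loaded back in the main process after notebook_launcher returns. A minimal sketch, with a hypothetical TinyModel class standing in for the actual model (in real code you would instantiate your own model class and use the path you saved to):

```python
import torch
import torch.nn as nn

# Hypothetical placeholder standing in for the real model class.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

# What the training processes would have done via accelerator.save(...):
model = TinyModel()
torch.save(model.state_dict(), "model.pt")

# Back in the main process: rebuild the model and load the saved weights.
restored = TinyModel()
restored.load_state_dict(torch.load("model.pt", map_location="cpu"))
```

The key point is that the saved file is a plain state dict of CPU/XLA tensors, so it can be reloaded into a fresh model instance on any device.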