Error when instantiating dataloaders outside training_function

Currently testing out HuggingFace Accelerate on TPUs on Colab based on the example notebook.

My question: Do you have to instantiate the dataloaders inside the training function? Because I didn’t, and then I get this error:

UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:5: local variable 'train_dataloader' referenced before assignment
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils.py", line 274, in __call__
    self.launcher(*args)
  File "<ipython-input-20-001611a4ea01>", line 33, in training_function
    model, optimizer, train_dataloader, val_dataloader, test_dataloader
UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:4: local variable 'train_dataloader' referenced before assignment

It’s weird, because accelerator.prepare is called within the training_function. train_dataloader is a global variable, is that the problem?

cc @sgugger

Update: apparently you have to, because it only works when the dataloaders are instantiated inside the training_function.

Yes, we were just investigating that with @lewtun. Weirdly, it works if you name the prepared dataloader differently (train_dl for instance) but not if it has the same name as the global variable.
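The renaming observation matches Python's scoping rules: assigning to a name anywhere inside a function makes that name local to the whole function body, so reading the same-named global before the assignment raises UnboundLocalError. Here's a minimal pure-Python sketch (no Accelerate or TPU needed) that reproduces the behavior; the list and function names are just placeholders for the real dataloader and training function:

```python
# Stand-in for the globally defined DataLoader.
train_dataloader = ["batch0", "batch1"]

def broken_training_function():
    # Reading train_dataloader here fails: because it is assigned below,
    # Python treats it as a local variable that hasn't been bound yet.
    prepared = list(train_dataloader)
    train_dataloader = prepared  # same name -> the variable becomes local
    return train_dataloader

def working_training_function():
    # Binding the result to a *different* name (train_dl) leaves
    # train_dataloader as a plain global read, so this works.
    train_dl = list(train_dataloader)
    return train_dl

try:
    broken_training_function()
except UnboundLocalError as e:
    print("broken:", e)

print("working:", working_training_function())
```

This is the same pattern as `train_dataloader = accelerator.prepare(..., train_dataloader, ...)` inside training_function: the assignment shadows the global, and the read on the right-hand side happens before the local is bound.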


Hmm yes that’s weird. Also, how can you access the model after you’ve trained it? Because the model variable is only defined within the training function.

Update: I tried by adding model.save_pretrained(directory_name) within the function, such that you can access it after training, but this gives me an error:

Configuration saved in /content/config.json
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-28-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
      2 
----> 3 notebook_launcher(training_function)

/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    134                     error_pid=failed_process.pid,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 7 terminated with signal SIGKILL

This is probably because the model is still replicated on 8 TPU cores?

You need to use the save function PyTorch XLA provides to avoid this error, i.e. accelerator.save. See the docs on saving for more context, but basically:

# Wait for all the processes to arrive
accelerator.wait_for_everyone()
# Save the real model, not the distributed wrapper
unwrapped_model = accelerator.unwrap_model(model)
# Use accelerator.save as the save function
unwrapped_model.save_pretrained(local_folder, save_function=accelerator.save)

See also this script for another example.
