Error when instantiating dataloaders outside training_function

Currently testing out HuggingFace Accelerate on TPUs on Colab based on the example notebook.

My question: do you have to instantiate the dataloaders inside the training function? I didn't, and I get this error:

UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:5: local variable 'train_dataloader' referenced before assignment
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/", line 274, in __call__
  File "<ipython-input-20-001611a4ea01>", line 33, in training_function
    model, optimizer, train_dataloader, val_dataloader, test_dataloader
UnboundLocalError: local variable 'train_dataloader' referenced before assignment
Exception in device=TPU:4: local variable 'train_dataloader' referenced before assignment

It’s weird, because accelerator.prepare is called within the training_function. train_dataloader is a global variable; is that the problem?

cc @sgugger

Update: apparently you do have to, because it only works when the dataloaders are instantiated inside the training_function.

Yes, we were just investigating that with @lewtun. Weirdly, it works if you name the prepared dataloader differently (train_dl, for instance) but not if it has the same name as the global variable.
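That matches Python's scoping rules: any name assigned anywhere in a function body is treated as local for the whole function, so reading the global of the same name before that assignment raises UnboundLocalError. A minimal sketch, with toy strings standing in for the real dataloaders:

```python
train_dataloader = "global dataloader"  # module-level, like in the notebook

def broken():
    # The assignment further down makes `train_dataloader` local to the
    # whole function, so this read fails before the assignment ever runs.
    try:
        prepared = train_dataloader + " (prepared)"
    except UnboundLocalError as e:
        return "caught: " + type(e).__name__
    train_dataloader = prepared  # rebinding the SAME name triggers the issue
    return train_dataloader

def fixed():
    # Binding the result to a new name (train_dl) leaves the global readable.
    train_dl = train_dataloader + " (prepared)"
    return train_dl

print(broken())  # caught: UnboundLocalError
print(fixed())   # global dataloader (prepared)
```

So the error isn't specific to Accelerate or TPUs; any `train_dataloader = accelerator.prepare(..., train_dataloader, ...)` line inside the function shadows the global for the whole function body.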


Hmm, yes, that’s weird. Also, how can you access the model after you’ve trained it? The model variable is only defined within the training function.

Update: I tried adding model.save_pretrained(directory_name) within the function, so that you can access the model after training, but this gives me an error:

Configuration saved in /content/config.json
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-28-a91f3c0bb4fd> in <module>()
      1 from accelerate import notebook_launcher
----> 3 notebook_launcher(training_function)

3 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/ in join(self, timeout)
    134           ,
    135                     exit_code=exitcode,
--> 136                     signal_name=name
    137                 )
    138             else:

ProcessExitedException: process 7 terminated with signal SIGKILL

This is probably because the model is still replicated on 8 TPU cores?

You need to specifically use the save function PyTorch XLA gives you to avoid this error, so use accelerator.save (which dispatches to it on TPU). See the doc on saving for more context, but basically:

# Wait for all the processes to arrive
accelerator.wait_for_everyone()
# Save the real model, not the DDP wrapper
unwrapped_model = accelerator.unwrap_model(model)
# Use accelerator.save to save
accelerator.save(unwrapped_model.state_dict(), filename)

See also this script for another example.
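Once the state dict has been saved that way, it can be loaded back in the main process after notebook_launcher returns. A minimal sketch, with a hypothetical TinyModel class standing in for the actual model (in real code you would instantiate your own model class and use the path you saved to):

```python
import torch
import torch.nn as nn

# Hypothetical placeholder standing in for the real model class.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

# What the training processes would have done via accelerator.save(...):
model = TinyModel()
torch.save(model.state_dict(), "model.pt")

# Back in the main process: rebuild the model and load the saved weights.
restored = TinyModel()
restored.load_state_dict(torch.load("model.pt", map_location="cpu"))
```

The key point is that the saved file is a plain state dict of CPU/XLA tensors, so it can be reloaded into a fresh model instance on any device.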