RAM memory issues while training with torch.distributed.launch

orik · June 20, 2022, 8:57am

Hello everyone!

I’m training a model on a single machine with 8 GPUs using torch.distributed.launch. I’m using pretty standard settings - ConvBertForMaskedLM, dataloader_num_workers=8, etc.

My data consists of 100M sentences of about 30 tokens each. To save memory, each sample is loaded on demand from HDF5 files inside the data loader.

I get the following error during the evaluation phase of the first epoch:
Traceback (most recent call last):
  File "…/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "…/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "…/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/…/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "…/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
…
died with <Signals.SIGKILL: 9>.

I assume this happens because the machine runs out of RAM.
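For anyone reading the traceback above: the launcher only reports that a child process died from signal 9, it does not say why. A minimal sketch of how a subprocess return code maps to a signal name is below. The helper name `describe_returncode` is made up for illustration and is not part of torch; it relies on the POSIX convention that a negative return code means "killed by that signal".

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a readable description (POSIX)."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# A worker that died the way the traceback describes would report -9.
print(describe_returncode(-9))
print(describe_returncode(1))
```

SIGKILL cannot be caught by the process, so there is no Python traceback from the worker itself. On Linux, if the kernel's out-of-memory killer was responsible, the kernel log (for example via `dmesg`) usually records which process it killed, which is one way to confirm or rule out the RAM assumption.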

Reducing the num_workers to 1 does not solve the issue.
Reducing the data from 100M sentences to say 10M does solve the issue.
Reducing the evaluation holdout set from 10% to 1% of the data also solves the issue.

This confused me, as the data is loaded from HDF5 files without caching (I’m opening the file on every call to Dataset.__getitem__()). Also, I expected that reducing num_workers would solve the issue if it stemmed from the data loader.

It seems that some additional processing during evaluation causes this failure. Has anyone else experienced this? Tips on how to debug it further would also be highly appreciated.
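One way to narrow this down is to log the memory footprint of each process as evaluation progresses, to see which process grows and when. The sketch below uses only the standard library; `peak_rss_mib` and `log_memory` are hypothetical helper names, and where to call them (for example at the start of evaluation and every few hundred steps) depends on the training script.

```python
import os
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag: str) -> str:
    """Print and return a one-line memory report tagged with the process id."""
    line = f"[pid {os.getpid()}] {tag}: peak RSS {peak_rss_mib():.1f} MiB"
    print(line, flush=True)
    return line


log_memory("before evaluation")
```

Note that this value is a high-water mark, so it never decreases, and it covers only the calling process. Each of the 8 training processes and each data loader worker is a separate process, so the call has to run inside whichever process is being checked.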

GenV · October 19, 2022, 7:55am

Was this solved? I have a similar issue.
