RAM issues while training with torch.distributed.launch

Hello everyone!

I’m training a model on a single machine with 8 GPUs using torch.distributed.launch. I’m using pretty standard settings - ConvBertForMaskedLM, dataloader_num_workers=8, etc.

My data consists of 100M sentences of about 30 tokens each. To save memory, each sample is loaded on demand from HDF5 files inside the data loader.

I get the following error during the evaluation phase of the first epoch:
Traceback (most recent call last):
  File "…/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "…/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "…/lib/python3.9/site-packages/torch/distributed/launch.py", line 340, in <module>
  File "/…/lib/python3.9/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "…/lib/python3.9/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

died with <Signals.SIGKILL: 9>.

I assume the process was killed because the machine ran out of RAM.

  • Reducing num_workers to 1 does not solve the issue.
  • Reducing the data from 100M sentences to, say, 10M does solve the issue.
  • Reducing the evaluation holdout set from 10% to 1% of the data also solves the issue.

This confused me, because the data is loaded from the HDF5 files without caching (I open the file on every call to Dataset.__getitem__()). I also expected that reducing num_workers would solve the issue if it stemmed from the data loader.
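To double-check the "no caching" assumption, a minimal sketch like the one below could measure how much Python-level memory is still held after many `__getitem__` calls. The helper name `heap_growth_bytes` is hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory held inside a C library would not show up here.

```python
import gc
import tracemalloc


def heap_growth_bytes(dataset, indices):
    """Return how many bytes of traced Python memory remain allocated
    after reading dataset[i] for every i in indices.

    A value close to zero suggests the dataset does not retain samples;
    a value that scales with len(indices) suggests something is cached.
    """
    gc.collect()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()

    sample = None
    for i in indices:
        sample = dataset[i]  # the sample is discarded on the next iteration
    del sample

    gc.collect()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before
```

Calling it with the real dataset object and, for example, `range(10_000)` would show whether the retained memory grows with the number of samples read.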

It seems that some additional processing during evaluation causes this failure. Has anyone else experienced this? Tips on how to debug further would also be highly appreciated.
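As a starting point for debugging, a small helper along these lines could log each process's peak resident memory before and after the evaluation phase, to see which step makes it jump. This is only a sketch: the function names are hypothetical, it relies on the Unix-only `resource` module, and it assumes the launcher exposes the process rank through a `LOCAL_RANK` environment variable (it falls back to "?" otherwise).

```python
import os
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag: str) -> str:
    """Print and return one line with the rank, a tag, and the peak RSS."""
    # Assumption: LOCAL_RANK is set by the launcher; "?" is used if it is not.
    rank = os.environ.get("LOCAL_RANK", "?")
    line = f"[rank {rank}] {tag}: peak RSS = {peak_rss_mib():.1f} MiB"
    print(line, flush=True)
    return line
```

Calling `log_memory("before eval")` and `log_memory("after eval")` around the evaluation step (and perhaps every few hundred evaluation batches) should show whether memory grows steadily with the number of evaluated samples.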