Trainer CUDA OOM error when saving optimizer

Environment info

  • transformers version: 4.10.2
  • Platform: Linux-4.19.117.bsk.5-amd64-x86_64-with-debian-10.10
  • Python version: 3.7.3
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: distributed training

Models:

  • Customized model: XLM-Roberta plus layout information (similar to the LayoutLM model)

Information

Model I am using: a customized XLM-Roberta + Layout model (see above).

The problem arises when using:
the official language modeling script run_mlm.py (master branch of huggingface/transformers)

The task I am working on is:
my own dataset

The error occurs while saving the optimizer, specifically in the consolidate_state_dict step (two of the distributed ranks crash with the same traceback):

[INFO|trainer.py:2183] 2021-11-25 01:24:58,904 >>   Num examples = 120410
[INFO|trainer.py:2186] 2021-11-25 01:24:58,905 >>   Batch size = 8
{'eval_loss': 2.7650530338287354, 'eval_runtime': 276.6485, 'eval_samples_per_second': 435.245, 'eval_steps_per_second': 6.803, 'epoch': 1.18}
  0%|          | 10000/10000000 [1:16:22<1240:39:21]
[INFO|trainer.py:1935] 2021-11-25 01:29:35,564 >> Saving model checkpoint to model_files/pretrain_online/checkpoint-10000
[INFO|configuration_utils.py:391] 2021-11-25 01:29:35,565 >> Configuration saved in model_files/pretrain_online/checkpoint-10000/config.json
[INFO|modeling_utils.py:1001] 2021-11-25 01:29:38,209 >> Model weights saved in model_files/pretrain_online/checkpoint-10000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2020] 2021-11-25 01:29:38,210 >> tokenizer config file saved in model_files/pretrain_online/checkpoint-10000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2026] 2021-11-25 01:29:38,210 >> Special tokens file saved in model_files/pretrain_online/checkpoint-10000/special_tokens_map.json
Traceback (most recent call last):
  File "mlm_main.py", line 169, in <module>
    main()
  File "mlm_main.py", line 154, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tiger/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1340, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tiger/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1449, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tiger/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1517, in _save_checkpoint
    self.optimizer.consolidate_state_dict()
  File "/home/tiger/.local/lib/python3.7/site-packages/fairscale/optim/oss.py", line 362, in consolidate_state_dict
    group=self.group,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1743, in broadcast_object_list
    object_list[i] = _tensor_to_object(obj_view, obj_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1456, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
  File "/usr/local/lib/python3.7/dist-packages/torch/storage.py", line 161, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 155, in _cuda_deserialize
    return storage_type(obj.size())
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 528, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
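
For context on where the memory goes: in the traceback, fairscale's OSS.consolidate_state_dict() gathers every rank's optimizer-state shard through torch.distributed.broadcast_object_list, and each received payload is unpickled with torch.load. Because the shards contain CUDA tensors, the _cuda_deserialize frame allocates them straight onto a GPU that is already holding model, gradients, and activations, so each rank temporarily needs enough free memory for another rank's full shard. The sketch below reproduces that pattern; it is written from the traceback rather than copied from fairscale, and gather_optimizer_shards is a made-up helper name.

```python
import torch
import torch.distributed as dist


def gather_optimizer_shards(optimizer, group=None):
    """Hypothetical helper mirroring the consolidation pattern seen in the trace."""
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    shards = []
    for src in range(world_size):
        # Each rank takes a turn broadcasting its own shard of optimizer state.
        payload = [optimizer.state_dict()] if src == rank else [None]
        # On every non-src rank, broadcast_object_list unpickles the received
        # bytes; the CUDA storages inside the state dict are re-created
        # directly on GPU memory that is already nearly full with training
        # state -- this is the allocation that raises the OOM above.
        dist.broadcast_object_list(payload, src=src, group=group)
        if rank == 0:
            shards.append(payload[0])  # only the recipient keeps the shards
    return shards
```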

Solution

  1. When I reduce the per-GPU batch size from 8 to 2, it works well.
  2. It also works when I disable optimizer saving by commenting out the consolidate_state_dict() call and the optimizer-saving part in trainer.py (see the sketch below for a way to do this without patching the installed file).
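
As a sketch of workaround 2 that avoids patching the installed trainer.py, a small Trainer subclass can write only the model weights and trainer state at each checkpoint. This assumes that losing the optimizer/scheduler state on resume is acceptable; _save_checkpoint is a private Trainer method, and SkipOptimizerCheckpointTrainer is an illustrative name, not part of the library.

```python
import os

from transformers import Trainer


class SkipOptimizerCheckpointTrainer(Trainer):
    """Hypothetical subclass: checkpoint model weights and trainer state only,
    skipping the sharded-optimizer consolidation that triggers the OOM.
    Checkpoints written this way contain no optimizer.pt, so a resumed run
    restarts the optimizer from scratch."""

    def _save_checkpoint(self, model, trial, metrics=None):
        output_dir = os.path.join(
            self.args.output_dir, f"checkpoint-{self.state.global_step}"
        )
        # Saves config.json, pytorch_model.bin and the tokenizer files --
        # the part of the checkpoint that already succeeded in the log above.
        self.save_model(output_dir)
        if self.is_world_process_zero():
            self.state.save_to_json(os.path.join(output_dir, "trainer_state.json"))
        # Deliberately no self.optimizer.consolidate_state_dict() and no
        # torch.save(self.optimizer.state_dict(), ...) here.
```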

I'd like to understand why saving the optimizer state takes so much GPU memory.
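
To quantify how little headroom is left at batch size 8 versus batch size 2, a simple callback can print per-rank GPU memory during training. This is only a diagnostic sketch; GpuMemoryCallback is a made-up name, while the TrainerCallback hooks themselves are part of the library.

```python
import torch
from transformers import TrainerCallback


class GpuMemoryCallback(TrainerCallback):
    """Print per-rank GPU memory every `logging_steps` steps, to compare the
    free headroom against the size of the optimizer shards being gathered."""

    def on_step_end(self, args, state, control, **kwargs):
        if args.logging_steps and state.global_step % args.logging_steps == 0:
            device = torch.cuda.current_device()
            allocated = torch.cuda.memory_allocated(device) / 2**30
            reserved = torch.cuda.memory_reserved(device) / 2**30
            total = torch.cuda.get_device_properties(device).total_memory / 2**30
            print(
                f"step {state.global_step}: allocated={allocated:.1f} GiB, "
                f"reserved={reserved:.1f} GiB, total={total:.1f} GiB"
            )


# Usage: trainer.add_callback(GpuMemoryCallback())
```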