Environment info
- `transformers` version: 4.10.2
- Platform: Linux-4.19.117.bsk.5-amd64-x86_64-with-debian-10.10
- Python version: 3.7.3
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: distributed training
Models:
- Custom model: basically XLM-RoBERTa plus layout information (similar to the LayoutLM model)
Library:
- Trainer: @sgugger
Information
Model I am using: XLM-RoBERTa + Layout (custom, as described above)
The problem arises when using:
The official language modeling script: `run_mlm.py` in huggingface/transformers (master branch)
The task I am working on is:
- my own dataset
The error occurs while saving the optimizer, specifically at the `consolidate_state_dict` step:
[INFO|trainer.py:2183] 2021-11-25 01:24:58,904 >> Num examples = 120410
[INFO|trainer.py:2186] 2021-11-25 01:24:58,905 >> Batch size = 8
{'eval_loss': 2.7650530338287354, 'eval_runtime': 276.6485, 'eval_samples_per_second': 435.245, 'eval_steps_per_second': 6.803, 'epoch': 1.18}
0%| | 10000/10000000 [1:16:22<1240:39:21
[INFO|trainer.py:1935] 2021-11-25 01:29:35,564 >> Saving model checkpoint to model_files/pretrain_online/checkpoint-10000
[INFO|configuration_utils.py:391] 2021-11-25 01:29:35,565 >> Configuration saved in model_files/pretrain_online/checkpoint-10000/config.json
[INFO|modeling_utils.py:1001] 2021-11-25 01:29:38,209 >> Model weights saved in model_files/pretrain_online/checkpoint-10000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2020] 2021-11-25 01:29:38,210 >> tokenizer config file saved in model_files/pretrain_online/checkpoint-10000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2026] 2021-11-25 01:29:38,210 >> Special tokens file saved in model_files/pretrain_online/checkpoint-10000/special_tokens_map.json
Traceback (most recent call last):
  File "mlm_main.py", line 169, in <module>
    main()
  File "mlm_main.py", line 154, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tiger/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1340, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tiger/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1449, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tiger/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1517, in _save_checkpoint
    self.optimizer.consolidate_state_dict()
  File "/home/tiger/.local/lib/python3.7/site-packages/fairscale/optim/oss.py", line 362, in consolidate_state_dict
    group=self.group,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1743, in broadcast_object_list
    object_list[i] = _tensor_to_object(obj_view, obj_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1456, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
  File "/usr/local/lib/python3.7/dist-packages/torch/storage.py", line 161, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 155, in _cuda_deserialize
    return storage_type(obj.size())
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 528, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
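For what it's worth, my reading of the traceback: `OSS.consolidate_state_dict` gathers every rank's optimizer shard through `torch.distributed.broadcast_object_list`, which pickles the shard and unpickles it on the receiving rank. The shard's tensors live on CUDA, so `torch.load` restores them straight onto the GPU (the `_cuda_deserialize` frame above), and the receiving rank has to hold the incoming shards in GPU memory on top of the model, activations, and its own optimizer state. A minimal sketch of that pickle round trip (illustration only, not the fairscale source; `pickle_roundtrip` is my own helper):

```python
import io
import torch

def pickle_roundtrip(obj):
    # Stand-in for the save/load that broadcast_object_list ->
    # _tensor_to_object performs internally.
    buf = io.BytesIO()
    torch.save(obj, buf)  # tensors are pickled together with their device
    buf.seek(0)
    # No map_location="cpu" here, mirroring the traceback: CUDA storages
    # are re-allocated on the GPU of the receiving process.
    return torch.load(buf)

if torch.cuda.is_available():
    # Stand-in for one rank's Adam shard (exp_avg / exp_avg_sq buffers).
    shard = {"exp_avg": torch.randn(1024, 1024, device="cuda")}
    restored = pickle_roundtrip(shard)
    print(restored["exp_avg"].device)  # cuda:0, not CPU
```

If that is what happens, a GPU already near capacity at batch size 8 would be tipped over by the extra shards, which matches the observations below.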
Solution
- When I change the per-GPU batch size from 8 to 2, it works well.
- When I disable optimizer saving by commenting out the `consolidate_state_dict` call together with the optimizer-saving code, it also works well.

I wonder why saving the optimizer takes so much GPU memory.
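My current guess: with sharded DDP, each rank holds only its slice of the Adam state (`exp_avg` and `exp_avg_sq`, roughly 8 bytes per parameter in fp32), and consolidation re-materializes all the other ranks' slices on the saving rank's GPU, i.e. close to the full unsharded optimizer state on one device. A possible workaround I have not tested: move the local shard to CPU before consolidating so the pickled tensors deserialize on CPU, then move it back. This is only a sketch; `move_optimizer_state` is a hypothetical helper, and `trainer.optimizer.optim` assumes the fairscale OSS wrapper exposes its inner optimizer as `optim`:

```python
import torch

def move_optimizer_state(optimizer, device):
    # Move every tensor in the optimizer's state to `device` (for Adam
    # this is the exp_avg / exp_avg_sq buffers kept per parameter).
    for param_state in optimizer.state.values():
        for key, value in param_state.items():
            if torch.is_tensor(value):
                param_state[key] = value.to(device)

# Hypothetical usage around the failing save; trainer.optimizer is the
# fairscale OSS wrapper and OSS.optim the sharded inner optimizer:
#
#   move_optimizer_state(trainer.optimizer.optim, torch.device("cpu"))
#   trainer.optimizer.consolidate_state_dict()  # shards pickle as CPU tensors
#   move_optimizer_state(trainer.optimizer.optim, torch.device("cuda"))
```

Shrinking the batch size or skipping the optimizer checkpoint both avoid the spike too, but the first wastes throughput and the second loses the ability to resume training exactly.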