Getting an error when resuming training with a single GPU

Hello all,

I’m new to the Hugging Face library, though I’m familiar with deep learning in general and with some of the DL libraries.

I’m trying to train a BERT model from scratch on a custom text corpus. I followed the instructions here and managed to get training going with 2 GPUs. Later on, however, I had to stop the training and resume it with a single GPU, at which point I got the error below. I’ve searched this forum, Stack Overflow, etc., but haven’t found a solution so far.

The environment info is as follows (the result of the command transformers-cli env):

  • transformers version: 4.10.0.dev0
  • Platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.8.1+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: First yes, then no (I guess)

I created a big text file with one sentence per line and a blank line between documents (I guess this is called the NSP format), and I’m using a tokenizer from the Hugging Face Hub.
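For illustration, the file is laid out roughly like this (placeholder sentences, not my actual data):

This is the first sentence of document one.
This is the second sentence of document one.

This is the first sentence of document two.
Another sentence of document two.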
So I started the training with this command:

python run_mlm.py --model_type bert --train_file <path_to_my_train_file> --validation_file <path_to_my_val_file> --do_train --do_eval --output_dir /my/local/path/test-mlm --tokenizer_name dbmdz/bert-base-turkish-cased --cache_dir /my/local/cache/dir --line_by_line --save_total_limit 2 --logging_dir /my/log/dir

I have two GPUs, and this command used both of them as intended. Then I stopped the training and tried to resume it on one GPU with the following command (basically, I just added CUDA_VISIBLE_DEVICES=1 at the beginning):

CUDA_VISIBLE_DEVICES=1 python run_mlm.py --model_type bert --train_file <path_to_my_train_file> --validation_file <path_to_my_val_file> --do_train --do_eval --output_dir /my/local/path/test-mlm --tokenizer_name dbmdz/bert-base-turkish-cased --cache_dir /my/local/cache/dir --line_by_line --save_total_limit 2 --logging_dir /my/log/dir

First, it skipped some batches in order to resume from the last checkpoint, and then it threw the following traceback while loading the checkpoint:

Traceback (most recent call last):
  File "run_mlm.py", line 550, in <module>
    main()
  File "run_mlm.py", line 501, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/my/local/path/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1262, in train
    self._load_rng_state(resume_from_checkpoint)
  File "/my/local/path/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1477, in _load_rng_state
    torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/random.py", line 73, in set_rng_state_all
    set_rng_state(state, i)
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/random.py", line 64, in set_rng_state
    _lazy_call(cb)
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 114, in _lazy_call
    callable()
  File "/my/local/path/venv/lib/python3.8/site-packages/torch/cuda/random.py", line 61, in cb
    default_generator = torch.cuda.default_generators[idx]
IndexError: tuple index out of range
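If I read the traceback right, the checkpoint’s CUDA RNG state was saved while both GPUs were visible, and restoring it with only one visible device fails. Here is a minimal sketch of what I think is happening (untested; the checkpoint-XXXX path is just a placeholder, and I’m assuming the RNG states live in rng_state.pth inside the checkpoint folder):

import torch

# Assumed location of the saved RNG states inside the checkpoint folder.
rng_file = "/my/local/path/test-mlm/checkpoint-XXXX/rng_state.pth"
checkpoint_rng_state = torch.load(rng_file)

cuda_states = checkpoint_rng_state["cuda"]   # one RNG state per GPU at save time
print(len(cuda_states))                      # 2 in my case (saved with 2 GPUs)
print(torch.cuda.device_count())             # 1 when CUDA_VISIBLE_DEVICES=1

# set_rng_state_all() loops over the saved states and tries to restore state
# index 1 on a machine that now only exposes device 0, hence the IndexError
# from torch.cuda.default_generators[1].
torch.cuda.random.set_rng_state_all(cuda_states)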

It works without a problem if I remove CUDA_VISIBLE_DEVICES=1, but then it uses both GPUs. As a desperate attempt, I doubled per_device_train_batch_size (the default is 8), hoping it would somehow balance out the single-GPU index error:

CUDA_VISIBLE_DEVICES=1 python run_mlm.py --model_type bert --train_file <path_to_my_train_file> --validation_file <path_to_my_val_file> --do_train --do_eval --output_dir /my/local/path/test-mlm --tokenizer_name dbmdz/bert-base-turkish-cased --cache_dir /my/local/cache/dir --line_by_line --save_total_limit 2 --logging_dir /my/log/dir --per_device_train_batch_size 16

Unfortunately, this didn’t work either, so I’m stuck at this point.
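One idea I haven’t dared to try yet is trimming the saved CUDA RNG states down to the number of currently visible GPUs before resuming, roughly like this (again, the checkpoint-XXXX path is a placeholder and the rng_state.pth filename is an assumption on my part; I’m also not sure whether dropping a GPU’s RNG state is safe for reproducibility):

import torch

ckpt_dir = "/my/local/path/test-mlm/checkpoint-XXXX"  # placeholder checkpoint folder
rng_file = f"{ckpt_dir}/rng_state.pth"

state = torch.load(rng_file)
# Keep only as many CUDA RNG states as there are GPUs visible right now.
state["cuda"] = state["cuda"][: torch.cuda.device_count()]
torch.save(state, rng_file)

Would something like that be acceptable, or is there a proper way to resume a multi-GPU checkpoint on a single GPU?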
Any help is appreciated. Thanks!