Hi,
I’m getting a memory error when I run the example code for language modeling. I’m trying to pre-train a RoBERTa model on a 25GB text dataset, on a Google Cloud Platform virtual machine with a v3-8 TPU.
I’m using the following command with transformers/examples/xla_spawn.py and transformers/examples/run_language_modeling.py.
python xla_spawn.py --num_cores 8 \
run_language_modeling.py \
--output_dir=[*****] \
--config_name=[*****] \
--tokenizer_name=[*****] \
--do_train \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 128 \
--learning_rate 6e-4 \
--weight_decay 0.01 \
--adam_epsilon 1e-6 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--max_steps 500_000 \
--warmup_steps 24_000 \
--save_total_limit 5 \
--save_steps=100_000 \
--block_size=512 \
--train_data_file=[*****] \
--mlm \
--line_by_line
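For context, the flags above imply a fairly large effective global batch size (per-device batch size × number of TPU cores × gradient accumulation steps). This is just arithmetic over the command's own values:

```python
# Effective global batch size implied by the command above:
# sequences consumed per optimizer step across all 8 TPU cores.
per_device_train_batch_size = 8
num_cores = 8
gradient_accumulation_steps = 128

effective_batch = (per_device_train_batch_size
                   * num_cores
                   * gradient_accumulation_steps)
print(effective_batch)  # 8192
```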
When I run this, I get the following error.
08/20/2020 15:21:07 - INFO - transformers.data.datasets.language_modeling - Creating features from dataset file at [*****]
Traceback (most recent call last):
File "xla_spawn.py", line 72, in <module>
main()
File "xla_spawn.py", line 68, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
start_method=start_method)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
(error_index, name)
Exception: process 0 terminated with signal SIGKILL
It looks like the script gets killed while it’s loading the training data here.
with open(file_path, encoding="utf-8") as f:
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
When I run just the above block of code with transformers/examples/xla_spawn.py, I get the same error.
Traceback (most recent call last):
File "xla_spawn.py", line 72, in <module>
main()
File "xla_spawn.py", line 68, in main
xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
start_method=start_method)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
(error_index, name)
Exception: process 0 terminated with signal SIGKILL
When I run the above block of code on an n1-highmem-16 instance (16 vCPUs, 104 GB memory) without a TPU, I still get an error.
Traceback (most recent call last):
File "debug_load.py", line 7, in <module>
lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
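My guess (and it is only a guess) as to why even 104 GB isn’t enough: f.read() first materializes the whole 25GB file as a single string, and splitlines() then builds a second copy as a list of per-line string objects, each carrying CPython’s fixed per-object overhead on top of its characters:

```python
import sys

# Each CPython str object has fixed overhead beyond its characters,
# so a list of millions of short lines costs far more than the raw
# file size -- on top of the full-file string f.read() keeps alive.
line = "a" * 40                        # a typical short ASCII line
overhead = sys.getsizeof(line) - len(line)
print(overhead)                        # tens of bytes per line on CPython
```

With that per-line overhead, plus the 25GB string still alive while the list is built, peak usage can plausibly be several times the file size.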
Has anyone successfully reproduced the original RoBERTa model, or pre-trained a language model on a large dataset, using Hugging Face’s transformers (with a TPU)? If so, what are the specifications of your machine? Has this code (transformers/examples/run_language_modeling.py) been tested on a large dataset?