Pre-training a language model on a large dataset

Hi,

I’m getting a memory error when I run the example code for language modeling. I’m interested in pre-training a RoBERTa model on 25 GB of text data on a virtual machine with a v3-8 TPU on Google Cloud Platform.

I’m using the following command with transformers/examples/xla_spawn.py and transformers/examples/run_language_modeling.py.

python xla_spawn.py --num_cores 8 \
run_language_modeling.py \
    --output_dir=[*****] \
    --config_name=[*****] \
    --tokenizer_name=[*****] \
    --do_train \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 128 \
    --learning_rate 6e-4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --max_steps 500_000 \
    --warmup_steps 24_000 \
    --save_total_limit 5 \
    --save_steps=100_000 \
    --block_size=512 \
    --train_data_file=[*****] \
    --mlm \
    --line_by_line

When I run this, I get the following error.

08/20/2020 15:21:07 - INFO - transformers.data.datasets.language_modeling -   Creating features from dataset file at [*****]
Traceback (most recent call last):
  File "xla_spawn.py", line 72, in <module>
    main()
  File "xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

It looks like the script gets killed while it’s loading the training data here.

with open(file_path, encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

When I run the above block of code separately with transformers/examples/xla_spawn.py, I get the same error.

Traceback (most recent call last):
  File "xla_spawn.py", line 72, in <module>
    main()
  File "xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

When I run the above block of code separately using n1-highmem-16 (16 vCPUs, 104 GB memory) without TPU, I still get an error.

Traceback (most recent call last):
  File "debug_load.py", line 7, in <module>
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
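
A note on why even 104 GB may not be enough: the last traceback fails inside the decode step of f.read(), which has to hold the raw bytes and the fully decoded string at the same time. CPython stores a str with 1, 2, or 4 bytes per character depending on the widest character it contains, so the decoded text can be considerably larger than the file on disk, and splitlines() would then build a second copy as a list of lines. Below is a minimal sketch of reading the same lines lazily instead. The helper name iter_nonblank_lines is made up for illustration and is not part of transformers. One assumption to be aware of: iterating over a file splits on ordinary newlines only, whereas splitlines() also splits on a few rarer separators.

```python
import os
import tempfile


def iter_nonblank_lines(file_path, encoding="utf-8"):
    """Yield non-empty, non-whitespace lines one at a time.

    Only the current line is held in memory, unlike f.read().splitlines().
    (Hypothetical helper, not part of transformers.)
    """
    with open(file_path, encoding=encoding) as f:
        for line in f:  # the file object reads lazily, line by line
            line = line.rstrip("\n")
            if len(line) > 0 and not line.isspace():
                yield line


# Tiny self-check on a temporary file (a stand-in for the real training file).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\n\n   \nsecond line\n")
demo_path = tmp.name

demo_lines = list(iter_nonblank_lines(demo_path))
os.remove(demo_path)
print(demo_lines)
```

This only removes the peak from reading the file. If every line is then tokenized and kept in a list, memory will still grow with the dataset, so the tokenized examples would need to be produced lazily as well.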

Has anyone successfully reproduced the original RoBERTa model or pretrained a language model on a large dataset using Hugging Face’s transformers (with TPU)? If so, what are the specifications of your machine? Has this code (transformers/examples/run_language_modeling.py) been tested on a large dataset?

Same problem here. It seems that run_language_modeling.py is not able to deal with very large files. Any help?! @valhalla @lhoestq
Thanks

Hi @user123
If you have a large dataset, you’ll need to write your own dataset class that loads examples lazily. Also consider using the datasets library. It lets you memory-map the dataset and cache the processed data: memory mapping keeps RAM usage low, and caching lets you reuse the processed dataset.
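
To make the "write your own dataset" idea concrete, here is a minimal sketch, not a tested drop-in for run_language_modeling.py. The class name LazyLineDataset is hypothetical. It scans the file once, stores only the byte offset of each non-blank line, and reads a single line from disk on every lookup. It follows the `__len__`/`__getitem__` protocol used by map-style PyTorch datasets. It assumes lines end with "\n" and it returns raw text, so tokenization would still have to happen per example (for instance in `__getitem__` or in the collate function).

```python
import array
import os
import tempfile


class LazyLineDataset:
    """Map-style dataset that keeps only line offsets in memory (hypothetical helper)."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = file_path
        self.encoding = encoding
        # Unsigned 64-bit integers: 8 bytes per line instead of a full Python object.
        self.offsets = array.array("Q")
        # Binary mode so that positions are exact byte offsets usable with seek().
        with open(file_path, "rb") as f:
            pos = 0
            for raw in f:
                if raw.strip():  # skip empty lines and ASCII-whitespace-only lines
                    self.offsets.append(pos)
                pos += len(raw)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Reopen per lookup: slower, but safe when several worker processes share the dataset.
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode(self.encoding).rstrip("\r\n")


# Tiny self-check on a temporary file (a stand-in for the real training file).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("first\n\n  \nhéllo wörld\nthird".encode("utf-8"))
demo_path = tmp.name

demo_ds = LazyLineDataset(demo_path)
demo_len = len(demo_ds)
demo_items = [demo_ds[i] for i in range(demo_len)]
os.remove(demo_path)
print(demo_len, demo_items)
```

The trade-off is one seek and read per example instead of a list lookup, in exchange for memory that grows with the number of lines rather than with the size of the text.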

Hi @valhalla
Thanks for your suggestion. I modified the get_dataset function in run_language_modeling.py using datasets as explained here

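from datasets import load_dataset

# Load the text file as a memory-mapped dataset, one example per line.
dataset = load_dataset('text', data_files=file_path, split='train')
# Tokenize in batches, truncating each line to the block size.
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                         truncation=True, max_length=args.block_size),
    batched=True,
)
dataset.set_format(type='torch', columns=['input_ids'])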

It runs without error on TPU, but it is extremely slow. I used WikiText-103 to test the code: each training step takes less than 1 second with the original code, but more than 60 seconds with datasets. Am I missing something, or is this expected?