Pre-training a language model on a large dataset

go-inoue · August 20, 2020, 4:15pm

Hi,

I’m getting a memory error when I run the example code for language modeling. I’m interested in pre-training a RoBERTa model on 25 GB of text data, using a virtual machine with a v3-8 TPU on Google Cloud Platform.

I’m using the following command with transformers/examples/xla_spawn.py and transformers/examples/run_language_modeling.py.

python xla_spawn.py --num_cores 8 \
run_language_modeling.py \
    --output_dir=[*****] \
    --config_name=[*****] \
    --tokenizer_name=[*****] \
    --do_train \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 128 \
    --learning_rate 6e-4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --max_steps 500_000 \
    --warmup_steps 24_000 \
    --save_total_limit 5 \
    --save_steps=100_000 \
    --block_size=512 \
    --train_data_file=[*****] \
    --mlm \
    --line_by_line
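
For reference, the effective batch size implied by these flags can be worked out as below. This is a small arithmetic sketch, assuming the Trainer's usual rule that the total batch per optimizer step is the per-device batch size times the number of devices times the gradient accumulation steps; the variable names are illustrative only.

```python
# Values taken from the command above.
per_device_train_batch_size = 8
num_cores = 8                      # --num_cores 8 on a v3-8 TPU
gradient_accumulation_steps = 128
block_size = 512                   # maximum tokens per sequence

# Sequences consumed per optimizer step (assumed rule: product of the three).
effective_batch_size = (
    per_device_train_batch_size * num_cores * gradient_accumulation_steps
)
# Upper bound on tokens per optimizer step, if every sequence is full length.
max_tokens_per_step = effective_batch_size * block_size

print(effective_batch_size)   # 8192
print(max_tokens_per_step)    # 4194304
```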

When I run this, I get the following error.

08/20/2020 15:21:07 - INFO - transformers.data.datasets.language_modeling -   Creating features from dataset file at [*****]
Traceback (most recent call last):
  File "xla_spawn.py", line 72, in <module>
    main()
  File "xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

It looks like the script gets killed while it’s loading the training data here.

with open(file_path, encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

When I run the above block of code separately with transformers/examples/xla_spawn.py, I get the same error.

Traceback (most recent call last):
  File "xla_spawn.py", line 72, in <module>
    main()
  File "xla_spawn.py", line 68, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

When I run the above block of code separately using n1-highmem-16 (16 vCPUs, 104 GB memory) without TPU, I still get an error.

Traceback (most recent call last):
  File "debug_load.py", line 7, in <module>
    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
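
The MemoryError is raised inside f.read(), which builds the entire file as one string before splitlines() creates a second copy as a list. A minimal sketch of a streaming alternative follows; the helper name iter_nonempty_lines is hypothetical and not part of transformers. It iterates over the file object so that only one line is held at a time. Note that line boundaries may differ slightly from str.splitlines(), which also splits on some less common separators.

```python
from typing import Iterator


def iter_nonempty_lines(file_path: str) -> Iterator[str]:
    """Yield non-blank lines one at a time instead of reading the whole file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:  # the file object is read lazily, line by line
            line = line.rstrip("\n")
            # Same filter as the original list comprehension.
            if len(line) > 0 and not line.isspace():
                yield line
```

Collecting the generator into a list would bring the memory problem back, so this only helps if the consumer also processes lines incrementally.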

Has anyone successfully reproduced the original RoBERTa model or pre-trained a language model on a large dataset using Hugging Face’s transformers (with TPU)? If so, what are the specifications of your machine? Has this code (transformers/examples/run_language_modeling.py) been tested on a large dataset?

user123 · October 15, 2020, 5:07pm

Same problem here. It seems that run_language_modeling.py cannot handle very large files. Any help? @valhalla @lhoestq
Thanks

valhalla · October 16, 2020, 5:27pm

Hi @user123
If you have a large dataset, you’ll need to write your own dataset class that lazy-loads examples. Also consider using the datasets library. It lets you memory-map the dataset and cache the processed data: memory mapping keeps RAM usage low, and caching lets you reuse the processed dataset.
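
One way the lazy-loading idea could look is sketched below. It is an illustration under assumptions, not code from transformers: the class name LazyLineDataset is hypothetical, and tokenization is left out (it would happen per example, for instance in a collate function). The class scans the file once to record the byte offset of each non-blank line, then reads a single line from disk on each access. It defines only __len__ and __getitem__, so it should be usable wherever a map-style PyTorch dataset is expected.

```python
from array import array


class LazyLineDataset:
    """Map-style dataset that keeps only byte offsets in RAM, not the text."""

    def __init__(self, file_path: str):
        self.file_path = file_path
        self.offsets = array("Q")  # one unsigned 64-bit offset per kept line
        with open(file_path, "rb") as f:
            pos = 0
            for raw in f:              # one pass over the file, line by line
                if raw.strip():        # skip empty and whitespace-only lines
                    self.offsets.append(pos)
                pos += len(raw)
        self._fh = None                # opened lazily, once per process

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, idx: int) -> str:
        if self._fh is None:
            self._fh = open(self.file_path, "rb")
        self._fh.seek(self.offsets[idx])
        return self._fh.readline().decode("utf-8").rstrip("\r\n")
```

The index costs 8 bytes per line, so even a file with hundreds of millions of lines needs only a few gigabytes of RAM for the offsets, at the price of one disk seek per example.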

user123 · October 21, 2020, 5:00pm

Hi @valhalla
Thanks for your suggestion. I modified the get_dataset function in run_language_modeling.py using datasets as explained here

from datasets import load_dataset

dataset = load_dataset('text', data_files=file_path, split='train')
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                         truncation=True, max_length=args.block_size),
    batched=True,
)
dataset.set_format(type='torch', columns=['input_ids'])

It runs without error on TPU, but it is very slow. I used WikiText-103 to test the code. Each training step takes less than 1 second with the original code, but more than 60 seconds with datasets. Am I missing something, or is this expected?

clk · January 14, 2021, 5:35pm

Have you solved this problem? I think I’ve run into the same problem as you…

deathcrush · March 15, 2022, 8:13am

@valhalla is this expected? It seems to be a performance issue. What is the go-to method for training a model on a large amount of data over multiple GPUs? Will Hugging Face datasets work with DDP “out of the box”?
