Training RoBERTa on a large corpus

proxyht · August 22, 2020, 2:07am

Hello. I’m trying to train a RoBERTa model on a 97GB corpus of text.

Should I tokenize the text on-the-fly, or should I precompute the tokens to reduce CPU load? (Let’s say I have 200-300GB of RAM.)

I did try tokenizing the text on-the-fly, but because of company resource restrictions (shared CPU and RAM), training is pretty slow. After some iterations, it freezes for a little bit and then continues, which makes training 4-5 times slower than normal.

If I have to precompute the data, is there any way to work with it efficiently? I ran an experiment where I tokenized the text and dumped it to a bunch of pickle files (total size 9GB), but when I tried to load them into memory to feed the trainer, it took about 80-90GB of RAM, which is unreasonably high.
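One plausible reason for that blow-up, assuming the pickle files hold plain Python lists of integer token ids (the post does not say how they were stored): every id in a Python list is a full int object plus a pointer, while a typed buffer needs only a few bytes per id. The sketch below compares the two on made-up ids; `list_footprint` and `token_ids` are illustrative names, and the exact numbers depend on the Python build.

```python
import sys
from array import array


def list_footprint(ids):
    """Approximate bytes held by a list of ints: the list itself plus each int object."""
    return sys.getsizeof(ids) + sum(sys.getsizeof(i) for i in ids)


# Hypothetical token ids; values above 256 so CPython's small-int cache does not apply.
token_ids = list(range(1000, 51000))

as_list = list_footprint(token_ids)
# array("i", ...) stores the same ids in one contiguous C buffer (typically 4 bytes each).
as_array = sys.getsizeof(array("i", token_ids))

ratio = as_list / as_array
print(f"list: {as_list} B, array: {as_array} B, ratio: {ratio:.1f}x")
```

If the ratio on a given machine is close to the 9GB vs 80-90GB gap, storing ids in typed arrays rather than lists of ints is worth trying before anything more elaborate.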

Thank you.

gmihaila · August 22, 2020, 11:40am

What I did in this case was split the data into multiple chunk files, each holding batch_size examples, and then have the dataset's __getitem__(chunk_file_id) behind the DataLoader load one chunk file at a time and feed it to the model. I can write some dummy code for clarification.
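A minimal sketch of this chunk-per-item idea, not the poster's actual code. It assumes each chunk is a pickle file containing a list of tokenized examples; `write_chunks`, `ChunkedDataset`, and the file naming are hypothetical. It uses only the standard library so it runs without PyTorch, but the class follows the map-style dataset protocol (`__len__` and `__getitem__`).

```python
import pickle
import tempfile
from pathlib import Path


def write_chunks(examples, batch_size, out_dir):
    """Split examples into files of batch_size items each and return the file paths."""
    out_dir = Path(out_dir)
    paths = []
    for start in range(0, len(examples), batch_size):
        path = out_dir / f"chunk_{start // batch_size:05d}.pkl"
        with open(path, "wb") as f:
            pickle.dump(examples[start:start + batch_size], f)
        paths.append(path)
    return paths


class ChunkedDataset:
    """Map-style dataset where item i is the i-th chunk file, read only when requested."""

    def __init__(self, chunk_paths):
        self.chunk_paths = list(chunk_paths)

    def __len__(self):
        return len(self.chunk_paths)

    def __getitem__(self, chunk_id):
        # Only this one chunk is loaded into memory.
        with open(self.chunk_paths[chunk_id], "rb") as f:
            return pickle.load(f)


# Toy data standing in for tokenized examples.
tmp_dir = tempfile.mkdtemp()
examples = [[i, i + 1, i + 2] for i in range(10)]
chunk_dataset = ChunkedDataset(write_chunks(examples, batch_size=4, out_dir=tmp_dir))
print(len(chunk_dataset), chunk_dataset[0])
```

With PyTorch, such a dataset could be wrapped as torch.utils.data.DataLoader(chunk_dataset, batch_size=None, shuffle=True), where batch_size=None turns off automatic batching so that each step yields one whole chunk. Padding and conversion to tensors would still have to be added.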

valhalla · August 23, 2020, 5:07pm

nlp is what you are looking for. It’s the new datasets library from HF, which allows you to memory-map huge datasets and load them lazily. It also caches all the pre-processing you do, so the next time you load the data with the same processing, it just uses the cache instead of recomputing.

Just to show how efficient this is, it can load the 17 GB+ English Wikipedia dataset using just 9 MB of RAM.
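To illustrate why memory mapping keeps RAM usage low, here is a rough sketch that does not use nlp at all. The library relies on Apache Arrow files; this stand-in uses the standard-library mmap module on a flat binary file of token ids. The file layout, `write_flat`, and `MemoryMappedTokens` are assumptions made up for the example.

```python
import mmap
import os
import struct
import tempfile

ID_SIZE = 4  # bytes per token id, stored as little-endian uint32


def write_flat(sequences, path):
    """Write all sequences back to back and return the cumulative id offsets."""
    offsets = [0]
    with open(path, "wb") as f:
        for seq in sequences:
            f.write(struct.pack(f"<{len(seq)}I", *seq))
            offsets.append(offsets[-1] + len(seq))
    return offsets


class MemoryMappedTokens:
    """Reads one sequence at a time from a memory-mapped file."""

    def __init__(self, path, offsets):
        self._file = open(path, "rb")
        # The OS pages data in on demand; the file is not read up front.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        start, end = self._offsets[i], self._offsets[i + 1]
        raw = self._mm[start * ID_SIZE:end * ID_SIZE]
        return list(struct.unpack(f"<{end - start}I", raw))

    def close(self):
        self._mm.close()
        self._file.close()


# Toy sequences standing in for tokenized lines.
sequences = [[0, 31414, 232, 2], [0, 9226, 2], [0, 2]]
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
tokens = MemoryMappedTokens(path, write_flat(sequences, path))
second = tokens[1]
n_sequences = len(tokens)
tokens.close()
print(n_sequences, second)
```

Only the bytes of the requested sequence are copied into Python objects, so resident memory stays small regardless of the file size.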

go-inoue · August 25, 2020, 2:44pm

Hi,

I tried nlp but I got an error when I used it on a TPU node (https://github.com/huggingface/nlp/issues/532). Does the way I use it seem fine?

I modified line 131 in the original run_language_modeling.py as follows:

# line 131: return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                        truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
return dataset

donal · August 25, 2020, 3:28pm

Hey, this isn’t working for me. Any time I try to load the Wikipedia dataset, it takes 17 GB of RAM.

I tried a few variations, such as moving the split argument or using a different config, but no luck: it’s still too large. It’s strange, because with bookcorpus it works fine.

code:

import os

import psutil
from nlp import load_dataset

# Resident set size in MB before and after loading the dataset
mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20
dataset_wiki = load_dataset("wikipedia", "20200501.en")["train"]
mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20
print(f"RAM memory used: {(mem_after - mem_before)} MB")

valhalla · August 25, 2020, 3:33pm

Pinging @lhoestq
