Training RoBERTa on a large corpus

Hello. I’m trying to train a RoBERTa model on a 97GB corpus of text.

Should I tokenize the text on the fly, or should I precompute the tokens to reduce CPU load? (Let’s say I have 200-300GB of RAM.)

I tried tokenizing the text on the fly, but because of my company’s shared-resource restrictions (shared CPU and RAM), training is pretty slow. After some iterations, it freezes for a little while and then continues. This makes training 4-5 times slower than normal.

If I have to precompute the data, is there a way to work with it efficiently? I ran an experiment where I tokenized the text and dumped it to a bunch of pickle files (total size 9GB), but when I tried to load them into memory to feed to the trainer, it took about 80-90GB of RAM, which is unreasonably high.

Thank you.

What I did in this case was split the data into multiple chunk files, each holding batch_size examples, and then have the DataLoader call __getitem__(chunk_file_id) so that one chunk file at a time is loaded and fed to the model. I can write some dummy code for clarification.
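
A minimal sketch of that chunking idea, under some assumptions: the class name `ChunkedDataset`, the helper `write_chunks`, and the pickle file layout are hypothetical and only for illustration. It uses plain Python so it runs without PyTorch; the class follows the map-style protocol (`__len__` and `__getitem__`) that a `torch.utils.data.DataLoader` expects, so it could presumably be passed to one with `batch_size=None`, since each item is already a full batch.

```python
import os
import pickle
import tempfile


class ChunkedDataset:
    """Map-style dataset where each item is one pre-tokenized chunk file."""

    def __init__(self, chunk_paths):
        self.chunk_paths = list(chunk_paths)

    def __len__(self):
        # One item per chunk file, not per example.
        return len(self.chunk_paths)

    def __getitem__(self, chunk_file_id):
        # Only the requested chunk is read from disk, so memory use stays
        # around the size of one chunk instead of the whole corpus.
        with open(self.chunk_paths[chunk_file_id], "rb") as f:
            return pickle.load(f)


def write_chunks(examples, batch_size, out_dir):
    """Split `examples` into chunks of `batch_size` and pickle each one."""
    paths = []
    for start in range(0, len(examples), batch_size):
        path = os.path.join(out_dir, f"chunk_{start // batch_size:06d}.pkl")
        with open(path, "wb") as f:
            pickle.dump(examples[start:start + batch_size], f)
        paths.append(path)
    return paths


# Toy stand-in for tokenized text: each example is a short list of token ids.
examples = [[i, i + 1, i + 2] for i in range(10)]
out_dir = tempfile.mkdtemp()
dataset = ChunkedDataset(write_chunks(examples, batch_size=4, out_dir=out_dir))

first_batch = dataset[0]  # loads only chunk_000000.pkl
```

The trade-off is one file read per batch, so the chunks should be large enough that disk access does not become the bottleneck.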

nlp is what you are looking for :slight_smile:. It’s the new datasets library from HF, which lets you memory-map huge datasets and load them lazily. It also caches all the pre-processing you do, so the next time you load the data with the same processing, it uses the cache instead of recomputing.

Just to show how efficient this is: it can load the 17 GB+ English Wikipedia dataset using just 9 MB of RAM. :fire:
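
To give a rough feel for why memory mapping keeps RAM usage low, here is a small sketch using only Python's standard `mmap` module. This is not how nlp is implemented (the library memory-maps Apache Arrow files); the int32 file format and the helpers `write_tokens` and `read_tokens` are assumptions made up for the illustration.

```python
import mmap
import os
import struct
import tempfile

TOKEN_FMT = "<i"  # assumption: each token id is stored as a little-endian int32
TOKEN_SIZE = struct.calcsize(TOKEN_FMT)


def write_tokens(path, token_ids):
    """Write token ids to a flat binary file, one int32 per token."""
    with open(path, "wb") as f:
        for token_id in token_ids:
            f.write(struct.pack(TOKEN_FMT, token_id))


def read_tokens(mm, start, stop):
    """Decode token ids in [start, stop) from a memory-mapped file."""
    # Slicing the map copies only the requested bytes; the operating system
    # pages in just the parts of the file that are touched.
    data = mm[start * TOKEN_SIZE:stop * TOKEN_SIZE]
    return [fields[0] for fields in struct.iter_unpack(TOKEN_FMT, data)]


path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
write_tokens(path, range(1000))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    window = read_tokens(mm, 500, 505)  # reads 20 bytes, not the whole file
    mm.close()
```

The same principle lets a dataset much larger than RAM be indexed like an in-memory list, with the file on disk acting as the backing store.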


Hi,

I tried nlp but I got an error when I used it on a TPU node (https://github.com/huggingface/nlp/issues/532). Does the way I use it seem fine?

I modified line 131 in the original run_language_modeling.py as follows:

# line 131: return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                        truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
return dataset

Hey, this isn’t working for me. Any time I try to load Wikipedia, it takes 17 GB of RAM.

I tried a few variations, such as moving the split argument or using a different config, with no luck; it’s still too large. It’s strange, because with bookcorpus it works fine.

code:

import os
import psutil
from nlp import load_dataset

# Resident set size of this process in MB, before and after loading
mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20
dataset_wiki = load_dataset("wikipedia", "20200501.en")["train"]
mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20
print(f"RAM memory used: {(mem_after - mem_before)} MB")

Pinging @lhoestq