Hello. I'm trying to train a RoBERTa model on a 97GB corpus of text.
Should I tokenize the text on the fly, or should I precompute the tokens to reduce CPU load? (Let's say I have 200-300GB of RAM.)
I did try tokenizing the text on the fly, but due to company shared-resource restrictions (shared CPU and RAM), the training speed is pretty slow. After some iterations it freezes for a little bit and then continues, which slows training down by about 4-5x compared to normal.
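To make the setup concrete, here is a simplified sketch of what I mean by tokenizing on the fly (the class and field names are just illustrative, not my actual code):

```python
from torch.utils.data import Dataset


class OnTheFlyDataset(Dataset):
    """Tokenizes raw text inside __getitem__, i.e. on every access."""

    def __init__(self, lines, tokenizer, max_length=512):
        self.lines = lines          # raw text lines
        self.tokenizer = tokenizer  # e.g. a RobertaTokenizerFast instance
        self.max_length = max_length

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # The tokenizer runs here for every sample, so each training step
        # competes for the shared CPU.
        return self.tokenizer(
            self.lines[idx], truncation=True, max_length=self.max_length
        )
```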
If I have to precompute the data, is there an efficient way to work with it? In an experiment I tokenized the text and dumped it to a bunch of pickle files (total size 9GB), but when I tried to load them into memory to feed through the trainer, it cost about 80-90GB of RAM, which is unreasonably high.
What I did in this case was to split the data into multiple chunks of size batch_size, and then use a custom Dataset whose __getitem__ loads one chunk file at a time (indexed by chunk_file_id) and feeds it to the model through a DataLoader. Some dummy code for clarification is below.
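A minimal sketch of that chunked setup, assuming each pickle file already holds one batch_size-sized batch of token ids (the file layout and field names are just placeholders):

```python
import glob
import pickle

import torch
from torch.utils.data import DataLoader, Dataset


class ChunkedPickleDataset(Dataset):
    """One item == one pre-tokenized chunk file of batch_size examples."""

    def __init__(self, chunk_dir):
        # Chunks written beforehand as chunk_00000.pkl, chunk_00001.pkl, ...
        self.chunk_files = sorted(glob.glob(f"{chunk_dir}/chunk_*.pkl"))

    def __len__(self):
        return len(self.chunk_files)

    def __getitem__(self, chunk_file_id):
        # Only this one chunk is resident in memory at a time.
        with open(self.chunk_files[chunk_file_id], "rb") as f:
            chunk = pickle.load(f)  # e.g. {"input_ids": [...], "attention_mask": [...]}
        return {k: torch.tensor(v) for k, v in chunk.items()}


# batch_size=None disables automatic batching, since each item is already a full batch.
loader = DataLoader(ChunkedPickleDataset("tokenized_chunks"), batch_size=None, shuffle=True)
```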
nlp is what you are looking for. It's the new datasets library from HF, which lets you memory-map huge datasets and load them lazily. It also caches all the pre-processing you do, so the next time you load the data with the same processing it just uses the cache instead of recomputing.
Just to see how efficient this is: it can load the 17 GB+ English Wikipedia dataset in just 9 MB of RAM.
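For context, a minimal sketch of that workflow (the library has since been renamed `datasets`, but `load_dataset` / `map` work the same way; the file paths and tokenizer checkpoint below are placeholders):

```python
from datasets import load_dataset
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# The text files are loaded as a memory-mapped Arrow table, not into RAM.
dataset = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


# map() writes its result to an on-disk cache; rerunning with the same
# function and arguments reuses the cache instead of re-tokenizing.
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
```

The resulting dataset can then be passed to the Trainer (with an MLM data collator) while staying memory-mapped on disk.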