How to train a language model from scratch when my dataset is bigger than RAM?

KerenzaDoxolodeo · July 8, 2020, 5:19pm

I intend to train a language model from scratch. I have been following the tutorial on Hugging Face’s blog. However, when I tried to run the code from their Google Colab notebook, I got an out-of-memory error. The notebook uses LineByLineTextDataset. After digging through the source code, I found that LineByLineTextDataset loads the entire dataset eagerly. No wonder I ran out of memory: my dataset is larger than my RAM.

The article hints that if my dataset is bigger than my capacity, I could “opt to load and tokenize examples on the fly, rather than as a preprocessing step.” However, I’m not certain how to achieve this. I would be very grateful if someone could point me in the right direction, especially if I can still use transformers.Trainer.

thomwolf · July 8, 2020, 6:41pm

Do you want to use TensorFlow or PyTorch for the training?

valhalla · July 8, 2020, 6:42pm

Hi @Barik, here are a few suggestions:

You can modify LineByLineTextDataset to load one line at a time instead of loading the whole file. So basically your __getitem__ should return a single text example.
LineByLineTextDataset encodes the examples into token IDs up front. You can disable this and move the encoding to the collate_batch function of DataCollatorForLanguageModeling.
The collate function can then receive a List[str] instead of a List[torch.Tensor], so take the list of text examples, encode them, and then do the masking.

I think this will slow down the training, but you can try. Hope this helps.

This assumes you are using PyTorch.
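The suggestions above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: LazyLineDataset and collate_texts are hypothetical names, and encode is a stand-in for whatever tokenizer call you use. The dataset keeps only one byte offset per line in memory and reads the text from disk on demand. Because it implements __len__ and __getitem__, it should be usable as a map-style dataset. The masking step would follow the encoding inside the real data collator and is left out here.

```python
from array import array


class LazyLineDataset:
    """Map-style dataset: stores one byte offset per line, never the text itself."""

    def __init__(self, path):
        self.path = path
        self.offsets = array("Q")  # unsigned 64-bit offsets, 8 bytes per line
        with open(path, "rb") as f:
            position = 0
            for raw in f:
                if raw.strip():  # skip blank lines
                    self.offsets.append(position)
                position += len(raw)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek to the stored offset and read exactly one line from disk.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\r\n")


def collate_texts(texts, encode, pad_id=0):
    """Encode raw strings at batch time, then pad to the longest sequence."""
    # `encode` is an assumed callable mapping a string to a list of token IDs.
    encoded = [encode(text) for text in texts]
    longest = max(len(ids) for ids in encoded)
    return [ids + [pad_id] * (longest - len(ids)) for ids in encoded]
```

Building the offset index still requires one full pass over the file, but that pass holds only a single line in memory at a time.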

gerardo · July 8, 2020, 7:10pm

I’m interested in pytorch

KerenzaDoxolodeo · July 8, 2020, 7:42pm

To be honest, I am not certain. The blog post suggests that TensorFlow is not needed, and I successfully reproduced the code after uninstalling TensorFlow in Google Colab, which suggests we are using PyTorch. However, when I tried to reproduce the code on Kaggle, it only worked if I did not uninstall TensorFlow. Maybe that is because I ran the Kaggle version after HF 3.0 was released, and something changed internally that now requires TensorFlow.

KerenzaDoxolodeo · July 8, 2020, 7:45pm

That’s a helpful pointer. I do have a question: do we know how the Trainer normally accesses the dataset’s __getitem__? If it is accessed incrementally from index 0 to the last index, then maybe I can cache 1 million sentences at a time. But if it is accessed randomly, the operation will be very expensive I/O-wise, since I would have to constantly move the file pointer.
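On the access-order question: as far as I know, the Trainer draws training indices in random order for a map-style dataset, so a purely sequential cache would not help by itself. One possible compromise, sketched here under that assumption and not something the Trainer provides out of the box, is to shuffle at two levels: shuffle the order of large chunks, then shuffle inside each chunk. Each chunk can then be read from disk in one go. The function name chunked_shuffle_indices and its parameters are hypothetical.

```python
import random


def chunked_shuffle_indices(num_examples, chunk_size, seed=0):
    """Yield every index once: chunks in random order, shuffled within each chunk."""
    rng = random.Random(seed)
    chunk_starts = list(range(0, num_examples, chunk_size))
    rng.shuffle(chunk_starts)  # randomize which chunk is visited next
    for start in chunk_starts:
        chunk = list(range(start, min(start + chunk_size, num_examples)))
        rng.shuffle(chunk)  # randomize order inside the cached chunk
        yield from chunk
```

The trade-off is that examples from the same chunk always land close together in training order, so this is only approximately a full shuffle. A larger chunk_size gets closer to one at the cost of more memory.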

thomwolf · July 8, 2020, 7:50pm

@Batik so the easiest way to overcome all the RAM limitations will likely be to use the new nlp library, which was developed specifically for that.

Jared, a community contributor, is actually adding a loader for text datasets right now, which is probably exactly what you’ll need. You can follow the PR here: https://github.com/huggingface/nlp/pull/356

I’ll ping him as well (he’s not yet on the forum I think).

jaredtnielsen · July 8, 2020, 8:06pm

Ran into the same issue as you - TF datasets are greedy by default unless you use tf.data.Dataset.from_generator(), but that can cause performance issues if you’re not careful. I recently opened a PR to the huggingface/nlp library which maps a .txt file into sharded Apache Arrow formats, which can then be read lazily from disk. So after everything gets merged, you could do something like:

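import tensorflow as tf
from nlp import load_dataset

# Load the text file through Arrow so it is read lazily from disk.
dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
dset.set_format("tensorflow", columns=["text"])

def dataset_gen():
    # Yield one example at a time instead of materializing the whole dataset.
    for ex in dset:
        yield ex

tf_dataset = tf.data.Dataset.from_generator(dataset_gen, output_types={"text": tf.string})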

I haven’t tested this exact code, but you get the gist. This should be roughly equivalent to LineByLineTextDataset. Follow https://github.com/huggingface/nlp/issues/315 for more info about dataset efficiency.

thomwolf · July 8, 2020, 8:13pm

To add to @jaredtnielsen’s answer: I think you shouldn’t have any memory issues if you are using PyTorch + nlp. It will be fully lazy, loading from the drive with almost nothing in RAM (just your current batch, plus the model of course).
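For the PyTorch route, a minimal sketch might look like the following. It assumes the library is installed under its later name, datasets (it was published as nlp at the time of this thread), and that tokenize_batch is a function you supply that returns input_ids and attention_mask for a batch. The helper name build_lazy_text_dataset is hypothetical, and this has not been run against any particular library version.

```python
def build_lazy_text_dataset(path, tokenize_batch):
    """Return an Arrow-backed dataset of tokenized lines, read lazily from disk."""
    # Imported here so the sketch can be loaded without the library installed.
    from datasets import load_dataset  # was `from nlp import load_dataset`

    # The "text" loader yields one example per line in a "text" column.
    dset = load_dataset("text", data_files={"train": path})["train"]

    # Tokenize in batches; results are written to an on-disk Arrow cache.
    dset = dset.map(tokenize_batch, batched=True, remove_columns=["text"])

    # Return PyTorch tensors for the columns the model needs.
    dset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    return dset
```

With a transformers tokenizer, tokenize_batch could be something like lambda batch: tokenizer(batch["text"], truncation=True, max_length=128), and the returned dataset could then be passed to the Trainer as train_dataset together with DataCollatorForLanguageModeling.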

mrm8488 · July 9, 2020, 12:16am

Good to know!

mrm8488 · July 31, 2020, 6:14pm

I am going to train a RoBERTa-like model on an almost 200 GB Spanish corpus. I will use PyTorch + nlp. Wish me luck!

seyonec · September 4, 2020, 5:16pm

Best of luck, @mrm8488! Would you be able to share more details on how you did so with the huggingface/nlp library?

mrm8488 · September 7, 2020, 8:26am

I am on that. When I finish it I will share the whole process.

donal · September 10, 2020, 8:19pm

Please share ASAP

seyonec · September 11, 2020, 3:01am

Please do, thanks!

aidad · September 11, 2020, 9:01pm

Hello! I just wonder if there is an example for training a MLM with TensorFlow + TPU. I don’t want to train from scratch, but rather train for some additional steps on custom data from an existing model. Thank you.

seyonec · September 16, 2020, 5:06am

Any updates on this?

valhalla · September 16, 2020, 12:21pm

@mrm8488 we are eagerly waiting for your experiments

kouohhashi · September 17, 2020, 6:08am

Thanks, I’m having the same problem.

BTW, I have a question.
What is the best practice for training a model from scratch with large datasets like the entire Wikipedia?

Tweak LineByLineTextDataset?
Or use other Dataset classes?
Or just use a more powerful machine?

mrm8488 · September 18, 2020, 3:24pm

Hi guys, you can check the core idea under the hood in this discussion: https://github.com/huggingface/datasets/issues/610#issuecomment-691672919
