How to train a language model from scratch when my dataset is bigger than RAM?

I intend to train a language model from scratch. I have been following the tutorial on Hugging Face’s blog. However, when I tried to use their Google Colab code, I hit a memory exceeded error. Their Google Colab code uses LineByLineTextDataset. After digging through the source code, it turns out that LineByLineTextDataset loads the entire dataset eagerly. No wonder I had a memory exceeded error: my dataset is larger than my RAM capacity.

The article hints that if my dataset is bigger than my capacity, I could “opt to load and tokenize examples on the fly, rather than as a preprocessing step.” However, I’m not certain how to achieve this. I would be very grateful if someone could point me in the right direction, especially if I can still use transformers.Trainer.

Do you want to use TensorFlow or PyTorch for the training?

Hi @Barik Here are a few suggestions:

  1. You can modify LineByLineTextDataset to load one line at a time instead of loading the whole file, so that your __getitem__ returns a single text example.
  2. At this line the examples are encoded into vectors. You can disable this and move the encoding to the collate_batch function of DataCollatorForLanguageModeling.
  3. In the collate function you would then receive a List[str] instead of a List[torch.Tensor], so take the list of text examples, encode them and then do the masking (see the sketch below).
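
A minimal sketch of points 1 and 3, assuming PyTorch (the LazyLineDataset class, the offset index and the tokenizer settings are my own illustrative choices, not part of the original tutorial):

from torch.utils.data import Dataset

class LazyLineDataset(Dataset):
    """Indexes line offsets once, then reads single lines from disk on demand."""
    def __init__(self, file_path):
        self.file_path = file_path
        self.offsets = []
        with open(file_path, "rb") as f:
            offset = 0
            for line in f:
                if line.strip():
                    self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Return the raw text; encoding happens later in the collate function.
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").strip()

def collate_batch(texts, tokenizer):
    # Receives a List[str], encodes it, and returns tensors ready for masking.
    return tokenizer(texts, padding=True, truncation=True,
                     max_length=128, return_tensors="pt")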

I think this will slow down the training, but you can try. Hope this helps.

Assuming you are using PyTorch :smile:

4 Likes

I’m interested in PyTorch

1 Like

To be honest, I am not certain. The blog post hints that we do not need TensorFlow, and I successfully reproduced the code after uninstalling TensorFlow in Google Colab (hinting that we’re using PyTorch). However, when I try to reproduce the code in Kaggle, it only works if I do not uninstall TensorFlow. Maybe that’s because the Kaggle version was reproduced after HF 3.0 was rolled out and something internal has changed that now requires TensorFlow.

That’s a helpful pointer. I do have a question: do we know how the Trainer normally accesses the dataset’s __getitem__? If I know it is accessed incrementally from index=0 until the last index, then maybe I can cache 1 million sentences at a time. But if it is accessed randomly, the operation will be very expensive I/O-wise, since I would have to constantly move the file pointer.
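
A quick way to check the access pattern (a throwaway sketch I haven’t run against the Trainer itself):

from torch.utils.data import DataLoader, Dataset

class IndexProbe(Dataset):
    """Tiny dataset that just logs which index is requested."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        print(idx)
        return idx

# shuffle=True draws indices in random order; shuffle=False goes 0, 1, 2, ...
list(DataLoader(IndexProbe(), batch_size=4, shuffle=True))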

@Batik so the easiest way to overcome all the RAM limitations will likely be to use the new :hugs:nlp library, which is specifically developed for that.

Jared, a community contributor, is actually adding a loader for text datasets right now, which is probably exactly what you’ll need. You can follow the PR here: https://github.com/huggingface/nlp/pull/356

I’ll ping him as well (he’s not yet on the forum I think).

2 Likes

Ran into the same issue as you: TF datasets are greedy by default unless you use tf.data.Dataset.from_generator(), but that can cause performance issues if you’re not careful. I recently opened a PR to the huggingface/nlp library which maps a .txt file into sharded Apache Arrow files, which can then be read lazily from disk. So after everything gets merged, you could do something like:

import tensorflow as tf
from nlp import load_dataset

dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
dset.set_format("tensorflow", columns=["text"])

def dataset_gen():
    for ex in dset:
        yield ex

tf_dataset = tf.data.Dataset.from_generator(dataset_gen, output_types={"text": tf.string})

I haven’t tested this exact code, but you get the gist. This should be relatively equivalent to LineByLineTextDataset. Follow https://github.com/huggingface/nlp/issues/315 for more info about dataset efficiency.

To add to @jaredtnielsen’s answer: I think you shouldn’t have any memory issues if you are using PyTorch + :hugs:nlp. Loading will be fully lazy from the drive, with almost nothing in RAM (just your current batch, plus the model of course).
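
For a rough idea of the PyTorch side (an untested sketch; the file path, model name and tokenizer settings below are placeholders, not something prescribed by the library):

from nlp import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Rows live on disk in Arrow files; only the current batch is read into RAM.
dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
dset = dset.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     padding="max_length", max_length=128))
dset.set_format("torch", columns=["input_ids", "attention_mask"])

# The formatted dataset can be wrapped in a DataLoader, or passed to Trainer
# as train_dataset together with DataCollatorForLanguageModeling.
loader = DataLoader(dset, batch_size=8)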

2 Likes

Good to know!

I am going to train a RoBERTa-like model on almost 200 GB of Spanish corpus. I will use PyTorch + :hugs:nlp. Wish me luck! :wink:

2 Likes

Best of luck, @mrm8488! Would you be able to share more details on how you did so with the huggingface/NLP library?

I am on it. When I finish, I will share the whole process.

1 Like

Please share ASAP

Please do, thanks! :slight_smile:

Hello! I just wonder if there is an example of training an MLM with TensorFlow + TPU. I don’t want to train from scratch, but rather train for some additional steps on custom data from an existing model. Thank you.

Any updates on this?

@mrm8488 we are eagerly waiting for your experiments :star_struck:

Thanks, I’m having the same problem.

BTW, I have a question.
What is the best practice for training a model from scratch with large datasets like the entire Wikipedia?

Tweak LineByLineTextDataset?
Or use other Dataset classes?
Or just use a more powerful machine?

Hi guys, you can check the core idea under the hood in this discussion: https://github.com/huggingface/datasets/issues/610#issuecomment-691672919
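
In short, the Arrow files backing a :hugs: dataset are memory-mapped, so iterating over a corpus far larger than RAM keeps resident memory roughly constant. A rough way to see this for yourself (my own sketch, assuming the renamed datasets library and psutil; not code from the linked discussion):

import os
import psutil
from datasets import load_dataset

dset = load_dataset("text", data_files="/path/to/file.txt")["train"]

process = psutil.Process(os.getpid())
print(f"RSS before: {process.memory_info().rss >> 20} MB")
for start in range(0, len(dset), 1000):
    _ = dset[start:start + 1000]   # rows come from the memory-mapped Arrow file
print(f"RSS after:  {process.memory_info().rss >> 20} MB")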

2 Likes