How to train a language model from scratch when my dataset is bigger than RAM?

I intend to train a language model from scratch and have been following the tutorial on Hugging Face’s blog. However, when I ran their Google Colab code, I got a memory-exceeded error. The Colab code uses LineByLineTextDataset, and after digging through the source code, it turns out that LineByLineTextDataset loads the entire dataset eagerly. No wonder I ran out of memory: my dataset is larger than my RAM capacity.

The article hints that if my dataset is bigger than my RAM, I could “opt to load and tokenize examples on the fly, rather than as a preprocessing step.” However, I’m not certain how to achieve this. I would be very grateful if someone could point me in the right direction, especially if I can still use transformers.Trainer.

Do you want to use TensorFlow or PyTorch for the training?

Hi @Barik Here are a few suggestions:

  1. You can modify LineByLineTextDataset to load one line at a time instead of reading the whole file into memory. Basically, your __getitem__ should return a single text example.
  2. At this line the examples are encoded into vectors. You can disable this and move the encoding into the collate_batch function of DataCollatorForLanguageModeling.
  3. The collate function will then receive a List[str] instead of a List[torch.Tensor], so take the list of text examples, encode them, and then do the masking.

I think this will slow down the training, but you can try. Hope this helps.

Assuming you are using PyTorch :smile:
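The three steps above could be sketched roughly like this (untested, stdlib-only; the class name and the byte-offset index are my own illustration, not the actual LineByLineTextDataset code, and in practice you would subclass torch.utils.data.Dataset and do the tokenizing plus masking in the collator):

```python
import os
import tempfile


class LazyLineDataset:
    """Step 1: return one raw line per index instead of preloading the file.

    An index of byte offsets is built once (a single pass over the file,
    storing only integers), so __getitem__ can seek straight to line i.
    In real use this would subclass torch.utils.data.Dataset; it is kept
    dependency-free here.
    """

    def __init__(self, file_path):
        self.file_path = file_path
        self.offsets = [0]
        with open(file_path, "rb") as f:
            for _ in f:
                self.offsets.append(f.tell())
        self.offsets.pop()  # drop the final offset, which points at EOF

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        # Steps 2-3: return raw text; encoding/masking happens later,
        # in the collate function, one batch at a time.
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode("utf-8").rstrip("\n")


# tiny demo on a temporary three-line file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

ds = LazyLineDataset(path)
n, example = len(ds), ds[1]
os.unlink(path)
```

The collator side would then take the List[str] batch, encode it with the tokenizer, and apply the masking, as in step 3 above.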


I’m interested in PyTorch


To be honest, I am not certain. The blog post hints that we do not need TensorFlow, and I successfully reproduced the code after uninstalling TensorFlow in Google Colab (hinting that we’re using PyTorch). However, when I tried to reproduce the code on Kaggle, it only worked if I did not uninstall TensorFlow. Maybe that’s because the Kaggle version was reproduced after HF 3.0 rolled out, and something internal has changed that now requires TensorFlow.

That’s a helpful pointer. I do have a question: do we know how the dataset’s __getitem__ is normally accessed by the Trainer? If it is accessed incrementally from index 0 to the last index, then maybe I can cache 1 million sentences at a time. But if it is accessed randomly, the operation will be very expensive I/O-wise, since I would have to constantly move the file pointer.
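One way to check the access pattern empirically is to wrap a dataset and record the indices it is asked for (a stdlib-only sketch; the wrapper name is illustrative, not part of any library). For what it’s worth, a training DataLoader typically shuffles by default, so random access is the case to plan for:

```python
import random


class AccessLogger:
    """Wraps any map-style dataset and records the order in which
    __getitem__ is called, to see whether a sampler reads sequentially
    or randomly."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.accessed = []

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        self.accessed.append(i)
        return self.dataset[i]


logged = AccessLogger(["a", "b", "c", "d"])
# Simulate a shuffling sampler, which is what a training DataLoader
# usually uses for the train split.
for i in random.sample(range(len(logged)), len(logged)):
    _ = logged[i]
```

Every index is visited exactly once per epoch, just in shuffled order, which is why a seek-based reader (rather than a sequential cache) is the safer design.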

@Barik so the easiest way to overcome all the RAM limitations will likely be to use the new :hugs:nlp library, which is specifically developed for that.

Jared, a community contributor, is actually adding a loader for text datasets right now, which is probably exactly what you’ll need. You can follow the PR here:

I’ll ping him as well (he’s not yet on the forum I think).


Ran into the same issue as you: TF datasets load everything greedily by default, and that can cause performance issues if you’re not careful. I recently opened a PR to the huggingface/nlp library which maps a .txt file into sharded Apache Arrow files, which can then be read lazily from disk. So after everything gets merged, you could do something like:

import tensorflow as tf
from nlp import load_dataset

dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
dset.set_format("tensorflow", columns=["text"])

def dataset_gen():
    for ex in dset:
        yield ex

tf_dataset = tf.data.Dataset.from_generator(dataset_gen, output_types={"text": tf.string})

I haven’t tested this exact code, but you get the gist. It should be roughly equivalent to LineByLineTextDataset. Follow the PR for more info about dataset efficiency.

To add to @jaredtnielsen’s answer: I think you shouldn’t have any memory issues if you are using PyTorch + :hugs:nlp. Loading will be fully lazy from the drive, with almost nothing in RAM (just your current batch, plus the model of course).
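A minimal sketch of how that could look on the PyTorch side, assuming the text loader from the PR above lands (the DataLoader wiring is commented out since it needs nlp and torch installed; the collate function below is illustrative and leaves the tokenizer step out):

```python
# The real wiring would look roughly like this (untested):
#
#   from nlp import load_dataset
#   from torch.utils.data import DataLoader
#
#   dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
#   loader = DataLoader(dset, batch_size=32, shuffle=True, collate_fn=collate)

def collate(batch):
    """Each example arrives as {"text": str}; pull out the raw strings so a
    tokenizer can encode one batch at a time, keeping only the current
    batch of text in RAM."""
    return [example["text"] for example in batch]


# stand-in batch shaped like what the DataLoader would deliver
demo = collate([{"text": "hola"}, {"text": "mundo"}])
```

From there, the collate function would hand the list of strings to the tokenizer and a masking collator, as in the earlier suggestions.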


Good to know!

I am going to train a RoBERTa like model on almost 200 GB of Spanish corpus. I will use PyTorch + :hugs:nlp. Wish me luck! :wink: