I intend to train a language model from scratch, and I have been following the tutorial on Hugging Face’s blog. However, when I ran the code from their Google Colab notebook, I hit an out-of-memory error. The notebook uses LineByLineTextDataset. After digging through the source code, I found that LineByLineTextDataset loads the entire dataset eagerly, so the error is no surprise: my dataset is larger than my RAM.
The article hints that if my dataset is bigger than my capacity, I could “opt to load and tokenize examples on the fly, rather than as a preprocessing step.” However, I’m not certain how to achieve this. I would be very grateful if someone could point me in the right direction, especially if I can still use transformers.Trainer.
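
To make the question concrete, here is a minimal sketch of what "load and tokenize on the fly" might look like. It is not taken from the article. The class name `LazyLineDataset`, the `encode` callable, and `toy_encode` are all hypothetical names made up for illustration. The idea is to scan the file once and store only the byte offset of each non-empty line, then read and tokenize a single line whenever an item is requested.

```python
class LazyLineDataset:
    """Map-style dataset (hypothetical) that tokenizes one line per access."""

    def __init__(self, path, encode):
        self.path = path
        # `encode` is an assumed callable: str -> dict of model inputs.
        self.encode = encode
        # Keep only the byte offset of each non-empty line, not the text.
        self.offsets = []
        with open(path, "rb") as f:
            pos = 0
            for raw in f:
                if raw.strip():
                    self.offsets.append(pos)
                pos += len(raw)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Read exactly one line from disk, then tokenize it on demand.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8").strip()
        return self.encode(line)


def toy_encode(text):
    # Stand-in for a real tokenizer so the sketch runs without dependencies.
    return {"input_ids": [ord(c) for c in text]}
```

With a real tokenizer, `encode` could be something like `lambda s: tokenizer(s, truncation=True, max_length=128)`, and an instance could presumably be passed as `train_dataset` to transformers.Trainer, since a map-style dataset only needs `__len__` and `__getitem__` (subclassing `torch.utils.data.Dataset` would be the more conventional choice). I have not verified this end to end. The offset list still grows with the number of lines, but it should be far smaller than the tokenized data.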