I am trying to follow the Hugging Face article "How to train a new language model from scratch using Transformers and Tokenizers", but my Colab session crashes after using all available RAM. It happens when running the function that builds the training Dataset.
I am using a Sinhala language dataset for this. The dataset is about 250MB, and I am loading it from Google Drive.
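For context, here is a minimal pure-Python sketch of what I understand the dataset builder in the article does (this is my assumption about the crash, not the article's actual code; the stand-in `encode` function below is hypothetical): it reads the whole file and keeps every tokenized line in memory at once, so a 250MB corpus can balloon far beyond that once token ids are materialized as Python objects for each line.

```python
import os
import tempfile

def build_in_memory_dataset(file_path, encode):
    # Reads the ENTIRE file, then tokenizes every line up front.
    # All tokenized examples live in RAM simultaneously, which is
    # (I suspect) what exhausts Colab's memory on a 250MB corpus.
    with open(file_path, encoding="utf-8") as f:
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    return [encode(ln) for ln in lines]

# Tiny demo with a stand-in "tokenizer" (hypothetical; the real code
# would use the tokenizer trained earlier in the article).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("සිංහල පේළිය එක\nසිංහල පේළිය දෙක\n")
    path = tmp.name

examples = build_in_memory_dataset(path, encode=lambda s: s.split())
print(len(examples))  # 2 examples, all held in memory at once
os.unlink(path)
```

If this is indeed the pattern, I would expect a streaming or memory-mapped dataset to avoid the crash, but I am not sure how to adapt the article's code for that.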
Here is the link to the Colab notebook: https://colab.research.google.com/drive/1o4kPVNZHEmT2BlqKeVzY5rNyaojAj1At?usp=sharing
Please go through this and tell me what I am missing here?
Thanks in advance!