Training a LM from scratch on a large corpus

Hi guys, I want to train a RoBERTa-like model for Spanish on a corpus of approximately 200 GB. I have already trained one following the original RoBERTa implementation available on GitHub, but only on 20 GB.
Now I want to use HF libraries/tooling, and of course I would like to use HF/nlp. I have been testing locally with different configurations. I got one that “works”, but after some training steps I always get an error (CUBLAS ERROR).
I leave you here the colab in order to reproduce everything:
https://colab.research.google.com/drive/15m7F30TIdAw0wtyVM9_MSXmGgt6L7W0M?usp=sharing
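For anyone who wants to compare setups without opening the colab: here is a minimal sketch of what I mean by training a RoBERTa-like model from scratch with `transformers` (the config sizes are tiny placeholders for illustration, not the real Spanish model). One thing worth checking, since in my experience CUBLAS/CUDA errors during training are often really an out-of-range embedding index surfacing later: the tokenizer's vocabulary size must match the model's `vocab_size`.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Tiny illustrative config; a real run would use roberta-base-style sizes
# (hidden_size=768, num_hidden_layers=12, etc.).
config = RobertaConfig(
    vocab_size=52_000,          # must match the trained tokenizer's vocab size
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    max_position_embeddings=514,  # RoBERTa convention: max_len + 2
)
model = RobertaForMaskedLM(config)

# Sanity-check a forward pass on CPU with random token ids.
# If any input id were >= vocab_size, this is where a CUDA run
# would typically blow up with an opaque CUBLAS/device-side error.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
labels = input_ids.clone()
out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)  # (batch, seq_len, vocab_size)
```

In the real setup this model would go into a `Trainer` with a `DataCollatorForLanguageModeling` for the MLM objective; the sketch above just isolates the shape/vocab consistency check.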

Thanks in advance!