Training a LM from scratch on a large corpus

Hi guys, I want to train a RoBERTa-like model for Spanish on a corpus of approximately 200 GB. I have already trained one following the original RoBERTa implementation available on GitHub, but only on 20 GB.
Now I want to use HF libraries/tooling, and of course I would like to use HF/nlp. I have been testing locally with different configurations. I got one that “works”, but after some training steps I always get an error (CUBLAS ERROR).
I leave you here the colab in order to reproduce everything:
https://colab.research.google.com/drive/15m7F30TIdAw0wtyVM9_MSXmGgt6L7W0M?usp=sharing
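For anyone who wants to compare setups without opening the colab: here is a minimal sketch of what I mean by training a RoBERTa-like model from scratch with `transformers` (the config sizes are tiny placeholders for illustration, not the real Spanish model). One thing worth checking, since in my experience CUBLAS/CUDA errors during training are often really an out-of-range embedding index surfacing later: the tokenizer's vocabulary size must match the model's `vocab_size`.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Tiny illustrative config; a real run would use roberta-base-style sizes
# (hidden_size=768, num_hidden_layers=12, etc.).
config = RobertaConfig(
    vocab_size=52_000,          # must match the trained tokenizer's vocab size
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    max_position_embeddings=514,  # RoBERTa convention: max_len + 2
)
model = RobertaForMaskedLM(config)

# Sanity-check a forward pass on CPU with random token ids.
# If any input id were >= vocab_size, this is where a CUDA run
# would typically blow up with an opaque CUBLAS/device-side error.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
labels = input_ids.clone()
out = model(input_ids=input_ids, labels=labels)
print(out.logits.shape)  # (batch, seq_len, vocab_size)
```

In the real setup this model would go into a `Trainer` with a `DataCollatorForLanguageModeling` for the MLM objective; the sketch above just isolates the shape/vocab consistency check.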

Thanks in advance!