1. RobIt
The idea is to use the Italian portion of mC4 (about 590 GB) and the Italian portion of the OSCAR dataset (about 137 GB) to pre-train a RoBERTa-base model.
2. Language
The model will be trained on Italian text.
3. Model used
RoBERTa-base
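As a concrete reference for the architecture, here is a minimal sketch of instantiating the model from scratch in Flax; the hyperparameter values are the standard RoBERTa-base defaults, not project-specific choices:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Standard RoBERTa-base hyperparameters: 12 layers, hidden size 768,
# 12 attention heads (roughly 125M parameters).
config = RobertaConfig(
    vocab_size=50_265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)

# Instantiate with randomly initialised weights for pre-training from scratch.
model = FlaxRobertaForMaskedLM(config)
```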
4. Datasets
OSCAR (Italian portion), Italian portion of the multilingual C4 (mC4) dataset
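A minimal sketch of loading both corpora with the `datasets` library in streaming mode, so the full ~730 GB never has to sit on disk at once. The dataset and config names follow the Hugging Face Hub listings but should be verified, and interleaving streamed datasets needs a recent `datasets` version:

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora instead of downloading them in full.
mc4_it = load_dataset("mc4", "it", split="train", streaming=True)
oscar_it = load_dataset(
    "oscar", "unshuffled_deduplicated_it", split="train", streaming=True
)

# Mix the two text streams into a single pre-training corpus.
corpus = interleave_datasets([mc4_it, oscar_it])

for example in corpus.take(3):
    print(example["text"][:80])
```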
5. Training Scripts:
There are already Flax scripts to pre-train RoBERTa in the Transformers repository (the masked-language-modeling example) that we can use with little modification.
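For orientation, here is a condensed sketch of the masked-LM training step those scripts implement; it is an illustration under our own assumptions (batch layout, optimizer settings), not the scripts' actual code:

```python
import jax
import jax.numpy as jnp
import optax
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

model = FlaxRobertaForMaskedLM(RobertaConfig())
optimizer = optax.adamw(learning_rate=1e-4)  # assumed settings
opt_state = optimizer.init(model.params)

def mlm_loss(params, batch):
    # batch["labels"] holds the original token ids at masked positions
    # and -100 elsewhere, as produced by the usual MLM data collators.
    # Dropout is omitted for brevity (train=False).
    logits = model(batch["input_ids"], params=params, train=False).logits
    labels = batch["labels"]
    mask = labels != -100
    log_probs = jax.nn.log_softmax(logits)
    picked = jnp.take_along_axis(
        log_probs, jnp.where(mask, labels, 0)[..., None], axis=-1
    ).squeeze(-1)
    # Average cross-entropy over the masked positions only.
    return -(picked * mask).sum() / mask.sum()

@jax.jit
def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(mlm_loss)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state, loss
```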
6. Challenges:
- The combined corpora come to roughly 730 GB, which is too much data for the available time. We will reduce the amount by random sampling (see the sketch after this list).
- Achieving state-of-the-art (SOTA) results.
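A sketch of one way to do the random sampling on a streamed dataset, keeping each document independently with a fixed probability; the 10% rate is an assumption to be tuned to the time budget, and filtering streamed datasets needs a recent `datasets` version:

```python
import random
from datasets import load_dataset

SAMPLE_RATE = 0.10  # assumed fraction, tune to the compute budget
rng = random.Random(42)

stream = load_dataset("mc4", "it", split="train", streaming=True)

# Bernoulli sampling: keep each document with probability SAMPLE_RATE.
sampled = stream.filter(lambda _example: rng.random() < SAMPLE_RATE)
```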
7. Desired Outcome:
- An Italian monolingual, well-performing model. We could optionally validate our results at a later stage by fine-tuning the model on downstream tasks (see the sketch below).
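A hypothetical sketch of that validation step, loading the released checkpoint for a downstream classification task; the checkpoint name and label count are placeholders:

```python
from transformers import AutoTokenizer, FlaxRobertaForSequenceClassification

# Placeholder repo name for wherever the pre-trained model gets published.
checkpoint = "robit/roberta-base-italian"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = FlaxRobertaForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2  # e.g. binary sentiment; task-dependent
)

inputs = tokenizer("Che bel film!", return_tensors="jax")
logits = model(**inputs).logits
```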
8. Reads
- RoBERTa paper: https://arxiv.org/pdf/1907.11692.pdf
- We found an existing Italian model based on RoBERTa, but we plan to use a larger dataset and a different tokenisation (see the tokenizer sketch below).
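For the different tokenisation, a sketch of training a byte-level BPE tokenizer (the scheme RoBERTa uses) from scratch with the `tokenizers` library; the input file and output directory are placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# Train on plain-text files dumped from the sampled Italian corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["italian_corpus.txt"],  # placeholder path
    vocab_size=50_265,             # matches RoBERTa-base's vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./robit-tokenizer")  # placeholder output directory
```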