Pretrain RoBERTa-large from scratch in Finnish

Finnish RoBERTa-large

The project idea is similar to the one for pretraining RoBERTa in Spanish, but using Finnish datasets instead.

The idea is to use the Finnish portion of mC4 (roughly 100 GB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. We might also try a few other datasets (OSCAR, STT, Yle News).
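As a first step the raw data can be inspected by streaming it with the `datasets` library; a minimal sketch, where "mc4" / "fi" are the dataset and config names on the Hugging Face Hub and the rest (sample size, printed characters) is illustrative:

```python
from datasets import load_dataset

# Stream the Finnish split so the ~100 GB of text never has to be fully
# downloaded before we can look at it.
mc4_fi = load_dataset("mc4", "fi", split="train", streaming=True)

# Peek at a few documents to sanity-check the text field.
for i, example in enumerate(mc4_fi):
    print(example["text"][:200])
    if i >= 2:
        break
```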

2. Language

The model will be trained in Finnish.

3. Model

RoBERTa-large
(Possibly other models as well, if time permits.)
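Since the model is trained from scratch, the architecture is defined by a fresh config rather than pretrained weights. A rough sketch of a RoBERTa-large-sized configuration (dimensions taken from the public roberta-large config; the vocabulary size is a placeholder that has to match the tokenizer we end up training):

```python
from transformers import RobertaConfig

# RoBERTa-large dimensions: 24 layers, hidden size 1024, 16 attention heads.
config = RobertaConfig(
    vocab_size=50265,             # placeholder, set to the trained tokenizer's size
    max_position_embeddings=514,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    type_vocab_size=1,
)
config.save_pretrained("./roberta-large-finnish")  # later passed to the Flax script via --config_name
```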

4. Datasets

Finnish portion of mC4, about 100 GB (see the loading sketch after this list)
Yle news dataset
STT news
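The Yle and STT news corpora are licensed resources (distributed e.g. via the Language Bank of Finland), so the sketch below simply assumes they are available locally as plain-text exports; the file paths, sampling probabilities, and mixing strategy are placeholders rather than a settled plan:

```python
from datasets import load_dataset, interleave_datasets

# Finnish mC4 from the Hub, streamed; keep only the text column so the
# schemas of the two streams match.
mc4_fi = load_dataset("mc4", "fi", split="train", streaming=True)
mc4_fi = mc4_fi.remove_columns(["timestamp", "url"])

# Hypothetical local plain-text exports of the licensed Yle / STT corpora,
# one document per line.
news = load_dataset(
    "text",
    data_files={"train": ["data/yle_fi.txt", "data/stt_fi.txt"]},
    split="train",
    streaming=True,
)

# Mix web text and news; the sampling probabilities are arbitrary placeholders.
train_stream = interleave_datasets([mc4_fi, news], probabilities=[0.9, 0.1], seed=42)
```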

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
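One step the MLM script does not do is create a tokenizer, so a Finnish byte-level BPE tokenizer (the RoBERTa style) has to be trained first. A rough sketch with the `tokenizers` library, assuming a plain-text dump of (part of) the corpus; the input file and vocabulary size are placeholders:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer (as used by RoBERTa) on a text dump of the
# Finnish corpus; the input file and vocab size are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/mc4_fi_sample.txt"],
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("./roberta-large-finnish", exist_ok=True)
tokenizer.save_model("./roberta-large-finnish")  # writes vocab.json + merges.txt
```

Pretraining itself would then be launched with `run_mlm_flax.py` from the linked examples, pointing `--config_name` and `--tokenizer_name` at the directory above and setting `--max_seq_length` to 256 for the first stage and 512 for the second.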

6. Challenges

Will there be enough data to train a good model?
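For scale, the original RoBERTa was pretrained on roughly 160 GB of English text, so ~100 GB of Finnish mC4 is at least in the same ballpark. One rough way to gauge the corpus size in tokens is to tokenize a small sample of the stream and extrapolate; the sample size and tokenizer path below are placeholders:

```python
from datasets import load_dataset
from transformers import RobertaTokenizerFast

# Tokenize a small sample of the stream to estimate tokens per document.
tokenizer = RobertaTokenizerFast.from_pretrained("./roberta-large-finnish")
sample = load_dataset("mc4", "fi", split="train", streaming=True).take(1000)

token_counts = [len(tokenizer(ex["text"]).input_ids) for ex in sample]
print(f"average tokens per document in the sample: {sum(token_counts) / len(token_counts):.0f}")
```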

7. Desired project outcome

A monolingual Finnish model that performs well on the usual benchmarks. We hope to beat the current SOTA model for this task, TurkuNLP/finbert.
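Once a checkpoint exists, a quick qualitative sanity check of the masked-LM head could look like the sketch below; the checkpoint path and the example sentence are placeholders, and the actual comparison with FinBERT would go through the standard fine-tuning benchmarks:

```python
import jax.numpy as jnp
from transformers import FlaxRobertaForMaskedLM, RobertaTokenizerFast

# Placeholder checkpoint directory produced by the pretraining run.
model_dir = "./roberta-large-finnish"
tokenizer = RobertaTokenizerFast.from_pretrained(model_dir)
model = FlaxRobertaForMaskedLM.from_pretrained(model_dir)

# "Helsinki is Finland's <mask>." -- ideally "pääkaupunki" (capital) ranks near the top.
inputs = tokenizer("Helsinki on Suomen <mask>.", return_tensors="np")
logits = model(**inputs).logits

mask_pos = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
top5 = jnp.argsort(-logits[0, mask_pos])[:5]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```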

8. Reads

- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf

Great! Finalizing this project :slight_smile: