Pretrain RoBERTa-large from scratch in Finnish

Finnish RoBERTa-large

The project idea is similar to the one for pretraining RoBERTa in Spanish, but using Finnish datasets instead.

The idea is to use the Finnish portion of mC4 (roughly 100 GB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. We might also try a few other datasets (OSCAR, STT, Yle News).
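As a first step the raw data can be inspected by streaming it with the `datasets` library; a minimal sketch, where "mc4" / "fi" are the dataset and config names on the Hugging Face Hub and the rest (sample size, printed characters) is illustrative:

```python
from datasets import load_dataset

# Stream the Finnish split so the ~100 GB of text never has to be fully
# downloaded before we can look at it.
mc4_fi = load_dataset("mc4", "fi", split="train", streaming=True)

# Peek at a few documents to sanity-check the text field.
for i, example in enumerate(mc4_fi):
    print(example["text"][:200])
    if i >= 2:
        break
```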

2. Language

The model will be trained in Finnish.

3. Model

RoBERTa-large
(Possibly other models as well, if time permits.)
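Since the model is trained from scratch, the architecture is defined by a fresh config rather than pretrained weights. A rough sketch of a RoBERTa-large-sized configuration (dimensions taken from the public roberta-large config; the vocabulary size is a placeholder that has to match the tokenizer we end up training):

```python
from transformers import RobertaConfig

# RoBERTa-large dimensions: 24 layers, hidden size 1024, 16 attention heads.
config = RobertaConfig(
    vocab_size=50265,             # placeholder, set to the trained tokenizer's size
    max_position_embeddings=514,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    type_vocab_size=1,
)
config.save_pretrained("./roberta-large-finnish")  # later passed to the Flax script via --config_name
```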

4. Datasets

Finnish portion of mC4, about 100 GB (see the loading sketch after this list)
Yle news dataset
STT news
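The Yle and STT news corpora are licensed resources (distributed e.g. via the Language Bank of Finland), so the sketch below simply assumes they are available locally as plain-text exports; the file paths, sampling probabilities, and mixing strategy are placeholders rather than a settled plan:

```python
from datasets import load_dataset, interleave_datasets

# Finnish mC4 from the Hub, streamed; keep only the text column so the
# schemas of the two streams match.
mc4_fi = load_dataset("mc4", "fi", split="train", streaming=True)
mc4_fi = mc4_fi.remove_columns(["timestamp", "url"])

# Hypothetical local plain-text exports of the licensed Yle / STT corpora,
# one document per line.
news = load_dataset(
    "text",
    data_files={"train": ["data/yle_fi.txt", "data/stt_fi.txt"]},
    split="train",
    streaming=True,
)

# Mix web text and news; the sampling probabilities are arbitrary placeholders.
train_stream = interleave_datasets([mc4_fi, news], probabilities=[0.9, 0.1], seed=42)
```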

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
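One step the MLM script does not do is create a tokenizer, so a Finnish byte-level BPE tokenizer (the RoBERTa style) has to be trained first. A rough sketch with the `tokenizers` library, assuming a plain-text dump of (part of) the corpus; the input file and vocabulary size are placeholders:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer (as used by RoBERTa) on a text dump of the
# Finnish corpus; the input file and vocab size are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/mc4_fi_sample.txt"],
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("./roberta-large-finnish", exist_ok=True)
tokenizer.save_model("./roberta-large-finnish")  # writes vocab.json + merges.txt
```

Pretraining itself would then be launched with `run_mlm_flax.py` from the linked examples, pointing `--config_name` and `--tokenizer_name` at the directory above and setting `--max_seq_length` to 256 for the first stage and 512 for the second.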

6. Challenges

Will there be enough data to train a good model?
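For scale, the original RoBERTa was pretrained on roughly 160 GB of English text, so ~100 GB of Finnish mC4 is at least in the same ballpark. One rough way to gauge the corpus size in tokens is to tokenize a small sample of the stream and extrapolate; the sample size and tokenizer path below are placeholders:

```python
from datasets import load_dataset
from transformers import RobertaTokenizerFast

# Tokenize a small sample of the stream to estimate tokens per document.
tokenizer = RobertaTokenizerFast.from_pretrained("./roberta-large-finnish")
sample = load_dataset("mc4", "fi", split="train", streaming=True).take(1000)

token_counts = [len(tokenizer(ex["text"]).input_ids) for ex in sample]
print(f"average tokens per document in the sample: {sum(token_counts) / len(token_counts):.0f}")
```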

7. Desired project outcome

A monolingual Finnish model that performs well on the usual benchmarks. We hope to beat the current SOTA model for this task, TurkuNLP/finbert.
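Once a checkpoint exists, a quick qualitative sanity check of the masked-LM head could look like the sketch below; the checkpoint path and the example sentence are placeholders, and the actual comparison with FinBERT would go through the standard fine-tuning benchmarks:

```python
import jax.numpy as jnp
from transformers import FlaxRobertaForMaskedLM, RobertaTokenizerFast

# Placeholder checkpoint directory produced by the pretraining run.
model_dir = "./roberta-large-finnish"
tokenizer = RobertaTokenizerFast.from_pretrained(model_dir)
model = FlaxRobertaForMaskedLM.from_pretrained(model_dir)

# "Helsinki is Finland's <mask>." -- ideally "pääkaupunki" (capital) ranks near the top.
inputs = tokenizer("Helsinki on Suomen <mask>.", return_tensors="np")
logits = model(**inputs).logits

mask_pos = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
top5 = jnp.argsort(-logits[0, mask_pos])[:5]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```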

8. Reads

- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf

Great! Finalizing this project :slight_smile: