Finnish RoBERTa-large
1. Project idea
The project idea is similar to the Pretraining RoBERTa in Spanish proposal, but using Finnish datasets instead.
The plan is to use the Finnish portion of mC4 (roughly 100 GB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. We may also try a few other datasets (OSCAR, STT, Yle News).
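As a quick feasibility check, the snippet below streams the Finnish split of mC4 through the `datasets` library instead of downloading the full ~100 GB up front. This is a minimal sketch assuming the Hub's `mc4` loader with the `fi` language config; the exact dataset id should be verified.

```python
from datasets import load_dataset

# Stream the Finnish split of mC4 so the ~100 GB corpus is not downloaded up front.
# Assumes the Hub's "mc4" loader with the "fi" language config.
mc4_fi = load_dataset("mc4", "fi", split="train", streaming=True)

# Peek at one document to sanity-check the text field.
print(next(iter(mc4_fi))["text"][:200])
```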
2. Language
The model will be trained in Finnish.
3. Model
RoBERTa-large
(Maybe also some other models if time permits)
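For the main model, a randomly initialised RoBERTa-large can be set up in Flax by reusing the roberta-large architecture config. This is a minimal sketch assuming the standard transformers API; the output directory is a placeholder, and the vocab size would need to match whatever tokenizer we end up training.

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Reuse the roberta-large architecture (24 layers, hidden size 1024) with random weights.
config = RobertaConfig.from_pretrained("roberta-large")
# NOTE: config.vocab_size must be changed to match our own Finnish tokenizer.
model = FlaxRobertaForMaskedLM(config)

# Save the config so the pre-training script can pick it up via --config_name.
config.save_pretrained("./finnish-roberta-large")  # placeholder path
```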
4. Datasets
Finnish portion of mC4 (about 100 GB)
Yle news dataset
STT news (see the sketch after this list for combining the news data with mC4)
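If the extra corpora work out, one way to combine them with mC4 is to interleave the streams, sampling mostly from mC4 and occasionally from the much smaller news data. This is a hedged sketch assuming a recent `datasets` version; the local file names for the Yle and STT dumps are hypothetical placeholders, since both corpora are licensed and not on the Hub.

```python
from datasets import load_dataset, interleave_datasets

mc4_fi = load_dataset("mc4", "fi", split="train", streaming=True)
# Hypothetical plain-text dumps of the licensed news corpora.
news = load_dataset(
    "text",
    data_files={"train": ["yle_fi.txt", "stt_fi.txt"]},
    split="train",
    streaming=True,
)

# Sample 90% of examples from mC4 and 10% from the news data.
mixed = interleave_datasets([mc4_fi, news], probabilities=[0.9, 0.1], seed=42)
```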
5. Training scripts
There are already Flax scripts to pre-train RoBERTa that we can easily use:
https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
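The script expects a tokenizer, so training one from scratch is a natural first step. Below is a minimal sketch of training a RoBERTa-style byte-level BPE tokenizer with the `tokenizers` library; the input file and output directory are placeholders, and the vocab size simply copies roberta-large's.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["mc4_fi_shard.txt"],  # hypothetical plain-text dump of (part of) the corpus
    vocab_size=50265,            # roberta-large's vocab size; open to tuning for Finnish
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Writes vocab.json and merges.txt for the pre-training script's --tokenizer_name.
tokenizer.save_model("./finnish-roberta-large")
```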
6. Challenges
Will the data be sufficient to train a good model?
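If raw volume turns out to be marginal, data quality matters all the more. The CCNet paper in the reads below filters web crawl data with language-model perplexity; the sketch here is a much simpler heuristic stand-in (the thresholds are arbitrary assumptions, not tuned values), reusing the streaming dataset from the first sketch.

```python
def looks_clean(example):
    """Crude quality heuristic: drop very short or mostly non-alphabetic pages."""
    text = example["text"]
    if len(text.split()) < 20:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha_ratio > 0.7

# Assumes a datasets version where streaming datasets support .filter().
filtered = mc4_fi.filter(looks_clean)
```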
7. Desired project outcome
A monolingual Finnish model that performs well on the usual benchmarks. We hope to beat the current SOTA model for this task, TurkuNLP/finbert.
8. Reads
- RoBERTa: A Robustly Optimized BERT Pretraining Approach: https://arxiv.org/pdf/1907.11692.pdf
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: https://arxiv.org/pdf/1911.00359.pdf
- KenLM: Faster and Smaller Language Model Queries: https://www.aclweb.org/anthology/W11-2123.pdf