Pretrain RoBERTa-large from scratch in Swedish

RoBERTa Swedish

The project idea is identical to the one for Pretraining RoBERTa in Spanish, but using the Swedish dataset instead.

The idea is to use the Swedish portion of mC4 (which amounts to roughly 100 GB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. It might be a good idea to start at 128, as suggested here. A sketch of the re-chunking this implies is shown below.
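For illustration, a rough sketch of how the same tokenized corpus can be re-chunked into blocks of 128, 256 and then 512 tokens for the different phases, loosely following the `group_texts` helper used in the Flax MLM example. The tokenization step and the `tokenized_dataset` name are assumptions, not part of this proposal:

```python
from itertools import chain

def group_texts(examples, max_seq_length):
    """Concatenate tokenized examples and split them into blocks of max_seq_length."""
    concatenated = list(chain(*examples["input_ids"]))
    # Drop the remainder so every block has exactly max_seq_length tokens.
    total_length = (len(concatenated) // max_seq_length) * max_seq_length
    return {
        "input_ids": [
            concatenated[i : i + max_seq_length]
            for i in range(0, total_length, max_seq_length)
        ]
    }

# The same map call can be re-run with 128, 256 and then 512 for each phase, e.g.:
# tokenized_dataset.map(lambda ex: group_texts(ex, 128), batched=True)
```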

2. Language

The model will be trained in Swedish.

3. Model

RoBERTa-large
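For reference, a minimal sketch of instantiating a randomly initialised RoBERTa-large in Flax. The hyperparameters are the standard roberta-large values; the `vocab_size` is a placeholder that has to match whatever Swedish tokenizer we end up training:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50265,            # placeholder, depends on the Swedish tokenizer
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=514,
)

# Randomly initialised model, ready for pre-training with the Flax MLM script.
model = FlaxRobertaForMaskedLM(config, seed=0)
```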

4. Datasets

Swedish portion of mC4, about 100 GB of uncompressed data.
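A minimal sketch of streaming the Swedish split of mC4 so the ~100 GB never has to be fully downloaded to disk, assuming the `mc4` dataset with the `sv` config on the Hugging Face Hub:

```python
from datasets import load_dataset

# Streaming avoids materialising the full Swedish split locally.
swedish_mc4 = load_dataset("mc4", "sv", split="train", streaming=True)

# Peek at a few documents to sanity-check the text field.
for example in swedish_mc4.take(3):
    print(example["text"][:100])
```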

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
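Since we are training from scratch, a Swedish tokenizer also has to be trained before launching the MLM script. A rough sketch using the `tokenizers` library; the vocabulary size, number of batches and output path are placeholders:

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("mc4", "sv", split="train", streaming=True)

def batch_iterator(batch_size=1000, num_batches=10_000):
    """Yield batches of raw Swedish text from the streamed dataset."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
            num_batches -= 1
            if num_batches == 0:
                return

# Byte-level BPE with RoBERTa-style special tokens.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("tokenizer-sv.json")
```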

6. Challenges

Will the data be enough to train a good model?

7. Desired project outcome

A monolingual Swedish model that performs well on the usual benchmarks.

8. Reads

- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf

Just a suggestion: How about we pretrain a common Scandinavian model instead, as our languages are so similar? I’ve made a project suggestion here: Scandinavian RoBERTa.

Officially defining this one as well 🙂