Rather than training separate models for Swedish, Norwegian, Danish and Icelandic, we could probably produce a better model by pretraining a single Scandinavian model and then fine-tuning it for each of the four, considering how similar the four languages are (in their written form, that is; cough cough, Danish).
We can train a RoBERTa-large model on the combined mC4 dataset, containing 386 GB of uncompressed text (179 GB Swedish, 107 GB Danish and 100 GB Norwegian). Furthermore, there are gigaword datasets in Swedish, Danish and Icelandic that we could use. As suggested in , we could start training the model with a sequence length of 128, then 256 and finally 512.
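As a rough illustration of that schedule, the step that packs tokenized text into fixed-length blocks can simply be re-run with a larger block size between training phases. This is only a sketch; the function and variable names below are illustrative, not taken from an existing script.

```python
from itertools import chain

def make_group_texts(block_size):
    """Concatenate tokenized examples and cut them into fixed-size blocks."""
    def group_texts(examples):
        # examples["input_ids"] is a batch of token-id lists from the tokenizer.
        concatenated = list(chain.from_iterable(examples["input_ids"]))
        total_length = (len(concatenated) // block_size) * block_size
        return {
            "input_ids": [
                concatenated[i : i + block_size]
                for i in range(0, total_length, block_size)
            ]
        }
    return group_texts

# Phase 1 uses short, cheap sequences; later phases reuse the same data
# with block_size 256 and then 512.
# tokenized_dataset.map(make_group_texts(128), batched=True)
```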
The model will be trained on Swedish, Danish, Norwegian and Icelandic, using the datasets below; a sketch of how the corpora could be combined follows the list.
Swedish:
- mC4 (179 GB)
- Gigaword (~9 GB compressed)

Danish:
- mC4 (107 GB)
- Gigaword (~2 GB compressed)

Norwegian:
- mC4 (100 GB)

Icelandic:
- mC4 (9 GB)
- Gigaword (~14 GB compressed)
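One way to combine the per-language corpora is to stream each split and interleave them. This is a minimal sketch assuming the mC4 data is loaded through the datasets library with per-language configurations ("sv", "da", "no", "is"); the gigaword corpora would need their own loading steps.

```python
from datasets import load_dataset, interleave_datasets

# Approximate raw mC4 sizes from the list above, in GB.
sizes_gb = {"sv": 179, "da": 107, "no": 100, "is": 9}

# Stream each language split so the full corpus never has to fit on disk.
streams = [
    load_dataset("mc4", lang, split="train", streaming=True)
    for lang in sizes_gb
]

# Mix the streams in proportion to their raw sizes; Icelandic then makes up
# only ~2% of samples, which relates to the open questions below.
total = sum(sizes_gb.values())
probabilities = [size / total for size in sizes_gb.values()]

combined = interleave_datasets(streams, probabilities=probabilities, seed=42)
```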
There are already Flax scripts to pretrain RoBERTa that we can reuse with little modification.
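The sketch below shows roughly what such a script (for example, the masked-language-modelling example `run_mlm_flax.py` in the transformers repository) sets up before its training loop; it is not the script itself, and the vocabulary size assumes a tokenizer trained on the combined corpus beforehand.

```python
import jax.numpy as jnp
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Reuse the roberta-large architecture, but with a vocabulary sized for a
# tokenizer trained on the Scandinavian corpus (50265 is just a placeholder).
config = RobertaConfig.from_pretrained("roberta-large", vocab_size=50265)

# Randomly initialised Flax model in bfloat16, as typically used on TPUs.
model = FlaxRobertaForMaskedLM(config, seed=0, dtype=jnp.dtype("bfloat16"))
```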
- Will the data be enough to train a good model?
- Will all the languages be well-represented? (One possible mitigation is sketched below.)
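The proposal does not settle the representation question, but a common approach in multilingual pretraining is to smooth the per-language sampling probabilities with an exponent so that low-resource languages are upsampled. The sketch below assumes alpha = 0.3 and reuses the mC4 sizes listed above; the exact value would be a hyperparameter to tune, not something fixed by this proposal.

```python
# Smoothed sampling probabilities: p_lang is proportional to size_lang ** alpha.
sizes_gb = {"sv": 179, "da": 107, "no": 100, "is": 9}
alpha = 0.3  # assumed smoothing exponent, not part of the original proposal

smoothed = {lang: size ** alpha for lang, size in sizes_gb.items()}
total = sum(smoothed.values())
sampling_probs = {lang: value / total for lang, value in smoothed.items()}

# Icelandic rises from ~2% of the raw data to roughly 13% of training samples.
print(sampling_probs)
```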
A Scandinavian language model that performs well on the usual benchmarks in each of the four languages.
- The Swedish Culturomics Gigaword Corpus (Språkbanken Text)
- Icelandic Gigaword Corpus