Rather than training separate models for Swedish, Norwegian, Danish and Icelandic [1][2][3], we could probably produce a better model by pretraining a single Scandinavian model and then finetuning it to one of the four, considering how similar the four languages are (in their written form, that is; cough cough, Danish).
We can train a RoBERTa-large model on the combined mC4 dataset, containing 386 GB of uncompressed text (179 GB Swedish, 107 GB Danish and 100 GB Norwegian). Furthermore, there are gigaword datasets in Swedish, Danish and Icelandic that we could use. As suggested in [1], we could start training the model with a sequence length of 128, then 256 and lastly 512.
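As a rough sketch of how that 128 → 256 → 512 curriculum could look on the data side (tokenize once, then concatenate and re-chunk the token stream into longer blocks per phase). The function name and the phase split below are illustrative assumptions, not taken from any existing training script:

```python
# Sketch of the 128 -> 256 -> 512 curriculum: tokenize the corpus once, then
# regroup the token stream into longer fixed-size blocks for each phase.
# The phase fractions are an assumption, not a recommendation from [1].
from itertools import chain

def group_texts(examples, block_size):
    """Concatenate already-tokenized examples and split them into fixed-size blocks."""
    concatenated = list(chain.from_iterable(examples["input_ids"]))
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    return {"input_ids": blocks}

# (sequence length, fraction of total training steps) -- illustrative only
phases = [(128, 0.7), (256, 0.2), (512, 0.1)]

# Tiny demo with fake token ids.
demo = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9]]}
print(group_texts(demo, block_size=4))  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

With the `datasets` library this would typically be applied per phase via `dataset.map(..., batched=True)` before handing the blocks to the masked-LM training loop.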
2. Language
The model will be trained in Swedish, Danish, Norwegian and Icelandic.
3. Model
RoBERTa-large
4. Datasets
Swedish:
mC4 (179 GB)
Gigaword [4] (~9 GB compressed)
Danish:
mC4 (107 GB)
Gigaword [5] (~2 GB compressed)
Norwegian:
mC4 (100 GB)
Icelandic:
mC4 (9 GB)
Gigaword [6] (~14 GB compressed)
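As a sketch of how the mC4 portions listed above could be combined with the `datasets` library (streaming, so the full 386 GB is not downloaded up front). The config names and the size-proportional sampling scheme are assumptions to verify, and the mC4 data has since also been published under `allenai/c4`:

```python
# Sketch: stream the four mC4 language configs and interleave them, sampling
# roughly in proportion to corpus size. Config names ("sv", "da", "no", "is")
# and the proportional sampling are assumptions, not a fixed recipe.
from datasets import load_dataset, interleave_datasets

sizes_gb = {"sv": 179, "da": 107, "no": 100, "is": 9}
total = sum(sizes_gb.values())

streams = [load_dataset("mc4", lang, split="train", streaming=True) for lang in sizes_gb]
probabilities = [size / total for size in sizes_gb.values()]

combined = interleave_datasets(streams, probabilities=probabilities, seed=42)
print(next(iter(combined))["text"][:200])
```

The gigaword corpora [4][5][6] could later be mixed in the same way once they are available as text files or dataset scripts.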
5. Training scripts
There are already Flax scripts to pre-train RoBERTa that we can easily use.
Great idea! This will be super helpful for Scandinavian SMEs which often operate across Nordic country borders and want to support multiple Scandinavian languages.
Great idea. There are a few other dataset resources as well here, like the Wikipedia dumps and some Reddit/Twitter datasets that might increase the variety/quality of the data. I do however think training a large RoBERTa on a v3-8 is not realistic: memory limitations will force the batch size to be too small, even at sequence length 128. Using a base-model architecture will probably lead to a better result.
And the tokenizer is ByteLevelBPETokenizer and not WordPiece so I have very little to add =D
As I understand byte-pair encoding, this is clever because we have very common spelling substitutions between the languages, e.g. s/c, æ/ä, ø/ö, and å/aa.
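For what it's worth, a minimal sketch of how the shared byte-level BPE vocabulary could be trained with the `tokenizers` library; the file paths are placeholder per-language text dumps, and the vocabulary size simply mirrors the original RoBERTa value:

```python
# Train one shared byte-level BPE vocabulary across all four languages.
# File paths are hypothetical; 50,265 matches the vocabulary size of the
# original RoBERTa models.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["sv.txt", "da.txt", "no.txt", "is.txt"],
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("scandinavian-roberta-tokenizer", exist_ok=True)
tokenizer.save_model("scandinavian-roberta-tokenizer")
```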
With regard to model size there are obviously some benefits from multiples of 8 (or 32), but could you maybe also take graphics card RAM sizes into account? Something that can sit comfortably in 8 GB and 12 GB would play pretty well with the current (30-series) and previous generation of Nvidia cards (too bad for my 2080 Ti with 11 GB).
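As a back-of-the-envelope check (parameter counts are the commonly quoted ~125M for base and ~355M for large; everything else is a rough assumption, and activation memory comes on top):

```python
# Rough memory footprint of the two RoBERTa sizes: fp32 weights alone
# (roughly what inference needs) vs. weights + gradients + Adam moments
# (a lower bound for full finetuning). Activations are not included.
def gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

for name, n_params in [("roberta-base", 125_000_000), ("roberta-large", 355_000_000)]:
    print(f"{name}: {gib(n_params, 4):.2f} GiB fp32 weights, "
          f"~{gib(n_params, 16):.2f} GiB with Adam state")
```

So the large model already eats a big chunk of an 8 GB card once optimizer state and activations are counted, while the base model leaves plenty of headroom.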
And do you have anything planned to figure out if Icelandic will actually benefit from this? It might be too small a data set and too different a language?
I am joining this project, but I am also initiating a project trying to use T5 for translating between Norwegian Bokmål and Norwegian Nynorsk (Model to translate between Norwegian Bokmål and Norwegian Nynorsk). There might be some overlap here, at least since we can use some of the same corpora. If anyone also wants to work on a Scandinavian seq2seq model, show your interest in that thread. (Unfortunately I do not think a Scandinavian parallel corpus exists…)
As Patrick mentioned above, this project is going to happen!
I’ve set up a repository with a preliminary template here. This includes training scripts and such, and in the readme I’ve included some helpful links related to our project and to the community week in general.
I intended to include Faroese, but the data is just so scarce. I mean, we might be able to include it, but it seems like there is less than 1 GB of data, so I'm guessing it won't perform that well. But we could try! Considering the similarity between Faroese and Icelandic (from what I've heard), the model might be decent for Faroese as well.
We’re thinking about meeting up (virtually) to say hi to each other and also do some initial planning: what needs to be done, who wants to do what, and so on. What times are you available?