Scandinavian RoBERTa

Rather than training separate models for Swedish, Norwegian, Danish and Icelandic [1][2][3], we could probably produce a better model by pretraining a single Scandinavian model and then finetuning it to each of the four languages, considering how similar they are (in their written form, that is; cough cough, Danish).

We can train a RoBERTa-large model on the combined mC4 dataset, containing 386 GB of uncompressed text (179 GB Swedish, 107 GB Danish and 100 GB Norwegian). Furthermore, there are gigaword datasets in Swedish, Danish and Icelandic that we could use. As suggested in [1], we could start training the model with a sequence length of 128, then 256 and lastly 512.
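The staged sequence-length idea could be sketched as a simple schedule (the 70/20/10 step split below is just an illustrative assumption, not a tested recipe):

```python
# Hypothetical staged pretraining schedule: spend most steps at short
# sequence length, then extend to longer contexts (as suggested in [1]).
def seq_len_schedule(total_steps):
    """Yield (max_seq_len, num_steps) phases; the split is an assumption."""
    phases = [(128, 0.7), (256, 0.2), (512, 0.1)]
    for seq_len, frac in phases:
        yield seq_len, int(total_steps * frac)

for seq_len, steps in seq_len_schedule(100_000):
    print(f"train {steps} steps at max_seq_len={seq_len}")
```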

2. Language

The model will be trained in Swedish, Danish, Norwegian and Icelandic.

3. Model

RoBERTa-large

4. Datasets

  • Swedish:
    • mC4 (179 GB)
    • Gigaword [4] (~9 GB compressed)
  • Danish
    • mC4 (107 GB)
    • Gigaword [5] (~2 GB compressed)
  • Norwegian
    • mC4 (100 GB)
  • Icelandic
    • mC4 (9 GB)
    • Gigaword [6] (~14 GB compressed)
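A quick sanity check on the mC4 proportions above (gigaword sizes are compressed and so left out) shows how skewed the language mix would be, which feeds directly into the representation question in the Challenges section:

```python
# mC4 sizes (GB, uncompressed) per language, taken from the list above.
mc4_gb = {"sv": 179, "da": 107, "no": 100, "is": 9}
total = sum(mc4_gb.values())  # 395 GB including Icelandic
shares = {lang: round(gb / total * 100, 1) for lang, gb in mc4_gb.items()}
print(total, shares)  # Icelandic is only ~2.3% of the raw mC4 data
```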

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
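A run might look roughly like the following (a sketch only: the output/config/tokenizer paths and all hyperparameter values are placeholders, not tested settings):

```shell
# Sketch of invoking the Flax masked-LM example script from the repo above.
python run_mlm_flax.py \
    --output_dir="./scandi-roberta" \
    --model_type="roberta" \
    --config_name="./scandi-roberta" \
    --tokenizer_name="./scandi-roberta" \
    --dataset_name="mc4" \
    --dataset_config_name="sv" \
    --max_seq_length="128" \
    --per_device_train_batch_size="32" \
    --learning_rate="1e-4" \
    --warmup_steps="1000" \
    --num_train_epochs="1"
```

For the combined corpus we would need to interleave the four mC4 configs rather than pass a single `dataset_config_name`.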

6. Challenges

  • Will there be enough data to train a good model?
  • Will all the languages be well-represented?

7. Desired project outcome

A Scandinavian language model which performs well on the usual benchmarks, on each of the four languages.

8. Reads

[1] https://discuss.huggingface.co/t/pretrain-roberta-large-from-scratch-in-swedish
[2] https://discuss.huggingface.co/t/pretrain-gpt2-from-scratch-in-swedish
[3] https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-norwegian
[4] The Swedish Culturomics Gigaword Corpus (Språkbanken Text)
[5] https://gigaword.dk/
[6] The Icelandic Gigaword Corpus

19 Likes

This makes sense to me.

3 Likes

Great idea! A lot of people in the Scandinavian countries would benefit from this one

1 Like

Fantastic idea! I’d love to help if I may!

2 Likes

Amazing idea!

Since the languages share linguistic features such a model could provide much better language processing across the Scandinavian countries.

Furthermore, it would add a language model to the otherwise currently scarce number of models with Scandinavian language capabilities.

This needs to happen, and I would love to be a part of this project!

Awesome initiative @saattrupdan :hugs:

2 Likes

Great idea! This will be super helpful for Scandinavian SMEs which often operate across Nordic country borders and want to support multiple Scandinavian languages.

2 Likes

Great idea. There are a few other dataset resources as well, like the Wikipedia dumps and some Reddit/Twitter datasets, that might increase the variety/quality of the data. I do however think training a large RoBERTa on a v3-8 is not realistic. Memory limitations will force the batch size to be too small, even on the 128-length sequences. Using a base-model architecture will probably lead to a better result.

I will be interested in contributing to this one.

2 Likes

Awesome - lots of details and links already! Finalizing this project :slight_smile:

2 Likes

This would be very useful!

2 Likes

This sounds like a great project! Love it!

And the tokenizer is ByteLevelBPETokenizer and not WordPiece so I have very little to add =D

As I understand byte-pair encoding, this is clever because we have very common spelling substitutions between the languages, e.g. s/c, æ/ä, ø/ö, and å/aa.
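To make that concrete, here is a minimal sketch (my own illustration, not part of the project) of what a byte-level tokenizer actually sees for those character pairs: each letter is just a distinct UTF-8 byte sequence, and frequent cross-language correspondences can then be learned as merges from the data.

```python
# Byte-level BPE operates on raw UTF-8 bytes, so Swedish "ä" and Danish "æ"
# are simply different two-byte sequences; no special handling is needed.
pairs = {"ä": "æ", "ö": "ø", "å": "aa"}
for sv, da in pairs.items():
    print(sv, list(sv.encode("utf-8")), "|", da, list(da.encode("utf-8")))
```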

With regard to model size there are obviously some benefits from multiples of 8 (or 32), but could you maybe also take graphics cards' RAM sizes into account? Something that can sit comfortably in 8 GB and 12 GB would play pretty well with the current (30-series) and previous generation of Nvidia cards (too bad for my 2080 Ti with 11 GB).

And do you have anything planned to figure out if Icelandic will actually benefit from this? It might be too small a data set and too different a language?

4 Likes

This is a great idea. I wish I had heard about it earlier and had more time to prepare for it. I wish you all the best of luck.

1 Like

I am joining this project but am also initiating a project trying to use T5 for translating between Norwegian Bokmål and Norwegian Nynorsk (Model to translate between Norwegian Bokmål and Norwegian Nynorsk). There might be some overlap here, at least since we can use some of the same corpora. If anyone would also like to work on a Scandinavian seq2seq model, show your interest in that thread. (Unfortunately I do not think a Scandinavian parallel corpus exists…)

3 Likes

Hi everyone :wave:

As Patrick mentioned above, this project is going to happen! :tada:

I’ve set up a repository with a preliminary template here. This includes training scripts and such, and in the readme I’ve included some helpful links related to our project and to the community week in general.

Some links I found useful:

Can y’all confirm whether you’d like to participate during the week 7/7-14/7?
@MortenKP @Maltehb @pere @versae @Juunge @xxzyx @ThatsGroes @grofte @rasmuskr

3 Likes

Given the small amount of data available in Nynorsk, wouldn’t it make sense to instead finetune a pre-trained model for that task?

1 Like

Hi, I would also like to participate in this project :slight_smile:

1 Like

Hi Dan, I would LOVE to participate on this one :boom::rocket:

2 Likes

I would have loved to, but I’m too tied up at the moment

1 Like

Yes of course! We need this, and I think that it will be of huge benefit for the Scandinavian NLP community! Let’s go!

1 Like

I can confirm!

1 Like

Count me in. I’ll be leading another of the projects, so feel free to give my spot to anyone else interested if needed.

1 Like