Rather than training separate models for Swedish, Norwegian, Danish and Icelandic [1][2][3], we could probably produce a better model by pretraining a single Scandinavian model and then finetuning it to one of the four, considering how similar the four languages are (in their written form, that is; cough cough, Danish).
We can train a RoBERTa-large model on the combined mC4 dataset, containing 386 GB of uncompressed text (179 GB Swedish, 107 GB Danish and 100 GB Norwegian). Furthermore, there are gigaword datasets in Swedish, Danish and Icelandic that we could use. As suggested in [1], we could start training the model with a sequence length of 128, then 256 and lastly 512.
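As a rough sketch of how that 128 → 256 → 512 curriculum could look on the data side (tokenize once, then concatenate and re-chunk the token stream into longer blocks per phase). The function name and the phase split below are illustrative assumptions, not taken from any existing training script:

```python
# Sketch of the 128 -> 256 -> 512 curriculum: tokenize the corpus once, then
# regroup the token stream into longer fixed-size blocks for each phase.
# The phase fractions are an assumption, not a recommendation from [1].
from itertools import chain

def group_texts(examples, block_size):
    """Concatenate already-tokenized examples and split them into fixed-size blocks."""
    concatenated = list(chain.from_iterable(examples["input_ids"]))
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    return {"input_ids": blocks}

# (sequence length, fraction of total training steps) -- illustrative only
phases = [(128, 0.7), (256, 0.2), (512, 0.1)]

# Tiny demo with fake token ids.
demo = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9]]}
print(group_texts(demo, block_size=4))  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

With the `datasets` library this would typically be applied per phase via `dataset.map(..., batched=True)` before handing the blocks to the masked-LM training loop.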
2. Language
The model will be trained in Swedish, Danish, Norwegian and Icelandic.
3. Model
RoBERTa-large
4. Datasets
Swedish:
mC4 (179 GB)
Gigaword [4] (~9 GB compressed)
Danish:
mC4 (107 GB)
Gigaword [5] (~2 GB compressed)
Norwegian:
mC4 (100 GB)
Icelandic:
mC4 (9 GB)
Gigaword [6] (~14 GB compressed)
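As a sketch of how the mC4 portions listed above could be combined with the `datasets` library (streaming, so the full 386 GB is not downloaded up front). The config names and the size-proportional sampling scheme are assumptions to verify, and the mC4 data has since also been published under `allenai/c4`:

```python
# Sketch: stream the four mC4 language configs and interleave them, sampling
# roughly in proportion to corpus size. Config names ("sv", "da", "no", "is")
# and the proportional sampling are assumptions, not a fixed recipe.
from datasets import load_dataset, interleave_datasets

sizes_gb = {"sv": 179, "da": 107, "no": 100, "is": 9}
total = sum(sizes_gb.values())

streams = [load_dataset("mc4", lang, split="train", streaming=True) for lang in sizes_gb]
probabilities = [size / total for size in sizes_gb.values()]

combined = interleave_datasets(streams, probabilities=probabilities, seed=42)
print(next(iter(combined))["text"][:200])
```

The gigaword corpora [4][5][6] could later be mixed in the same way once they are available as text files or dataset scripts.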
5. Training scripts
There are already Flax scripts to pre-train RoBERTa that we can easily use.
Great idea! This will be super helpful for Scandinavian SMEs which often operate across Nordic country borders and want to support multiple Scandinavian languages.
Great idea. There are a few other dataset resources as well here, like the Wikipedia dumps and some Reddit/Twitter datasets that might increase the variety/quality of the data. I do however think training a large RoBERTa on a v3-8 is not realistic: memory limitations will force the batch size to be too small, even at sequence length 128. Using a base-model architecture will probably lead to a better result.
And the tokenizer is ByteLevelBPETokenizer and not WordPiece so I have very little to add =D
As I understand byte-pair encoding, this is clever because we have very common spelling substitutions between the languages, e.g. s/c, æ/ä, ø/ö, and å/aa.
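For what it's worth, a minimal sketch of how the shared byte-level BPE vocabulary could be trained with the `tokenizers` library; the file paths are placeholder per-language text dumps, and the vocabulary size simply mirrors the original RoBERTa value:

```python
# Train one shared byte-level BPE vocabulary across all four languages.
# File paths are hypothetical; 50,265 matches the vocabulary size of the
# original RoBERTa models.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["sv.txt", "da.txt", "no.txt", "is.txt"],
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("scandinavian-roberta-tokenizer", exist_ok=True)
tokenizer.save_model("scandinavian-roberta-tokenizer")
```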
With regard to model size there are obviously some benefits from multiples of 8 (or 32), but could you maybe also take graphics card RAM sizes into account? Something that can sit comfortably in 8 GB and 12 GB would play pretty well with the current (30-series) and previous generation of Nvidia cards (too bad for my 2080 Ti with 11 GB).
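As a back-of-the-envelope check (parameter counts are the commonly quoted ~125M for base and ~355M for large; everything else is a rough assumption, and activation memory comes on top):

```python
# Rough memory footprint of the two RoBERTa sizes: fp32 weights alone
# (roughly what inference needs) vs. weights + gradients + Adam moments
# (a lower bound for full finetuning). Activations are not included.
def gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

for name, n_params in [("roberta-base", 125_000_000), ("roberta-large", 355_000_000)]:
    print(f"{name}: {gib(n_params, 4):.2f} GiB fp32 weights, "
          f"~{gib(n_params, 16):.2f} GiB with Adam state")
```

So the large model already eats a big chunk of an 8 GB card once optimizer state and activations are counted, while the base model leaves plenty of headroom.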
And do you have anything planned to figure out if Icelandic will actually benefit from this? It might be too small a data set and too different a language?
I am joining this project, but I am also initiating a project trying to use T5 for translating between Norwegian Bokmål and Norwegian Nynorsk (Model to translate between Norwegian Bokmål and Norwegian Nynorsk). There might be some overlap here, at least since we can use some of the same corpora. If anyone also wants to work on a Scandinavian seq2seq model, show your interest in that thread. (Unfortunately I do not think a Scandinavian parallel corpus exists…)
As Patrick mentioned above, this project is going to happen!
I’ve set up a repository with a preliminary template here. This includes training scripts and such, and in the readme I’ve included some helpful links related to our project and to the community week in general.
I intended to include Faroese, but the data is just so scarce. I mean, we might be able to include it, but it seems like there is less than 1 GB of data, so I'm guessing it won't perform that well. But we could try! Considering the similarity between Faroese and Icelandic (from what I've heard), the model might be decent for Faroese as well.
We’re thinking about meeting up (virtually) to say hi to each other and also do some initial planning: what needs to be done, who wants to do what, and so on. What times are you available?