Pretrain RoBERTa from scratch in Norwegian

RoBERTa/BERT for Norwegian

Currently, there is only a very limited number of BERT-like models for Norwegian on the Hugging Face Hub. For this project, the goal is to create a RoBERTa/BERT model for just the Norwegian language.


A randomly initialized RoBERTa/BERT model
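A randomly initialized model here just means instantiating the architecture from a config rather than loading a pretrained checkpoint. A minimal sketch, assuming roberta-base-sized defaults (all hyperparameters below are standard values, not project decisions):

```python
from transformers import RobertaConfig

# roberta-base-sized configuration; the vocab_size is a placeholder and
# should match whatever tokenizer is trained on the Norwegian corpus.
config = RobertaConfig(
    vocab_size=50265,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
)

# Passing this config to FlaxRobertaForMaskedLM (or RobertaForMaskedLM for
# PyTorch) yields randomly initialized weights, i.e. no pretrained checkpoint:
#   model = FlaxRobertaForMaskedLM(config)
print(config.num_hidden_layers)
```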


One can make use of OSCAR; the dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.

Available training scripts

A masked language modeling script for Flax is available here. It can be used pretty much as-is, without any code changes.
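An invocation might look roughly like the following sketch. The flag names follow the conventions of the Hugging Face example scripts, but the exact set and the paths (./norwegian-roberta) are assumptions; check the script's --help before running:

```shell
# Pretrain a randomly initialized RoBERTa on Norwegian OSCAR with the Flax
# MLM example script; paths and hyperparameters are placeholders.
python run_mlm_flax.py \
  --model_type roberta \
  --config_name ./norwegian-roberta \
  --tokenizer_name ./norwegian-roberta \
  --dataset_name oscar \
  --dataset_config_name unshuffled_deduplicated_no \
  --max_seq_length 128 \
  --per_device_train_batch_size 64 \
  --learning_rate 3e-4 \
  --num_train_epochs 8 \
  --output_dir ./norwegian-roberta
```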

(Optional) Desired project outcome

The desired project output is a strong RoBERTa/BERT model in Norwegian.

(Optional) Challenges

The OSCAR dataset might be too small (it has < 5GB of data for Norwegian). It might also be important
to find datasets on which the BERT-like model can be evaluated in Norwegian after pretraining. Having found a dataset to fine-tune the pretrained BERT-like model on, one can make use of the text-classification script here.
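A fine-tuning run with the text-classification script could be sketched as below. This assumes the script version in use accepts a local checkpoint and a downstream dataset; the dataset-selection flags vary between script versions, so treat everything here as a placeholder to be checked against --help:

```shell
# Fine-tune the pretrained Norwegian checkpoint on a downstream
# classification task; all paths and hyperparameters are placeholders.
python run_flax_glue.py \
  --model_name_or_path ./norwegian-roberta \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir ./norwegian-roberta-finetuned
```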

(Optional) Links to read upon

The most important read would be the following colab:


I am also really interested in this one, or maybe even ALBERT. But the available dataset on OSCAR might not be enough. Using mC4 might add almost 100GB of clean-ish text. I have worked on Norwegian BERT-based models before, and we (the National Library of Norway) have a big corpus that, unfortunately, we are not allowed to share. We could maybe share parts of it to complement mC4/OSCAR, but not all of it. I'm not sure whether there is a way to train on a private corpus in this event; I guess not, but it's worth asking.

There are also datasets for extrinsic evaluation in Norwegian for POS, NER, and sentiment (the sentiment one, I feel, was kinda designed for word2vec-like models) from the UiO LTG group, another on party affiliation of political speeches that we built from the Talks of Norway, and a recent one on hate speech we could use. Not sure if evaluating on WikiANN could be an option.

It could make sense to pretrain a common Scandinavian model instead, as our languages are so similar. I’ve made a project suggestion here: Scandinavian RoBERTa.