RoBERTa/BERT for Norwegian
Currently, there are only very few BERT-like models for Norwegian on the Hugging Face Hub. For this project, the goal is to create a RoBERTa/BERT model for just the Norwegian language.
Model
A randomly initialized RoBERTa/BERT model
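As a sketch of what "randomly initialized" means in practice, the model could be instantiated from a fresh config rather than from pretrained weights. The config values below are illustrative roberta-base-like defaults, not prescribed by this project:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Illustrative roberta-base-like configuration; vocab_size must match
# whatever tokenizer is trained for Norwegian.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
)

# Instantiating from a config (instead of via from_pretrained) yields
# randomly initialized weights.
model = FlaxRobertaForMaskedLM(config, seed=42)
```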
Datasets
One can make use of the OSCAR dataset. It is also available through the datasets
library here: oscar · Datasets at Hugging Face.
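For instance, loading the Norwegian subset through the datasets library could look like the following minimal sketch (assuming the `unshuffled_deduplicated_no` config name for Norwegian Bokmål):

```python
from datasets import load_dataset

# "unshuffled_deduplicated_no" is the Norwegian (Bokmål) OSCAR config;
# there is also "unshuffled_deduplicated_nn" for Nynorsk.
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

print(dataset)                    # number of rows and column names
print(dataset[0]["text"][:200])   # peek at the first document
```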
Available training scripts
A masked language modeling script for Flax is available here. It can be used essentially without any code changes.
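To make concrete what the script optimizes, here is a minimal sketch of the standard BERT/RoBERTa masking scheme: roughly 15% of tokens are selected as prediction targets, and of those, 80% are replaced with the mask token, 10% with a random token, and 10% kept unchanged. This illustrates the objective only; it is not the script's exact data collator:

```python
import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size, special_tokens_mask,
                mlm_probability=0.15, rng=None):
    """BERT-style masking: select ~15% of non-special tokens as targets;
    replace 80% of them with [MASK], 10% with a random token, keep 10%."""
    rng = rng or np.random.default_rng(0)
    input_ids = input_ids.copy()
    labels = input_ids.copy()

    # Choose which positions become prediction targets (never special tokens).
    targets = rng.random(labels.shape) < mlm_probability
    targets &= ~special_tokens_mask.astype(bool)
    labels[~targets] = -100  # loss is only computed on target positions

    # 80% of targets -> [MASK]
    mask_replace = targets & (rng.random(labels.shape) < 0.8)
    input_ids[mask_replace] = mask_token_id

    # Half of the remaining 20% -> random token; the rest stay unchanged.
    rand_replace = targets & ~mask_replace & (rng.random(labels.shape) < 0.5)
    input_ids[rand_replace] = rng.integers(vocab_size, size=labels.shape)[rand_replace]
    return input_ids, labels

# Hypothetical token ids, with <s> and </s> marked as special tokens.
ids = np.array([[0, 11, 12, 13, 2]])
special = np.array([[1, 0, 0, 0, 1]])
masked_ids, labels = mask_tokens(ids, mask_token_id=4, vocab_size=50265,
                                 special_tokens_mask=special)
```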
(Optional) Desired project outcome
The desired project outcome is a strong RoBERTa/BERT model for Norwegian.
(Optional) Challenges
The OSCAR dataset might be too small (it has < 5GB of data for Norwegian). Also, it might be important
to find Norwegian datasets on which the BERT-like model can be evaluated after pretraining. Having found a dataset to fine-tune the pretrained BERT-like model on, one can make use of the text-classification script here.
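Once pretraining is done, loading the checkpoint with a freshly initialized classification head could look like the sketch below. The local checkpoint path, label count, and example sentence are hypothetical placeholders:

```python
from transformers import AutoTokenizer, FlaxRobertaForSequenceClassification

# Hypothetical local directory produced by the pretraining run.
checkpoint = "./norwegian-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The pretrained encoder weights are reused; the classification head
# is newly initialized and must be fine-tuned on labeled data.
model = FlaxRobertaForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2
)

inputs = tokenizer("Dette er en fin dag.", return_tensors="np")
logits = model(**inputs).logits
```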
(Optional) Links to read upon
The most important read would be the following colab: