Model to translate between Norwegian Bokmål and Norwegian Nynorsk

Norwegian has two different written standards, Bokmål and Nynorsk. Some argue that they should be considered two different languages, albeit with extensive lexical as well as grammatical overlap. In many cases a word-by-word translation suffices, but there are, for instance, some auxiliary and voice constructions that differ, in addition to differing pronoun systems. Students and civil servants are expected to be fluent in both standards, and as a Norwegian citizen you have the right to receive public documents in your preferred variant. Nynorsk is by far the minority variant, used in less than 20% of cases.

Translation between Bokmål and Nynorsk seems like low-hanging fruit as machine translation tasks go, especially since the loss can be calculated more effectively. Even not translating at all will probably give you a BLEU score in the range of 20–30 because of the overlap.
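To make the overlap claim concrete, here is a toy illustration in pure Python: even the identity "translation" (copying the Bokmål source unchanged) matches most tokens of the Nynorsk reference. The sentence pair is an invented example, not from the corpus, and this only computes clipped unigram precision, not full BLEU.

```python
from collections import Counter

def unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that also occur in the reference
    (clipped counts, as in the unigram component of BLEU)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum(min(n, ref[tok]) for tok, n in hyp.items())
    return matches / sum(hyp.values())

bokmal = "Dette er en god bok"   # source (Bokmål)
nynorsk = "Dette er ei god bok"  # reference (Nynorsk)

# "Translating" by copying the source still matches 4 of 5 tokens.
print(unigram_precision(bokmal, nynorsk))  # 0.8
```

With this much lexical overlap on a typical sentence, a do-nothing baseline already scores well, which is why any real model has a comfortable starting point.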

There are no recent machine translation models attempting this task, most likely due to the lack of available corpora. We are, however, able to make such a dataset available for this workshop.

Language

We aim to use as much of the available Flax code as possible.

Model

To be decided by the team. To our knowledge, BERT-infused NMT models, T5-like models, and BART could all be suitable for this task. The model choice may also depend on which modules are available in Flax and how well they perform.
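If a T5-like model is chosen, the task would naturally be framed text-to-text. A minimal sketch of that framing, where the task prefix string is our own assumption (nothing standardized exists for this language pair):

```python
def make_example(bokmal: str, nynorsk: str) -> dict:
    """Format one parallel sentence pair as a T5-style text-to-text
    example. The prefix is a hypothetical choice for this task."""
    return {
        "input": "oversett til nynorsk: " + bokmal,
        "target": nynorsk,
    }

ex = make_example("Dette er en god bok", "Dette er ei god bok")
print(ex["input"])  # oversett til nynorsk: Dette er en god bok
```

The same data could be fed to a BART-style model without the prefix, so this choice does not lock in the architecture.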

Datasets

For this workshop we are able to make publicly available two pre-prepared datasets, collected through other projects:

  • A balanced (5 GB + 5 GB = 10 GB) corpus in Bokmål and Nynorsk, collected from various sources: MC4, OSCAR, Wikipedia, public documents, and PDF scans. Even more Bokmål text is available, but we feel a balanced set is best for this task.
  • A high-quality parallel corpus of 100,000 sentence pairs in Bokmål and Nynorsk.
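The parallel corpus could be consumed as simple tab-separated sentence pairs. The format, field order, and sample lines below are assumptions for illustration; the released files may differ.

```python
import csv
import io

# Hypothetical format: one "Bokmål<TAB>Nynorsk" pair per line.
# The two sample sentences are invented, not taken from the corpus.
sample = io.StringIO(
    "Dette er en god bok\tDette er ei god bok\n"
    "Hva heter du?\tKva heiter du?\n"
)

def load_pairs(fh):
    """Read (bokmål, nynorsk) tuples from a tab-separated file handle."""
    return [(row[0], row[1]) for row in csv.reader(fh, delimiter="\t")]

pairs = load_pairs(sample)
print(len(pairs))  # 2
```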

The sets are already thoroughly cleaned, and we will make sure they are ready to import by the start of the workshop.

Training scripts

The goal is to use the current scripts as much as possible due to the short timespan.

Challenges

The first challenge is to find a model architecture that can give a reasonable result in this timeframe. It is, however, reasonable to assume that this will be a much easier task than translating from, for instance, English to German.

For the project to be realistic, we need team members with previous experience in machine translation. The proposers have solid experience training large language models, but lack experience with translation. Since the timeframe for the workshop is short, we should pick suitable models that are already implemented in Flax.

No knowledge of Norwegian is necessary.

Desired project outcome

This is a tech demo, hoping to show that it is doable to build a machine translation model, trained mainly without supervision, with Flax in a week.


This is an interesting idea, but I wonder whether we have enough data to try simple fine-tuning first.

We might get a decent result here by just fine-tuning mBART on the Bokmål–Nynorsk parallel corpus. Unfortunately, Nynorsk is a minority language (really a minority variant of a minority language).

This is also a good starting point for tuning translation models for different registers. I’d like to contribute!


According to today's presentations, pretraining scripts for BART are not yet available. A T5 model would probably do fine here, and both pretraining and fine-tuning scripts are available.


Awesome, let's finalize this project then :slight_smile:
