PreTrain GPT2 from scratch in Swedish

GPT2 for Swedish

Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Swedish. For this project, the goal is to create a strong language generation model for Swedish.

Model

A randomly initialized GPT2 model
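
For concreteness, here is a minimal sketch of how such a model could be instantiated with the Flax classes in transformers. The config values below (vocabulary size, context length, depth) are placeholders rather than project decisions, and the output directory name is hypothetical:

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# Placeholder hyperparameters -- vocab_size should match the tokenizer
# that gets trained on the Swedish data.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Instantiating a Flax model from a config alone yields randomly initialized weights.
model = FlaxGPT2LMHeadModel(config, seed=42)
model.save_pretrained("gpt2-swedish")  # hypothetical output directory
```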

Datasets

One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.
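
As a sketch, the Swedish part of OSCAR can be loaded directly with the datasets library; the config name below refers to the deduplicated Swedish subset, and another subset could be chosen instead:

```python
from datasets import load_dataset

# Deduplicated Swedish subset of OSCAR.
dataset = load_dataset("oscar", "unshuffled_deduplicated_sv", split="train")

print(dataset)                    # number of documents and column names
print(dataset[0]["text"][:200])   # peek at the first document
```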

Available training scripts

A causal language modeling script for Flax is available here. It can be used pretty much without any required code changes.
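
Before running the script, a config and a tokenizer trained on the Swedish data need to be in place. A rough sketch of training a byte-level BPE tokenizer is shown below; the settings (vocabulary size, special tokens, file name) are assumptions and should be adapted to the project:

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on Swedish OSCAR (settings are placeholders).
dataset = load_dataset("oscar", "unshuffled_deduplicated_sv", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("tokenizer.json")  # hypothetical output file
```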

(Optional) Desired project outcome

The desired project output is a GPT2 model that is able to generate Swedish text. A nice generation demo can be created for this.
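
As a sketch of what such a demo could look like, assuming the trained checkpoint has been saved under the placeholder name "gpt2-swedish":

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

# "gpt2-swedish" is a placeholder for wherever the trained checkpoint lives.
tokenizer = AutoTokenizer.from_pretrained("gpt2-swedish")
model = FlaxGPT2LMHeadModel.from_pretrained("gpt2-swedish")

inputs = tokenizer("Det var en gång", return_tensors="np")  # "Once upon a time"
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,
    top_k=50,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```

A Streamlit or Gradio app wrapping a call like this would make the demo easy to share.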

(Optional) Challenges

It might be possible that there is not enough data for the model to perform reasonably well on text generation. In this case, one would have to look at other datasets as well, like mc4.

(Optional) Links to read upon

The most important read would be the following colab:


This sounds like a great project.

Just for reference, there is a Swedish generative model based on Flashback data.

Hi! Considering the similarity between the Scandinavian languages, I suggest we might achieve higher performance by utilising data from all of the languages. Just a suggestion. I made a project proposal here: Scandinavian RoBERTa. It uses RoBERTa rather than GPT-2, but I’m not too fussed about the model architecture, to be honest.


Cool, finalizing this project :slight_smile:

I think we can have both a Scandinavian and a Swedish GPT2 project. It would be great to find some more people working on GPT2 in Swedish as well :slight_smile:

I’ve just put down @birgermoell and @Gabriel for now