PreTrain GPT2 from scratch in Polish

GPT2 for Polish

Currently, there is no GPT2 model on the Hub that was trained from scratch for Polish. For this project, the goal is to create a strong language generation model for Polish.

Model

A randomly initialized GPT2 model
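As a rough sketch, a randomly initialized Flax GPT2 model can be created directly from a config; the hyperparameters below are placeholders, and the actual vocabulary size would depend on the tokenizer trained for Polish:

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# Placeholder hyperparameters; vocab_size in particular depends on the
# Polish tokenizer trained for this project.
config = GPT2Config(
    vocab_size=50_257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Constructing the Flax model from a config gives randomly initialized weights.
model = FlaxGPT2LMHeadModel(config, seed=0)
```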

Datasets

Pre-training corpora that were used to train Polish BERT by Darek Kłeczek.

Uncased
Polish subset of Open Subtitles
Polish subset of ParaCrawl
Polish Parliamentary Corpus
Polish Wikipedia

Cased
Polish subset of Open Subtitles
Polish subset of ParaCrawl
Polish Parliamentary Corpus
Polish Wikipedia - Feb 2020

One can also make use of OSCAR; the dataset is available through the datasets library (Datasets at Hugging Face).
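For illustration, the Polish portion of OSCAR can be loaded with the datasets library; the config name below refers to the unshuffled, deduplicated Polish split, and streaming is used so the full corpus does not have to be downloaded up front:

```python
from datasets import load_dataset

# Polish subset of OSCAR; streaming avoids downloading the whole corpus at once.
oscar_pl = load_dataset(
    "oscar", "unshuffled_deduplicated_pl", split="train", streaming=True
)

# Peek at a few documents.
for example in oscar_pl.take(3):
    print(example["text"][:200])
```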

Available training scripts

A causal language modeling script for Flax is available here. It can be used pretty much as-is, without any required code changes.
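To illustrate what the causal language modeling setup does with the raw text, here is a minimal sketch of the usual tokenize-and-group preprocessing; the tokenizer and block size are placeholders, not the script's exact defaults:

```python
from transformers import AutoTokenizer

# Placeholder: a tokenizer trained on the Polish corpus would be used here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

block_size = 512  # placeholder context length


def group_texts(texts):
    # Tokenize everything, concatenate, and cut into fixed-size blocks,
    # which is the standard causal LM preprocessing.
    ids = []
    for t in texts:
        ids.extend(tokenizer(t)["input_ids"])
    total = (len(ids) // block_size) * block_size
    return [ids[i:i + block_size] for i in range(0, total, block_size)]


blocks = group_texts(["Przykładowe zdanie po polsku.", "Kolejny fragment tekstu."])
```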

(Optional) Desired project outcome

The desired project output is a GPT2 model that is able to generate Polish text. A nice generation demo can be created for this.
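A generation demo could look roughly like the snippet below; the model identifier is a hypothetical placeholder for wherever the trained checkpoint ends up on the Hub:

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

# Hypothetical checkpoint name; replace with the actual Hub repo once trained.
model_name = "your-username/gpt2-polish"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = FlaxGPT2LMHeadModel.from_pretrained(model_name)

inputs = tokenizer("Dawno, dawno temu", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"], max_length=50, do_sample=True, top_k=50
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```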

(Optional) Challenges

It might be possible that there is not enough data for the model to perform reasonably well on text generation. In that case, one would have to look at other datasets as well, such as mC4.
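If more data is needed, the Polish portion of mC4 can also be pulled in via the datasets library; a sketch, assuming the `mc4` dataset with the `pl` config, with streaming since the corpus is very large:

```python
from datasets import load_dataset

# Polish split of mC4; streaming avoids materializing the full corpus on disk.
mc4_pl = load_dataset("mc4", "pl", split="train", streaming=True)

for example in mc4_pl.take(2):
    print(example["text"][:200])
```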

(Optional) Links to read upon

The most important read would be the following Colab notebook:


Noting this down since Darek will be joining the project!


I’m in! One thing that pushed me into ML a few years ago was a wish to generate a Polish nursery rhyme for my daughter; I’m hoping we can achieve that with this one :smiley:
