GPT2 for Polish
Currently, there is no GPT2 model on the Hub that was trained from scratch for Polish. The goal of this project is to create a strong language generation model for Polish.
Model
A randomly initialized GPT2 model
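As a minimal sketch of this step (assuming the Flax GPT2 implementation in the transformers library), the model can be created with random weights directly from a GPT2Config rather than from a pretrained checkpoint:

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# GPT2-small hyperparameters; vocab_size must match the tokenizer that
# will be trained on the Polish corpus (50257 is just the GPT2 default).
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Building the model from a config (instead of from_pretrained) gives
# randomly initialized weights; the seed makes the init reproducible.
model = FlaxGPT2LMHeadModel(config, seed=42)
```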
Datasets
The pre-training corpora that were used to train Polish BERT by Darek Kłeczek:
Uncased
- Polish subset of Open Subtitles
- Polish subset of ParaCrawl
- Polish Parliamentary Corpus
- Polish Wikipedia
Cased
- Polish subset of Open Subtitles
- Polish subset of ParaCrawl
- Polish Parliamentary Corpus
- Polish Wikipedia - Feb 2020
One can also make use of OSCAR; the dataset is available through the datasets library - Datasets at Hugging Face.
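As a sketch of this alternative, the Polish portion of OSCAR can be loaded through the datasets library; the unshuffled_deduplicated_pl config name follows the library's OSCAR naming scheme, and streaming avoids downloading the full corpus up front:

```python
from datasets import load_dataset

# Deduplicated Polish subset of OSCAR; streaming=True iterates over the
# corpus without materializing it on disk first.
oscar_pl = load_dataset(
    "oscar",
    "unshuffled_deduplicated_pl",
    split="train",
    streaming=True,
)

# Peek at a few documents.
for example in oscar_pl.take(3):
    print(example["text"][:100])
```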
Available training scripts
A causal language modeling script for Flax is available here. It can be used essentially without any code changes.
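Since the model is trained from scratch, a byte-level BPE tokenizer for Polish is also needed before the script can run. A minimal sketch with the tokenizers library, assuming the corpus has been exported to plain-text files (the file names and vocabulary size below are placeholders):

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a GPT2-style byte-level BPE tokenizer on the Polish corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_pl.txt", "wikipedia_pl.txt"],  # placeholder file names
    vocab_size=50257,                 # mirrors the original GPT2 vocab size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# save_model writes vocab.json and merges.txt, which can then serve as
# the tokenizer files for the Flax causal language modeling script.
os.makedirs("gpt2-polish", exist_ok=True)
tokenizer.save_model("gpt2-polish")
```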
(Optional) Desired project outcome
The desired project output is a GPT2 model that is able to generate fluent Polish text. A nice generation demo can be created for this.
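A sketch of such a demo with the text-generation pipeline; the Hub id is a placeholder, and the Flax weights are assumed to have been converted to PyTorch, since pipelines default to the PyTorch backend:

```python
from transformers import pipeline

# "your-username/gpt2-polish" is a placeholder; substitute the Hub id
# of the trained checkpoint.
generator = pipeline("text-generation", model="your-username/gpt2-polish")

output = generator(
    "Dawno, dawno temu",  # "Once upon a time" in Polish
    max_length=50,
    do_sample=True,
    top_k=50,
)
print(output[0]["generated_text"])
```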
(Optional) Challenges
It is possible that there is not enough data for the model to perform reasonably well on text generation. In this case, one would have to look at other datasets as well, like mc4.
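For reference, a hedged sketch of pulling the Polish split of mc4 through the datasets library; the "pl" config name is an assumption based on the dataset's per-language configs:

```python
from datasets import load_dataset

# Polish split of mc4, streamed because the full split is very large.
# The "pl" config name is an assumption (per-language configs).
mc4_pl = load_dataset("mc4", "pl", split="train", streaming=True)
print(next(iter(mc4_pl))["text"][:100])
```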
(Optional) Links to read upon
The most important read would be the following Colab:
- Polish BERT by Darek Kłeczek