GPT2 for Russian
Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Russian. For this project, the goal is to create a strong language generation model for Russian.
Model
A randomly initialized GPT2 model
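As a starting point, one could instantiate the model directly from a config, which gives random weights. A minimal sketch (the config values below are the standard GPT2-base sizes; the vocab_size is a placeholder and would have to match the tokenizer trained for Russian):

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_257,  # placeholder: set to the Russian tokenizer's vocabulary size
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Passing a config (rather than calling from_pretrained) initializes the
# weights randomly, which is what "training from scratch" requires.
model = FlaxGPT2LMHeadModel(config, seed=0)
```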
Datasets
One can make use of OSCAR; the dataset is also available through the datasets
library here: oscar · Datasets at Hugging Face.
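For example, the Russian portion could be loaded like this (a minimal sketch; `unshuffled_deduplicated_ru` is the deduplicated Russian config of the oscar dataset):

```python
from datasets import load_dataset

# Loads the deduplicated Russian split of OSCAR.
oscar_ru = load_dataset("oscar", "unshuffled_deduplicated_ru", split="train")

print(oscar_ru[0]["text"][:200])  # each example carries a single "text" field
```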
Available training scripts
A causal language modeling script for Flax is available here. It can be used essentially without any code changes.
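Conceptually, such a script tokenizes the raw text and then concatenates and chunks it into fixed-size blocks. A rough sketch of that grouping step (the names `group_texts` and `block_size` are illustrative, not the script's exact internals):

```python
def group_texts(examples, block_size=1024):
    # Concatenate all tokenized examples, then cut them into block_size chunks.
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // block_size) * block_size
    input_ids = [
        concatenated[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    # For causal LM, the labels are the inputs themselves
    # (the shift by one position happens inside the model).
    return {"input_ids": input_ids, "labels": [ids[:] for ids in input_ids]}
```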
(Optional) Desired project outcome
The desired project output is a GPT2 model that is able to generate coherent Russian text. A nice generation demo can be created for this.
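A generation demo could look roughly like this (a sketch assuming the trained model and tokenizer were saved to a local directory; the path is hypothetical):

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_dir = "./gpt2-russian"  # hypothetical output directory of the training run
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = FlaxGPT2LMHeadModel.from_pretrained(model_dir)

# Sample a continuation for a Russian prompt ("Moscow is").
inputs = tokenizer("Москва - это", return_tensors="np")
outputs = model.generate(inputs["input_ids"], max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```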
(Optional) Challenges
The dataset on OSCAR is very large: > 300GB. One might therefore want to explore dataset streaming techniques here, as sketched below. Dataset streaming will be merged into datasets
in a couple of days. See the PR here and the docs here.
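With streaming enabled, `load_dataset` returns an iterable dataset that downloads and yields examples on the fly, so the full corpus never has to be stored on disk. A minimal sketch:

```python
from datasets import load_dataset

# streaming=True avoids downloading the full > 300GB corpus up front.
oscar_stream = load_dataset(
    "oscar", "unshuffled_deduplicated_ru", split="train", streaming=True
)

for i, example in enumerate(oscar_stream):
    print(example["text"][:100])
    if i == 2:  # just peek at a few examples
        break
```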
(Optional) Links to read upon
The most important read would be the following Colab: