PreTrain GPT2 from scratch in Russian

GPT2 for Russian

Currently, there is no GPT2 model that was trained from scratch for Russian on the Hugging Face Hub. For this project, the goal is to create a strong language generation model for Russian.

Model

A randomly initialized GPT2 model
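
As a rough sketch, a randomly initialized Flax GPT2 model can be created from a fresh config along these lines (the hyperparameters shown are the standard GPT2-base values, not a project decision):

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# standard GPT2-base hyperparameters; vocab_size should match the
# tokenizer that will be trained on the Russian corpus
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# instantiating from a config (instead of from_pretrained) gives
# randomly initialized weights
model = FlaxGPT2LMHeadModel(config, seed=0)
```

Note that a byte-level BPE tokenizer also has to be trained on the Russian corpus before pretraining can start.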

Datasets

One can make use of OSCAR. The dataset is also available through the `datasets` library here: oscar · Datasets at Hugging Face.
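
For illustration, the Russian subset can be loaded through `datasets` roughly like this (`unshuffled_deduplicated_ru` is the Russian configuration of the OSCAR dataset on the Hub):

```python
from datasets import load_dataset

# the Russian portion of OSCAR; note that this downloads the full subset to disk
dataset = load_dataset("oscar", "unshuffled_deduplicated_ru", split="train")
print(dataset[0]["text"][:200])
```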

Available training scripts

A causal language modeling script for Flax is available here. It can be used pretty much without any required code changes.

(Optional) Desired project outcome

The desired project outcome is a GPT2 model that can generate coherent Russian text. A nice generation demo can be created for this.
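
A minimal generation sketch could look as follows; the repo id `username/gpt2-base-russian` is a placeholder for wherever the trained checkpoint ends up:

```python
from transformers import FlaxGPT2LMHeadModel, GPT2TokenizerFast

# placeholder repo id for the trained Russian checkpoint
model = FlaxGPT2LMHeadModel.from_pretrained("username/gpt2-base-russian")
tokenizer = GPT2TokenizerFast.from_pretrained("username/gpt2-base-russian")

inputs = tokenizer("Москва - это", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```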

(Optional) Challenges

The dataset on OSCAR is very large: > 300GB. One might want to explore dataset streaming techniques here; see the sketch below. Dataset streaming will be merged into `datasets` in a couple of days. See PR here and the docs here.
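
As a sketch, once streaming support is available, the corpus could be iterated lazily instead of being downloaded in full:

```python
from datasets import load_dataset

# streaming=True yields examples on the fly instead of downloading >300GB
dataset = load_dataset(
    "oscar", "unshuffled_deduplicated_ru", split="train", streaming=True
)

# inspect a single example
sample = next(iter(dataset))
print(sample["text"][:200])
```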

(Optional) Links to read upon

The most important read would be the following colab:

I’m interested.
