Pretrain GPT2 from scratch in Romanian

Train a Romanian GPT2 on a 24 GB Romanian dataset

2. Language

Currently, there is no GPT2 model on the hub that was trained from scratch for Romanian.
For this project, the goal is to create a strong language generation model for Romanian.

3. Model

A randomly initialized GPT2 model
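
To give a rough sense of scale for a randomly initialized model, the sketch below counts the trainable parameters of a GPT2-style decoder. The proposal does not state a model size or vocabulary, so the values used here (the original GPT2 small hyperparameters) are assumptions for illustration only, and `gpt2_param_count` is a hypothetical helper, not part of any library.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count trainable parameters of a GPT2-style decoder.

    Assumes the output projection shares weights with the token embedding.
    """
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each layer norm has a scale and a bias vector.
    layer_norm = 2 * n_embd
    # Attention: fused query/key/value projection, then output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward block: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # A final layer norm follows the last block.
    return embeddings + n_layer * per_layer + layer_norm


# Assumed values: GPT2 small defaults, not taken from this proposal.
small = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{small:,}")
```

A tokenizer trained on Romanian text would likely use a different vocabulary size, which changes only the embedding term.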

4. Datasets

We built a dataset from OSCAR, text from books, and web-crawled data.
The entire dataset is 24 GB.

5. Training scripts

A causal language modeling script for Flax is available here:
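
For orientation, the objective that a causal language modeling script optimizes can be sketched in plain Python. This is a minimal illustration and not the Flax script itself: `causal_lm_loss` is a hypothetical helper, and the toy vocabulary and logits are made up. Each position is scored against the token that follows it.

```python
import math


def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy for a single sequence.

    logits[t] holds unnormalized scores over the vocabulary at position t,
    and position t is scored against token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        target = token_ids[t + 1]
        # Log of the softmax normalizer, with max subtraction for stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the true next token.
        total += log_z - scores[target]
    return total / (len(token_ids) - 1)


# Toy example (assumed data): vocabulary of 3 tokens, sequence of 3 tokens.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0] for _ in token_ids]
loss = causal_lm_loss(uniform_logits, token_ids)
```

With uniform scores the loss equals the log of the vocabulary size, which is the value a randomly initialized model is expected to start near before training.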

7. (Optional) Desired project outcome

There are currently no causal language models for Romanian. This would be the first.

8. (Optional) Reads

The following links can be useful to better understand the project and
what has previously been done.


This is interesting, as it would be the first Romanian model that can be used for text generation. I highly support this project!


I highly encourage this project as the current multi-lingual text generation models do not work too well for Romanian.


This would be really helpful for my team.

Let’s define the project here, mihaitensor! :slight_smile: I’ll add you to the team dumitrescustefan, avramandrei, andreidima – please leave a comment in the Google sheet to be removed if necessary.

Thank you for adding the team, Suzana :smiley:
