Pretrain GPT2 from scratch in Romanian
1. Model description
Train a Romanian GPT2 on a 24 GB Romanian dataset. Currently, there is no GPT2 model on the hub that was trained from scratch for Romanian. The goal of this project is to create a strong language generation model for Romanian.
2. Language
The model will be trained in Romanian.
3. Model
A randomly initialized GPT2 model.
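As a rough sketch, such a model can be built in Flax directly from a config. The hyperparameters below are the GPT2-small sizes and are assumptions, since the post does not specify a model size; the vocabulary size would have to match the Romanian tokenizer.

```python
import jax
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# Assumed GPT2-small sizes; the post does not specify the model size.
config = GPT2Config(
    vocab_size=50257,   # assumption: should match the Romanian tokenizer's vocabulary
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Building from a config alone (no from_pretrained) yields random weights.
model = FlaxGPT2LMHeadModel(config, seed=0)
num_params = sum(p.size for p in jax.tree_util.tree_leaves(model.params))
print(f"{num_params:,} randomly initialized parameters")
```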
4. Datasets
We created a dataset from OSCAR, text from books, and web-crawled datasets. The entire dataset totals 24 GB.
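For reference, the OSCAR portion could be loaded with the datasets library roughly as follows. The deduplicated Romanian config is an assumption (the post only says "OSCAR"), and the book and web-crawled portions are not covered here.

```python
from datasets import load_dataset

# Load the Romanian split of OSCAR; the exact config used is an assumption.
oscar_ro = load_dataset("oscar", "unshuffled_deduplicated_ro", split="train")
print(oscar_ro)                    # document count and features
print(oscar_ro[0]["text"][:200])   # first 200 characters of the first document
```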
5. Training scripts
A causal language modeling script for Flax is available here:
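Since the model is trained from scratch, a Romanian tokenizer would also need to be trained before running the script. Below is a minimal sketch with the tokenizers library; the corpus path, vocabulary size, and special token are illustrative assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["ro_corpus.txt"],           # hypothetical path to the raw Romanian text
    vocab_size=50257,                  # assumed to match the GPT2 config above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # GPT2's end-of-text token
)
# Save the full tokenizer; it can later be loaded with
# GPT2TokenizerFast(tokenizer_file="tokenizer.json").
tokenizer.save("tokenizer.json")
```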
7. (Optional) Desired project outcome
There are currently no causal language models for Romanian; this would be the first one.
8. (Optional) Reads
The following links can be useful to better understand the project and what has previously been done.
This is interesting, as it would be the first Romanian model that can be used for text generation. I highly support this project!
I highly encourage this project, as the current multilingual text generation models do not work well for Romanian.
This would be really helpful for my team.
Let’s define the project here, mihaitensor! I’ll add you to the team, dumitrescustefan, avramandrei, andreidima – please leave a comment in the Google sheet if you would like to be removed.
Thank you for adding us to the team, Suzana!