Currently, there is no GPT2 model on the hub that was trained from scratch for Spanish. The goal of this project is to create a strong language generation model for Spanish.
A causal language modeling script for Flax is available here. It can be used pretty much without any required code changes.
(Optional) Desired project outcome
The desired project output is a GPT2 model that can generate Spanish text.
(Optional) Challenges
The dataset on OSCAR is quite large, around 150 GB. One might want to try out dataset streaming to avoid having to download all of it first. Dataset streaming will be merged into datasets in a couple of days. See the PR here and the docs here.
(Optional) Links to read upon
The most important read would be the following colab:
Awesome, great that you guys are already so interested in this!
The nice thing about OSCAR is that it is supported in the datasets library, and we now also support dataset streaming (see “Load a Dataset in Streaming mode” in the datasets 1.8.0 documentation), which means that you won’t run into any disk space or memory problems. Sadly, we don’t support this out of the box for mC4 (it’s not even in datasets yet), but I’d definitely recommend trying it out! Maybe you can even first train a model on OSCAR and then a second one on mC4.
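For anyone who wants to try streaming right away, here is a minimal sketch of what it could look like. It assumes datasets 1.8.0 or later, that the Spanish OSCAR config is named `unshuffled_deduplicated_es`, and that each record has a `text` field; `stream_oscar_es` and `take_texts` are just illustrative helper names, not part of the library.

```python
from itertools import islice


def stream_oscar_es():
    """Return an iterable over Spanish OSCAR without downloading it in full.

    Assumptions: the config name "unshuffled_deduplicated_es" and the
    streaming=True flag as documented for datasets 1.8.0.
    """
    # Imported lazily so the helper below works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar", "unshuffled_deduplicated_es", split="train", streaming=True
    )


def take_texts(examples, n):
    """Pull the first n documents from any iterable of {"text": ...} records."""
    # islice stops after n items, so the rest of the stream is never read.
    return [example["text"] for example in islice(examples, n)]


# Usage (needs network access):
# for text in take_texts(stream_oscar_es(), 3):
#     print(text[:80])
```

Because the records are read lazily, looking at a few examples like this should only fetch a small part of the 150 GB rather than the whole dataset.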
Hello Team,
I am interested in being part of such an amazing project and team. I will try my best to contribute to the Spanish GPT2 model. It would be nice if we could discuss some more learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team. It will be interesting to work on both the OSCAR and mC4 datasets.