PreTrain GPT2 from scratch in Spanish

GPT2 for Spanish

Currently, there is no GPT2 model on the Hugging Face hub that was trained from scratch for Spanish. For this project, the goal is to create a strong language generation model for Spanish.

Model

A randomly initialized GPT2 model
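
As a rough sketch of what "randomly initialized" means in practice: only the architecture is specified, and no pretrained weights are loaded. The snippet below writes a GPT-2 small style config.json and estimates the resulting parameter count. The hyperparameters are the standard GPT-2 small values, which is an assumption about the model size; the helper names and output directory are hypothetical, and a tokenizer trained on Spanish may end up with a different vocab_size.

```python
import json
import os
import tempfile

# Standard GPT-2 small hyperparameters (assumption: the project keeps this size).
GPT2_SMALL = {
    "model_type": "gpt2",
    "vocab_size": 50257,  # may differ once a Spanish tokenizer is trained
    "n_positions": 1024,
    "n_embd": 768,
    "n_layer": 12,
    "n_head": 12,
}


def count_parameters(cfg: dict) -> int:
    """Estimate trainable parameters of a GPT-2 model with tied input/output embeddings."""
    d = cfg["n_embd"]
    # Token embeddings plus learned position embeddings.
    embeddings = cfg["vocab_size"] * d + cfg["n_positions"] * d
    # Per block: attention (4*d*d + 4*d), MLP (8*d*d + 5*d), two LayerNorms (4*d).
    per_block = 12 * d * d + 13 * d
    final_layer_norm = 2 * d
    return embeddings + cfg["n_layer"] * per_block + final_layer_norm


def write_config(cfg: dict, directory: str) -> str:
    """Write cfg as config.json in directory and return the path (hypothetical helper)."""
    path = os.path.join(directory, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
    return path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    print(write_config(GPT2_SMALL, out_dir))
    print(f"{count_parameters(GPT2_SMALL):,} parameters")
```

A training script would then build the model from this configuration rather than from a checkpoint, so all weights start from random values.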

Datasets

One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.

Available training scripts

A causal language modeling script for Flax is available here. It can be used pretty much without any required code changes.

(Optional) Desired project outcome

The desired project output is a GPT2 model that is able to generate fluent Spanish text.

(Optional) Challenges

The OSCAR dataset is quite large, around 150 GB. One might want to try out dataset streaming to solve this problem. Dataset streaming will be merged into datasets in a couple of days. See the PR here and the docs here.
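
A minimal sketch of how streaming could fit into causal language modeling preprocessing: documents are read one at a time and packed into fixed-size blocks without ever holding the whole corpus on disk or in memory. The config name "unshuffled_deduplicated_es" and the function names are assumptions for illustration; stream_spanish_oscar needs the datasets library (with streaming support) and network access, so the demo at the bottom uses toy token ids instead.

```python
from typing import Iterable, Iterator, List


def stream_spanish_oscar():
    """Return a lazily iterated OSCAR split (assumed config name; not called in the demo)."""
    from datasets import load_dataset  # imported here so the rest runs without it

    return load_dataset(
        "oscar", "unshuffled_deduplicated_es", split="train", streaming=True
    )


def group_into_blocks(
    token_stream: Iterable[List[int]], block_size: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and lazily yield blocks of block_size tokens."""
    buffer: List[int] = []
    for ids in token_stream:
        buffer.extend(ids)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # A trailing remainder shorter than block_size is dropped.


if __name__ == "__main__":
    # Toy stand-in for tokenized documents coming from a streamed dataset.
    toy_documents = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for block in group_into_blocks(toy_documents, block_size=4):
        print(block)
```

Because group_into_blocks is a generator, it only ever keeps one partial block in memory, which is the property that makes a 150 GB corpus manageable.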

(Optional) Links to read upon

The most important read would be the following colab:

Other reads that might be interesting include:

I’m definitely in! Check this @mrm8488 it’s exactly what we talked about

Yes, it is! I was thinking about using the Spanish portion of mC4 instead of OSCAR. What do you think, @patrickvonplaten and @mariagrandury?

Awesome, great that you guys are already so interested in this :slight_smile:
The nice thing about OSCAR is that it is supported in the datasets library, and we now also support dataset streaming (Load a Dataset in Streaming mode — datasets 1.8.0 documentation), which means that you surely won’t run into any disk space or memory problems. We sadly don’t support this out of the box for mC4 (it’s not even in datasets yet), but I’d definitely recommend trying it out! Maybe you can even first train a model on OSCAR and then a second one on mC4.

Great, @patrickvonplaten. We will. So we have the script, the dataset and the knowledge. Let’s start the party!

This sounds good! As I said in the project for BERTIN, I’d like to help in either one.

I’d like to join this project too! Maybe we could also finetune the model to bring Violadores del Verso (Spanish rap) back to life :stuck_out_tongue_winking_eye:

Hello Team,
I am interested in being a part of such an amazing project and team. I will try my best to contribute to the Spanish GPT2 model. It would be nice if we could discuss some more learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team. It will be interesting to work on both the OSCAR and mC4 datasets.

Awesome! let’s officially define this project :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

hi @valhalla ,

I am definitely interested in joining. Could you add me to the spreadsheet?

Thanks,

I would love to contribute to this project :smiley:

@jdposa , @shpotes , you are part of the team :slight_smile:

Hey, count me in if possible! I’d like to contribute.

you’re in :slight_smile:
