PreTrain GPT2 from scratch in Spanish

patrickvonplaten · June 23, 2021, 11:07am

GPT2 for Spanish

Currently, there is no GPT2 model on the Hugging Face hub that was trained from scratch for Spanish. For this project, the goal is to create a strong language generation model for Spanish.

Model

A randomly initialized GPT2 model
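As a rough illustration of what "randomly initialized" means in practice, here is a minimal sketch using the Flax GPT2 class from transformers. Building the model from a config, rather than with `from_pretrained`, leaves the weights at their random initial values. The helper name `build_random_gpt2` is hypothetical, and the default vocabulary size is just the GPT2 default; a tokenizer trained on Spanish text would likely need its own value.

```python
def build_random_gpt2(vocab_size: int = 50257, seed: int = 0):
    """Return a Flax GPT2 language model with freshly initialized weights.

    Hypothetical helper for illustration; vocab_size should match whatever
    tokenizer is eventually trained for Spanish (50257 is the GPT2 default).
    """
    # Imported lazily so the heavy dependencies are only needed when called.
    from transformers import FlaxGPT2LMHeadModel, GPT2Config

    # GPT2Config defaults correspond to the small GPT2: 12 layers,
    # 12 attention heads, hidden size 768, context length 1024.
    config = GPT2Config(vocab_size=vocab_size)

    # Constructing from a config (not from_pretrained) loads no checkpoint,
    # so the parameters are random, derived from the given seed.
    return FlaxGPT2LMHeadModel(config, seed=seed)
```

Calling `build_random_gpt2()` requires transformers, flax and jax to be installed; the returned model's `params` attribute holds the untrained weights.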

Datasets

One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.

Available training scripts

A causal language modeling script for Flax is available here. It can be used pretty much without any required code changes.

(Optional) Desired project outcome

The desired project output is a GPT2 model that is able to generate Spanish text.

(Optional) Challenges

The OSCAR dataset is quite large: around 150 GB. One might want to try out dataset streaming to solve this problem. Dataset streaming will be merged to datasets in a couple of days. See PR here and the docs here.
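For anyone who wants to try streaming, the sketch below shows one way it might look once a datasets release with streaming support is installed. Passing `streaming=True` to `load_dataset` returns an iterable that fetches examples lazily instead of downloading the whole corpus to disk. The config name `unshuffled_deduplicated_es` and the helper names are assumptions for illustration, and newer datasets releases may load OSCAR differently.

```python
from itertools import islice


def stream_oscar_es():
    """Open the Spanish part of OSCAR as a lazily fetched stream.

    Assumption: the config is named "unshuffled_deduplicated_es".
    """
    # Imported lazily so the helper below works without datasets installed.
    from datasets import load_dataset

    # streaming=True yields examples on the fly; nothing is saved to disk.
    return load_dataset(
        "oscar", "unshuffled_deduplicated_es", split="train", streaming=True
    )


def head(examples, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(examples, n))
```

A quick sanity check would be `head(stream_oscar_es(), 3)`, which pulls only the first three documents over the network rather than the full dataset.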

(Optional) Links to read up on

The most important read would be the following colab:

Google Colaboratory

Other reads that might be interesting include:

mariagrandury · June 23, 2021, 3:41pm

I’m definitely in! Check this @mrm8488 it’s exactly what we talked about

mrm8488 · June 23, 2021, 3:43pm

Yes, it is! I was thinking about using the Spanish portion of mC4 instead of OSCAR. What do you think, @patrickvonplaten and @mariagrandury?

patrickvonplaten · June 23, 2021, 7:57pm

Awesome, great that you guys are already so interested in this!
The nice thing about OSCAR is that it is supported in the datasets library, and we now also support dataset streaming (see "Load a Dataset in Streaming mode" in the datasets 1.8.0 documentation), which means that you won't run into any disk space problems. We sadly don't support this out of the box for mC4 (it's not even in datasets yet), but I'd definitely recommend trying it out! Maybe you can even first train a model on OSCAR and then a second one on mC4.

mrm8488 · June 23, 2021, 8:33pm

Great, @patrickvonplaten. We will. So we have the script, the dataset and the knowledge. Let’s start the party!

Pablogps · June 24, 2021, 9:38am

This sounds good! As I said in the project for BERTIN, I’d like to help in either one.

srisweet · June 27, 2021, 5:49am

Hello Team,
I am interested in being part of such an amazing project and team. I will try my best to contribute to the Spanish GPT2 model. It would be nice if we could discuss some more learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team. It will be interesting to work on both the OSCAR and mC4 datasets.

valhalla · June 28, 2021, 4:54pm

Awesome! Let’s officially define this project

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.

jdposa · June 28, 2021, 11:21pm

hi @valhalla ,

I am definitely interested in joining. Could you add me to the spreadsheet?

Thanks,

shpotes · June 28, 2021, 11:24pm

I would love to contribute to this project

valhalla · June 30, 2021, 8:53am

@jdposa , @shpotes , you are part of the team

jorgealro · July 1, 2021, 8:15am

Hey, count me in if possible! I’d like to contribute

patrickvonplaten · July 1, 2021, 9:43am

you’re in
