BERTIN: PreTrain RoBERTa-large from scratch in Spanish

versae · June 23, 2021, 3:34pm

BERTIN

The idea is to use the Spanish portion of mC4 (roughly 1TB of uncompressed text) to pre-train a RoBERTa-large model, first at a sequence length of 256 and then at 512. It might be a good idea to start at 128, as suggested here.
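As a rough illustration of what training at different sequence lengths means for the data, the sketch below concatenates tokenized documents and cuts them into fixed-size blocks. This is an assumption about how the corpus would be prepared, not a description of the final pipeline; `group_into_blocks` and the toy token ids are hypothetical.

```python
from itertools import chain


def group_into_blocks(tokenized_docs, block_size):
    """Concatenate token-id lists and cut them into blocks of exactly block_size tokens."""
    stream = list(chain.from_iterable(tokenized_docs))
    # Drop the ragged tail so every block has the same length.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Toy token ids standing in for real tokenizer output (hypothetical values).
docs = [list(range(300)), list(range(300, 700)), list(range(700, 1100))]

# The staged lengths discussed above: 128 first, then 256, then 512.
for block_size in (128, 256, 512):
    blocks = group_into_blocks(docs, block_size)
    print(block_size, len(blocks))
```

Shorter blocks give more, cheaper training examples from the same text, which is one reason to start at a short length before moving to 512.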

2. Language

The model will be trained in Spanish (regardless of variety).

3. Model

RoBERTa-large

4. Datasets

The Spanish portion of mC4, about 1TB of uncompressed data.

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling

6. Challenges

It is too much data. We need a way to reduce the amount of data to finish on time. Options:

- Random sampling.
- Perplexity sampling using percentiles and a Spanish language model. One option here is to use a 5-gram Kneser-Ney model as implemented in the KenLM library (Heafield, 2011) and released by Facebook.
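To make the two options concrete, here is a minimal sketch of both. It assumes per-document perplexities have already been computed by some language model (for example the KenLM 5-gram model mentioned above; the sketch does not call KenLM). Keeping the documents that fall between two percentile cut-offs is only one possible reading of "perplexity sampling using percentiles". All function names, the 25/75 cut-offs, and the toy values are hypothetical.

```python
import math
import random


def nearest_rank_percentile(values, q):
    """Nearest-rank percentile of values, with q given in the range 0-100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def random_sample(docs, keep_fraction, seed=0):
    """Option 1: keep a fixed fraction of documents, chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(docs, round(len(docs) * keep_fraction))


def perplexity_band_sample(docs, perplexities, low_q=25, high_q=75):
    """Option 2 (one reading): keep documents whose perplexity lies between two percentiles."""
    low = nearest_rank_percentile(perplexities, low_q)
    high = nearest_rank_percentile(perplexities, high_q)
    return [doc for doc, ppl in zip(docs, perplexities) if low <= ppl <= high]


# Toy corpus with made-up perplexity scores (not real measurements).
docs = [f"doc-{i}" for i in range(8)]
perplexities = [12.0, 850.0, 95.0, 140.0, 3000.0, 210.0, 60.0, 400.0]

print(random_sample(docs, keep_fraction=0.5))
print(perplexity_band_sample(docs, perplexities))
```

The band filter discards both extremes: very low perplexity often indicates repetitive or templated text, and very high perplexity often indicates noisy or non-Spanish text. Whether that trade-off actually helps is what comparing it against random sampling would test.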

7. Desired project outcome

A monolingual Spanish model that performs well on the usual benchmarks.

8. Reads

- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf

mrm8488 · June 23, 2021, 3:38pm

So interesting! I am in!

mariagrandury · June 23, 2021, 3:57pm

Me too! Nice idea

josecalvo · June 23, 2021, 5:36pm

Yes! Count me in!

vitojph · June 24, 2021, 6:47am

Hey! Awesome project! I’m in

Paulo · June 24, 2021, 7:50am

I think you already have a full working team, but just in case you need more people, I’d be willing to help. I don’t know if I could be of much use, but the opportunity to contribute to a model called BERTIN is too good to pass up

patrickvonplaten · June 24, 2021, 8:31am

Awesome to see so much interest here! We can also make multiple teams for the same project - so feel free to keep posting if you’re interested

mrm8488 · June 24, 2021, 8:32am

For this project we can have at least 2 teams!

Pablogps · June 24, 2021, 9:37am

This project sounds good! I’d like to help with either this one or the Spanish GPT-2, wherever makes more sense to add one person.

hcordobes · June 24, 2021, 10:56am

I also think this is a very good effort. If I can be of any help please count me in.

arteagac · June 24, 2021, 4:50pm

Fantastic. I’d love to participate in one of the teams, if possible. Thanks!

jorgealro · June 25, 2021, 8:57am

What a good idea. Please, count me in if possible!!

srisweet · June 27, 2021, 5:45am

Hello @versae & Team,
I am interested in being a part of such an amazing project & team. I will try my best to contribute to the BERTIN model. It would be nice if we could discuss some more learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone in the team.

versae · June 27, 2021, 6:56pm

I’m overwhelmed by the good reception of the idea. Thanks y’all! If we are awarded the TPUs, we must decide on a communication channel a bit more effective than the forum, maybe. A Slack channel? Discord? E-mail?

mariagrandury · June 27, 2021, 9:40pm

It was a really nice idea @versae! If you want we could use the Slack group of NLP en ES and create a channel for us. Some of the people in this thread are already in the community, if the others wanna join you can use this invitation.

birgermoell · June 28, 2021, 9:27am

This seems like a super interesting project. I would love to be a part of this project.

edugp · June 28, 2021, 3:51pm

This sounds awesome! I would be interested in joining as well, if there’s still room for more people!

patrickvonplaten · June 28, 2021, 4:10pm

Super excited about this project with so much interest! Given that we have so many people, I think it makes sense to officially define 2 groups of 7 people (more people can join later :-))

I’ve divided people that are interested into two groups and added all the names here.

Let me know if you don’t want to be in the team and I’ll take you out

I think the two teams should talk to each other and help each other as much as possible but, as stated on Slack before, having a single team of 14 is not efficient IMO. Therefore, I propose that both teams take their own approach to dataset preprocessing, model definition, etc.

One important aspect that should also be discussed a bit is how the pretrained RoBERTa model should be evaluated. Maybe on XTREME after pretraining?

mrm8488 · June 28, 2021, 4:25pm

XTREME and PAWS-X could be a good starting point

versae · June 28, 2021, 4:27pm

I think XTREME is perfect. Not sure if XTREME supersedes PAWS-X or if they are just different.

So if I understand this correctly, this means we could pre-train 2 versions of BERTIN?

If that’s the case, that’s something the 2 teams need to talk about and agree on. My two cents would be to test the 2 different approaches to shrinking the corpus: random sampling and perplexity sampling.
