BERTIN: Pre-train RoBERTa-large from scratch in Spanish

1. BERTIN

The idea is to use the Spanish portion of mC4 (which amounts to roughly 1 TB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. It might be a good idea to start at 128, as suggested here.

2. Language

The model will be trained in Spanish (regardless of variety).

3. Model

RoBERTa-large

4. Datasets

The Spanish portion of mC4, about 1 TB of uncompressed data.
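
The corpus can be inspected without downloading the full ~1 TB by streaming it. A minimal sketch, assuming the `datasets` library and that the Spanish split is exposed as the `"es"` config of the `"mc4"` dataset on the Hub:

```python
from datasets import load_dataset

# Streaming yields examples lazily instead of materializing ~1 TB on disk.
mc4_es = load_dataset("mc4", "es", split="train", streaming=True)

for i, example in enumerate(mc4_es):
    # Each example carries "text", "timestamp" and "url" fields.
    print(example["text"][:200])
    if i == 2:
        break
```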

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
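
Those scripts expect a tokenizer and a model config to load before training starts. A rough sketch of that preparation, assuming the `tokenizers` and `transformers` libraries; `./bertin` and `sampled_mc4_es.txt` are placeholder names, not decisions:

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig

out_dir = Path("./bertin")          # placeholder output directory
out_dir.mkdir(parents=True, exist_ok=True)

# Train a byte-level BPE tokenizer on (a subsample of) the corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["sampled_mc4_es.txt"],   # placeholder: the subsampled Spanish corpus
    vocab_size=50_265,              # RoBERTa's vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save(str(out_dir / "tokenizer.json"))

# Reuse roberta-large's architecture hyperparameters for the new model's config.
config = RobertaConfig.from_pretrained("roberta-large")
config.save_pretrained(out_dir)
```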

6. Challenges

It is a lot of data, so we need a way to reduce it enough to finish training on time. Options:

  1. Random sampling.
  2. Perplexity sampling using percentiles and a Spanish language model. One option here is to use a 5-gram Kneser-Ney model as implemented in the KenLM library (Heafield, 2011) and released by Facebook (see the sketch just below this list).
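
A minimal sketch of that second option, assuming the `kenlm` Python bindings, a local copy of a Spanish 5-gram model (the path `es.arpa.bin` is a placeholder), and an illustrative 25th–75th percentile window:

```python
import kenlm          # Python bindings for KenLM (Heafield, 2011)
import numpy as np

model = kenlm.Model("es.arpa.bin")  # placeholder path to the Spanish 5-gram model

def perplexity(text: str) -> float:
    # kenlm's score() returns a log10 probability for the whole string;
    # turn it into a rough per-word perplexity.
    n_words = max(len(text.split()), 1)
    return 10.0 ** (-model.score(text) / n_words)

def sample_by_perplexity(docs, lower_pct=25, upper_pct=75):
    # Keep documents whose perplexity falls inside a percentile window,
    # trimming both the boilerplate-like tail and the noisy tail.
    ppls = np.array([perplexity(d) for d in docs])
    low, high = np.percentile(ppls, [lower_pct, upper_pct])
    return [d for d, p in zip(docs, ppls) if low <= p <= high]
```

Whether to keep a middle window like this or simply drop everything above a single threshold is one of the choices a team could experiment with.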

7. Desired project outcome

A well-performing monolingual Spanish model on the usual benchmarks.

8. Reads

- RoBERTa (Liu et al., 2019): https://arxiv.org/pdf/1907.11692.pdf
- CCNet (Wenzek et al., 2019): https://arxiv.org/pdf/1911.00359.pdf
- KenLM (Heafield, 2011): https://www.aclweb.org/anthology/W11-2123.pdf

So interesting! I am in!

Me too! Nice idea :hugs:

Yes! Count me in!

Hey! Awesome project! I’m in :stuck_out_tongue:

I think you already have a full working team, but just in case you need more people, I’d be willing to help. I don’t know if I could be of much use, but the opportunity to contribute to a model called BERTIN is too good to let it go :grin:

Awesome to see so much interest here! We can also make multiple teams for the same project - so feel free to keep posting if you’re interested :slight_smile:

For this project we can have at least 2 teams!

This project sounds good! I’d like to help with either this one or the Spanish GPT-2, wherever makes more sense to add one person.

I also think this is a very good effort. If I can be of any help please count me in.

Fantastic. I’d love to participate in one of the teams, if possible. Thanks!

What a good idea. Please, count me in if possible!!

Hello @versae & Team,
I am interested in being a part of such an amazing project & team. I will try my best to contribute to the BERTIN model. It would be nice if we could discuss some more learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone in the team.

I’m overwhelmed by the good reception of the idea. Thanks, y’all! If we are awarded the TPUs, we should probably decide on a communication channel a bit more effective than the forum. A Slack channel? Discord? E-mail?

It was a really nice idea, @versae! If you want, we could use the Slack group of NLP en ES :hugs: and create a channel for us. Some of the people in this thread are already in the community; if the others wanna join, you can use this invitation.

This seems like a super interesting project. I would love to be a part of this project.

This sounds awesome! I would be interested in joining as well, if there’s still room for more people! :slight_smile:

Super excited to see so much interest in this project! Given that we have so many people, I think it makes sense to officially define 2 groups of 7 people (more people can join later :-))

I’ve divided people that are interested into two groups and added all the names here.

Let me know if you don’t want to be on the team and I’ll take you out :slight_smile:

I think the two teams should talk to each other and help each other as much as possible, but as stated on Slack before, having a single team of 14 is not efficient IMO. Therefore, I propose that both teams take their own approach to the dataset preprocessing, model definition, etc.

One important aspect that should also be discussed a bit is how the pretrained RoBERTa model should be evaluated. Maybe on XTREME after pretraining?

XTREME and PAWS-X could be a good starting point

I think XTREME is perfect. Not sure if XTREME supersedes PAWS-X or they are just different.

So if I understand this correctly, this means we could pre-train 2 versions of BERTIN?

If that’s the case, that’s something the 2 teams need to talk about and agree on. My two cents would be to test the 2 different approaches to shrinking the corpus size: random sampling and perplexity sampling.
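
Whichever evaluation we settle on, pulling the data in is cheap. A rough sketch, assuming the `paws-x` dataset (Spanish config) on the Hub; the fine-tuning itself would follow the usual sequence-classification recipe:

```python
from datasets import load_dataset

# PAWS-X, Spanish configuration: paraphrase identification pairs.
pawsx_es = load_dataset("paws-x", "es")

print(pawsx_es)                   # train / validation / test splits
print(pawsx_es["validation"][0])  # fields: id, sentence1, sentence2, label
```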