The idea is to use the Spanish portion of mC4 (which roughly amounts to 1TB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. It might be a good idea to start at 128, as suggested here.
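As a rough sketch of what the staged sequence-length setup could look like (assumptions: the tokenizer below is only a placeholder, since the project would train its own BPE tokenizer on Spanish text; the block sizes and two-stage split are illustrative):

```python
# Sketch of the staged sequence-length idea; names and values are illustrative only.
from functools import partial
from transformers import RobertaConfig, RobertaTokenizerFast

# RoBERTa's default config allows positions up to 512 (max_position_embeddings=514),
# so the same checkpoint can be trained first on shorter blocks and later on 512-token blocks.
config = RobertaConfig.from_pretrained("roberta-large")

# Placeholder: reuse the English tokenizer just to show the shape of the pipeline.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")

def encode(batch, max_length):
    """Tokenize raw documents into fixed-size blocks for masked-LM pre-training."""
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_special_tokens_mask=True,
    )

# Stage 1: most of training on short blocks (128 or 256 tokens).
encode_stage1 = partial(encode, max_length=256)
# Stage 2: resume from the stage-1 checkpoint and switch to 512-token blocks.
encode_stage2 = partial(encode, max_length=512)
```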
2. Language
The model will be trained in Spanish (regardless of variety).
3. Model
RoBERTa-large
4. Datasets
The Spanish portion of mC4, about 1TB of uncompressed data.
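Since nobody wants to download 1TB just to look at it, the split can be inspected via streaming (assuming the mc4 dataset name and es config as listed on the Hugging Face Hub):

```python
# Peek at the Spanish split of mC4 without downloading the full ~1TB.
from datasets import load_dataset

mc4_es = load_dataset("mc4", "es", split="train", streaming=True)
for i, example in enumerate(mc4_es):
    print(example["url"])
    print(example["text"][:200], "...\n")
    if i == 2:  # just a handful of documents
        break
```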
5. Training scripts
There are already Flax scripts to pre-train RoBERTa that we can easily use:
6. Challenges
This is too much data to process in the time available, so we need a way to reduce it. Options:
Random sampling.
Perplexity sampling using percentiles and a Spanish language model. One option here is to use a 5-gram Kneser-Ney model as implemented in the KenLM library (Heafield, 2011) and released by Facebook (sketched below).
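A minimal sketch of the perplexity-sampling option, assuming a Spanish 5-gram KenLM binary such as the ones released with CC-Net; the model path, sample size, and percentile band are illustrative placeholders, not a final recipe:

```python
# Sketch: keep only documents whose KenLM perplexity falls inside a percentile band.
# Requires: pip install kenlm datasets numpy
import kenlm
import numpy as np
from datasets import load_dataset

LM_PATH = "es.arpa.bin"  # hypothetical path to a Spanish 5-gram Kneser-Ney model
lm = kenlm.Model(LM_PATH)

def perplexity(text: str) -> float:
    """Per-word perplexity of a document under the KenLM model."""
    text = " ".join(text.split())  # collapse newlines/whitespace into single spaces
    return lm.perplexity(text) if text else float("inf")

mc4_es = load_dataset("mc4", "es", split="train", streaming=True)

# Estimate the perplexity distribution on a small sample of documents.
sample = [perplexity(ex["text"]) for _, ex in zip(range(10_000), mc4_es)]
low, high = np.percentile(sample, [25, 75])  # e.g. keep the middle 50%

def keep(example) -> bool:
    return low <= perplexity(example["text"]) <= high

# Lazily filtered stream of documents to feed into tokenization/pre-training.
filtered = (ex for ex in mc4_es if keep(ex))
```

Whether to keep the middle of the distribution or bias towards lower perplexities is exactly the kind of decision we would need to experiment with.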
7. Desired project outcome
A well-performing monolingual Spanish model on the usual benchmarks.
I think you already have a full working team, but just in case you need more people, I’d be willing to help. I don’t know if I could be of much use, but the opportunity to contribute to a model called BERTIN is too good to let go.
Hello @versae & Team,
I am interested in being part of such an amazing project & team. I will try my best to contribute to the BERTIN model. It would be nice if we could discuss some more learning resources that would be useful for this project. I can work in any time zone that is comfortable for everyone on the team.
I’m overwhelmed by the good reception of the idea. Thanks, y’all! If we are awarded the TPUs, we should probably decide on a communication channel a bit more effective than the forum. A Slack channel? Discord? E-mail?
It was a really nice idea, @versae! If you want, we could use the NLP en ES Slack group and create a channel for us. Some of the people in this thread are already in that community; if the others want to join, you can use this invitation.
Super exciting to see so much interest in this project! Given that we have so many people, I think it makes sense to officially define 2 groups of 7 people (more people can join later :-)).
I’ve divided the people who are interested into two groups and added all the names here.
Let me know if you don’t want to be on the team and I’ll take you out.
I think the two teams should talk to each other and help each other as much as possible, but as stated on Slack before, having a single team of 14 is not efficient IMO. Therefore, I propose that both teams take their own approach to the dataset preprocessing, model definition, etc.
One important aspect that should also be discussed a bit is how the pretrained RoBERTa model should be evaluated. Maybe on XTREME after pretraining?
I think XTREME is perfect. Not sure if XTREME supersedes PAWS-X or if they are just different.
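For what it’s worth, PAWS-X is one of the tasks bundled inside XTREME (alongside XNLI and others), so evaluating on XTREME would cover it. A quick sketch of pulling the Spanish evaluation splits, assuming the dataset names as published on the Hub:

```python
# Sketch: load Spanish evaluation sets whose tasks are also part of XTREME.
from datasets import load_dataset

pawsx_es = load_dataset("paws-x", "es")  # paraphrase identification (sentence1, sentence2, label)
xnli_es = load_dataset("xnli", "es")     # natural language inference (premise, hypothesis, label)

print(pawsx_es)
print(xnli_es)
```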
So if I understand this correctly, this means we could pre-train 2 versions of BERTIN?
If that’s the case, that’s something the 2 teams need to talk about and agree on. My two cents would be to test the two different approaches to shrinking the corpus size: random sampling and perplexity sampling.