Pretrain RoBERTa from scratch in Portuguese

RoBERTa for Portuguese

Currently, only a very limited number of RoBERTa or RoBERTa-Large models trained from scratch for Portuguese are available on the Hugging Face Hub. The goal of this project is to create a RoBERTa model for just the Portuguese language.

Model

A randomly initialized RoBERTa model

Available training scripts

A masked language modeling script for Flax is available here. It can be used pretty much without any required code changes.
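A launch could look roughly like the following. This is only a sketch of the Flax MLM example script's command line; the output directory, dataset choice (OSCAR's Portuguese split here), and all hyperparameters are placeholders, not decided values for this project:

```shell
# Sketch of a pretraining launch with run_mlm_flax.py
# (transformers/examples/flax/language-modeling).
# Paths, dataset, and hyperparameters below are illustrative only.
python run_mlm_flax.py \
    --output_dir="./roberta-base-pt" \
    --model_type="roberta" \
    --config_name="./roberta-base-pt" \
    --tokenizer_name="./roberta-base-pt" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_pt" \
    --max_seq_length="128" \
    --per_device_train_batch_size="128" \
    --learning_rate="3e-4" \
    --warmup_steps="1000" \
    --num_train_epochs="8"
```

A config and tokenizer would need to exist in the output directory before launching.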

Dataset

We can use the BrWaC (Brazilian Web as Corpus) dataset, a large Portuguese corpus, or Wikidata in Portuguese.

Challenges

  • Data pre-processing
  • Sample preparation for LM
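On sample preparation: RoBERTa-style masked language modeling selects roughly 15% of token positions, of which 80% become the mask token, 10% a random token, and 10% stay unchanged. A minimal pure-Python sketch of that scheme (the mask-token id is a made-up placeholder; in practice the example script's data collator handles this):

```python
import random

MASK_ID = 4          # placeholder mask-token id (depends on the tokenizer)
VOCAB_SIZE = 50265   # RoBERTa-base vocabulary size

def mask_tokens(input_ids, mlm_prob=0.15, seed=None):
    """Return (masked_ids, labels) for masked language modeling.

    labels is -100 (ignored by the loss) everywhere except selected
    positions, where it holds the original token id.
    """
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_prob:
            continue                          # position not selected
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:                        # 80%: replace with mask token
            masked[i] = MASK_ID
        elif roll < 0.9:                      # 10%: replace with random token
            masked[i] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return masked, labels

masked, labels = mask_tokens(list(range(100, 120)), seed=0)
```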

(Optional) Desired project outcome

The desired project output is a strong RoBERTa model in Portuguese.

This is an amazing idea! I’m really looking forward to being part of this team!

What a cool project! You can count on me.

Really interesting! I’m in!

Wow, it’s a great project. I’m applying to help with it.

Awesome idea! I am really interested in contributing to this project.

Sounds great! Finalizing this project as well :slight_smile: One thing you should also think about is how to evaluate the pretrained model on a downstream task. The paws-x or xtreme datasets on the Hugging Face Hub would make sense, I think :slight_smile:

Putting everybody in one group for now. You can check it out here.

That would be great. Wanna be part of it!!

Wow, how cool this project is!
Quite innovative idea.
I’ve already signed up to help with the project, you can count on me!
Congratulations to those involved!
:grinning: :clap:

Thanks, everyone. At the moment we are thinking of evaluating the pretrained model on HAREM, SQuAD, or LeNER.

Hey man, it’s a very good idea. I want to participate.

Very interesting project! It’ll be great to participate!

Hey! Awesome project! I’m in

Good afternoon @diiogo.
I tried to put my username in the spreadsheet, but it is now closed.
As I’m interested in the subject, I will follow your work, in particular regarding the evaluation of your Portuguese RoBERTa model on downstream tasks.

  • Question Answering (QA) with SQuAD 1.1 pt: I published 2 BERT models finetuned on this dataset. It will be interesting to compare your F1 and exact-match scores (link to the SQuAD 1.1 pt BERT models in the HF model hub).
  • Token Classification with LeNER-br: the dataset is in the HF datasets hub. I will soon publish a BERT model finetuned on it, too.
  • Text Classification with ASSIN and ASSIN2 (RTE: Recognizing Textual Entailment): the ASSIN and ASSIN2 (Avaliação de Similaridade Semântica e INferência textual) corpora are annotated with pairs of sentences written in Portuguese and are suitable for exploring textual entailment and paraphrase classifiers. The datasets are in the HF datasets hub (ASSIN, ASSIN2). I will soon publish BERT models finetuned on them, too.
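For the SQuAD-style comparison, the standard metrics are exact match and token-level F1. A simplified sketch (it only lowercases and splits on whitespace, without the official script's full answer normalization):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Over a dataset, one would average both metrics across examples, taking the max over the gold answer variants for each question.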

It will be great to test your model with other datasets and/or tasks. What do you think?

This group has a lot of participation :slight_smile:

Giving you guys direct access to TPUs tomorrow! I split the team randomly into two in the official Google sheet, but this shouldn’t change anything - just that you have access to 2 TPU v3-8s :slight_smile:

It might make organization a bit easier to split the work across two VMs!

Good morning @pierreguillou.
It’s a nice idea to include your results in the evaluation of this task (SQuAD).
I agree - we will evaluate the model on other tasks as well.

Thank you so much!

Hello @diiogo.

I edited my post with datasets ASSIN and ASSIN2 for Text Classification task.

BrWaC is in the Hugging Face datasets hub, and OSCAR is, too.

It might be interesting to mix the 2 together for training.

Regarding embedding and weight initialization, starting from the weights of a trained RoBERTa model could help, too:

  1. Download the model from the HF Model Hub to get the weights.
  2. Train a RoBERTa tokenizer on the training dataset to get the vocabulary.
  3. Keep the embeddings of the tokens common to the downloaded model’s vocabulary and the new one.
  4. For the new tokens, initialize their embeddings (e.g., randomly).
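Steps 3 and 4 could be sketched like this. Pure-Python illustration with toy vocabularies and random vectors; in practice the old embeddings would come from the downloaded checkpoint, and the random init would match the model's initializer scale:

```python
import random

def transfer_embeddings(old_vocab, old_embeddings, new_vocab, dim, seed=0):
    """Build an embedding table for new_vocab.

    Tokens shared with old_vocab keep their trained vector (step 3);
    tokens new to the vocabulary get a random init (step 4).
    """
    rng = random.Random(seed)
    new_embeddings = {}
    for token in new_vocab:
        if token in old_vocab:
            new_embeddings[token] = old_embeddings[token]
        else:
            # Small-scale Gaussian init, as is typical for transformers.
            new_embeddings[token] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return new_embeddings

# Toy example: two tokens survive the vocabulary change, one is new.
old_vocab = {"casa", "de", "papel"}
old_emb = {t: [1.0] * 4 for t in old_vocab}
new_emb = transfer_embeddings(old_vocab, old_emb, {"casa", "de", "saudade"}, dim=4)
```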

Hi! What’s the current status of this initiative?