Pretrain RoBERTa from scratch in Portuguese

RoBERTa for Portuguese

Currently, only a very limited number of RoBERTa or RoBERTa-Large models trained from scratch for Portuguese are available on the Hugging Face Hub. The goal of this project is to create a RoBERTa model for just the Portuguese language.

Model

A randomly initialized RoBERTa model

Available training scripts

A masked language modeling script for Flax is available here. It can be used pretty much without any required code changes.
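A launch could look roughly like the following. This is only a sketch of the Flax MLM example script's command line; the output directory, dataset choice (OSCAR's Portuguese split here), and all hyperparameters are placeholders, not decided values for this project:

```shell
# Sketch of a pretraining launch with run_mlm_flax.py
# (transformers/examples/flax/language-modeling).
# Paths, dataset, and hyperparameters below are illustrative only.
python run_mlm_flax.py \
    --output_dir="./roberta-base-pt" \
    --model_type="roberta" \
    --config_name="./roberta-base-pt" \
    --tokenizer_name="./roberta-base-pt" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_pt" \
    --max_seq_length="128" \
    --per_device_train_batch_size="128" \
    --learning_rate="3e-4" \
    --warmup_steps="1000" \
    --num_train_epochs="8"
```

A config and tokenizer would need to exist in the output directory before launching.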

Dataset

We can use the BrWaC (Brazilian Web as Corpus) dataset, a large Portuguese corpus, or Wikidata in Portuguese.

Challenges

  • Data pre-processing
  • Sample preparation for LM
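On sample preparation: RoBERTa-style masked language modeling selects roughly 15% of token positions, of which 80% become the mask token, 10% a random token, and 10% stay unchanged. A minimal pure-Python sketch of that scheme (the mask-token id is a made-up placeholder; in practice the example script's data collator handles this):

```python
import random

MASK_ID = 4          # placeholder mask-token id (depends on the tokenizer)
VOCAB_SIZE = 50265   # RoBERTa-base vocabulary size

def mask_tokens(input_ids, mlm_prob=0.15, seed=None):
    """Return (masked_ids, labels) for masked language modeling.

    labels is -100 (ignored by the loss) everywhere except selected
    positions, where it holds the original token id.
    """
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_prob:
            continue                          # position not selected
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:                        # 80%: replace with mask token
            masked[i] = MASK_ID
        elif roll < 0.9:                      # 10%: replace with random token
            masked[i] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return masked, labels

masked, labels = mask_tokens(list(range(100, 120)), seed=0)
```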

(Optional) Desired project outcome

The desired project output is a strong RoBERTa model in Portuguese.

This is an amazing idea! I’m really looking forward to being part of this team!

What a cool project! You can count on me.

Really interesting! I’m in!

Wow, it’s a great project. I’m applying to help with it.

Awesome idea! I am really interested in contributing to this project.

Sounds great! Finalizing this project as well :slight_smile: One thing you should also think about is how to evaluate the pretrained model on a downstream task. The paws-x or xtreme datasets on the Hugging Face Hub would make sense, I think :slight_smile:

Putting everybody in one group for now. You can check it out here.

That would be great. Wanna be part of it!!

Wow, how cool this project is!
Quite innovative idea.
I’ve already signed up to help with the project, you can count on me!
Congratulations to those involved!
:grinning: :clap:

Thanks, everyone. At the moment we are thinking of evaluating the pretrained model on HAREM, SQuAD, or LeNER.

Hey man, it’s a very good idea. I want to participate.

Very interesting project! It’ll be great to participate!

Hey! Awesome project! I’m in

Good afternoon @diiogo.
I tried to put my username in the spreadsheet, but it is now closed.
As I’m interested in the subject, I will follow your work, in particular regarding the evaluation of your Portuguese RoBERTa model on downstream tasks.

  • Question Answering (QA) with SQuAD 1.1 pt: I published 2 BERT models finetuned on this dataset. It will be interesting to compare your F1 and exact-match scores (link to the SQuAD 1.1 pt BERT models in the HF model hub).
  • Token Classification with LeNER-br: the dataset is in the HF datasets hub. I will soon publish a BERT model finetuned on it, too.
  • Text Classification with ASSIN and ASSIN2 (RTE: Recognizing Textual Entailment): the ASSIN and ASSIN2 (Avaliação de Similaridade Semântica e INferência textual) corpora are annotated with pairs of sentences written in Portuguese and are suitable for exploring textual entailment and paraphrase classifiers. The datasets are in the HF datasets hub (ASSIN, ASSIN2). I will soon publish BERT models finetuned on them, too.
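For the SQuAD-style comparison, the standard metrics are exact match and token-level F1. A simplified sketch (it only lowercases and splits on whitespace, without the official script's full answer normalization):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Over a dataset, one would average both metrics across examples, taking the max over the gold answer variants for each question.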

It will be great to test your model with other datasets and/or tasks. What do you think?

This group has a lot of participation :slight_smile:

Giving you guys direct access to TPUs tomorrow! I split the team randomly into two in the official Google sheet, but this shouldn’t change anything - just that you have access to 2 TPU v3-8s :slight_smile:

It might make organization a bit easier to split the work across two VMs!

Good morning @pierreguillou.
It’s a nice idea to include your results in the evaluation of this task (SQuAD).
I agree - we will evaluate the model on other tasks as well.

Thank you so much!

Hello @diiogo.

I edited my post with datasets ASSIN and ASSIN2 for Text Classification task.

BrWaC is in the Hugging Face datasets hub, and OSCAR is, too.

It might be interesting to mix the 2 together for training.

Regarding embedding and weight initialization, starting from the weights of a trained RoBERTa model could help, too:

  1. Download the model from the HF Model Hub to get the weights.
  2. Train a RoBERTa tokenizer on the training dataset to get the vocabulary.
  3. Keep the embeddings of the tokens common to the downloaded model’s vocabulary and the new one.
  4. For the new tokens, initialize their embeddings (e.g., randomly).
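Steps 3 and 4 could be sketched like this. Pure-Python illustration with toy vocabularies and random vectors; in practice the old embeddings would come from the downloaded checkpoint, and the random init would match the model's initializer scale:

```python
import random

def transfer_embeddings(old_vocab, old_embeddings, new_vocab, dim, seed=0):
    """Build an embedding table for new_vocab.

    Tokens shared with old_vocab keep their trained vector (step 3);
    tokens new to the vocabulary get a random init (step 4).
    """
    rng = random.Random(seed)
    new_embeddings = {}
    for token in new_vocab:
        if token in old_vocab:
            new_embeddings[token] = old_embeddings[token]
        else:
            # Small-scale Gaussian init, as is typical for transformers.
            new_embeddings[token] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return new_embeddings

# Toy example: two tokens survive the vocabulary change, one is new.
old_vocab = {"casa", "de", "papel"}
old_emb = {t: [1.0] * 4 for t in old_vocab}
new_emb = transfer_embeddings(old_vocab, old_emb, {"casa", "de", "saudade"}, dim=4)
```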

Hi! What’s the current status of this initiative?