Currently, there are only very few RoBERTa or RoBERTa Large models trained from scratch for Portuguese on the Hugging Face Hub. For this project, the goal is to create a RoBERTa model for the Portuguese language alone.
Model
A randomly initialized RoBERTa model
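For reference, a minimal sketch of what this looks like in code, assuming a roberta-base-sized configuration (the vocab_size is an assumption and must match whatever tokenizer is trained on the Portuguese corpus):

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50265,            # assumption: set to your Portuguese tokenizer's vocab size
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,
    type_vocab_size=1,
)
# Building the model from a config (instead of from_pretrained) yields
# randomly initialized weights, which is what training from scratch needs.
model = FlaxRobertaForMaskedLM(config, seed=0)
```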
Available training scripts
A masked language modeling script for Flax is available here. It can be used essentially without any code changes.
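The script takes the model config and tokenizer via command-line arguments, so the main preparation step is training a Portuguese tokenizer first. A minimal sketch with the tokenizers library, assuming the corpus has been exported to plain-text files (the file path below is hypothetical):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pt_corpus.txt"],     # hypothetical path to the raw Portuguese text
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Writes vocab.json and merges.txt, which the Flax MLM script can point to.
tokenizer.save_model("./roberta-pt")
```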
Dataset
We can use the BrWaC (Brazilian Web as Corpus) dataset, a large Portuguese corpus, or Wikidata in Portuguese.
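Both are accessible through the datasets library. A sketch for BrWaC, noting that the loader on the Hub requires the corpus to be downloaded manually from UFRGS first (the local path below is hypothetical):

```python
from datasets import load_dataset

# BrWaC is gated behind a manual download; pass the folder containing the
# downloaded file via data_dir.
dataset = load_dataset("brwac", data_dir="path/to/brwac")
print(dataset["train"][0])
```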
Challenges
Data pre-processing
Sample preparation for LM (a sketch of both steps follows)
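A sketch of these two steps, assuming a `tokenizer` (e.g. a RobertaTokenizerFast built from the files above) and a `dataset` with a plain `text` column; the actual masking is applied later by the data collator during training:

```python
from itertools import chain

max_seq_length = 512

def tokenize_function(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all token sequences, then cut them into fixed-length
    # blocks, dropping the small remainder at the end.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = (len(concatenated["input_ids"]) // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

tokenized = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```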
(Optional) Desired project outcome
The desired project output is a strong RoBERTa model in Portuguese.
Wow, this project is so cool!
Quite an innovative idea.
I've already signed up to help with the project; you can count on me!
Congratulations to those involved!
Good afternoon @diiogo.
I tried to add my username to the spreadsheet, but it is now closed.
As I'm interested in the subject, I will follow your work, in particular the evaluation of your Portuguese RoBERTa model on downstream tasks.
Text Classification with ASSIN and ASSIN2 (RTE: Recognizing Textual Entailment): the ASSIN and ASSIN2 (Avaliação de Similaridade Semântica e INferência textual) corpora are annotated with pairs of sentences written in Portuguese and are suitable for exploring textual entailment and paraphrase classifiers. The datasets are on the HF datasets hub (ASSIN, ASSIN2). I will soon publish a BERT model fine-tuned on them, too.
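For reference, a minimal sketch of loading ASSIN2 for fine-tuning (the field names follow the dataset card and should be treated as assumptions here):

```python
from datasets import load_dataset

assin2 = load_dataset("assin2")
sample = assin2["train"][0]
# Each record pairs a premise with a hypothesis plus an entailment label.
print(sample["premise"], sample["hypothesis"], sample["entailment_judgment"])
```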
It would be great to test your model with other datasets and/or tasks. What do you think?
Giving you guys direct access to TPUs tomorrow! I split the team randomly into two in the official Google sheet, but this shouldn't change anything; it just means you have access to two TPU v3-8s.
It might make organization a bit easier to split the work across two VMs!
Good morning @pierreguillou.
It's a nice idea to include your results in the evaluation of this task (SQuAD).
I think so too; we will evaluate the model on other tasks as well.