PreTrain RoBERTa from scratch in Portuguese

Good afternoon @diiogo.
I tried to put my username in the spreadsheet but it is now closed.
As I’m interested in the subject, I will follow your work, in particular the evaluation of your Portuguese RoBERTa model on downstream tasks.

  • Question Answering (QA) with SQuAD 1.1 pt: I published 2 BERT models fine-tuned on this dataset. It would be interesting to compare your F1 and exact-match scores with theirs (link to the SQuAD 1.1 pt BERT models in the HF model hub).
  • Token Classification with LeNER-br: the dataset is available in the HF datasets hub. I will soon publish a BERT model fine-tuned on it, too.
  • Text Classification with ASSIN and ASSIN2 (RTE: Recognizing Textual Entailment): the ASSIN and ASSIN2 (Avaliação de Similaridade Semântica e INferência textual) corpora consist of annotated pairs of sentences written in Portuguese, suitable for exploring textual entailment and paraphrase classifiers. Both datasets are available in the HF datasets hub (ASSIN, ASSIN2). I will soon publish a BERT model fine-tuned on them, too.
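
For the QA comparison, here is a minimal sketch of how exact-match and F1 can be computed for a single answer, in the style of the SQuAD 1.1 evaluation. The function names are my own, and the normalization is a simplifying assumption: it only lowercases, strips ASCII punctuation, and collapses whitespace. The official English script also removes the articles "a", "an", and "the", which I left out because they do not apply to Portuguese.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, the usual approach is to take the maximum score over the references and then average over all questions. For the numbers to be comparable, both models should be scored with the same normalization.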

It would be great to test your model with other datasets and/or tasks. What do you think?
