Currently, only a very limited number of RoBERTa or RoBERTa-Large models trained from scratch for Portuguese are available on the Hugging Face Hub. The goal of this project is to create a RoBERTa model dedicated to the Portuguese language.
A randomly initialized RoBERTa model
A masked language modeling script for Flax is available here. It can be used with essentially no code changes.
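As a sketch, a training run might be launched roughly as follows. This is a hypothetical invocation: the flag names mirror the Flax masked language modeling example script in the `transformers` repository at the time of writing, and every path, name, and hyperparameter below is a placeholder to be replaced (check the script's `--help` for the authoritative options):

```shell
# Hypothetical invocation of the Flax MLM example script;
# all paths and hyperparameters are illustrative placeholders.
python run_mlm_flax.py \
    --output_dir="./roberta-base-pt" \
    --model_type="roberta" \
    --config_name="./roberta-base-pt" \
    --tokenizer_name="./roberta-base-pt" \
    --dataset_name="brwac" \
    --max_seq_length="128" \
    --per_device_train_batch_size="32" \
    --learning_rate="3e-4" \
    --num_train_epochs="8"
```

Since the model is randomly initialized, only a config and a tokenizer trained on the Portuguese corpus need to exist at the given paths before launch.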
For data, we can use the BrWaC (Brazilian Web as Corpus) dataset, a large Portuguese corpus, or Wikidata in Portuguese.
- Data pre-processing
- Sample preparation for LM
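The two steps above can be sketched in pure Python, assuming the text has already been tokenized to integer IDs. The function names `chunk_tokens` and `mask_tokens` are illustrative, as are the specific IDs; the `-100` label value follows the common ignore-index convention for MLM losses, and the 15% / 80-10-10 masking scheme is the one described in the RoBERTa and BERT papers:

```python
import random

def chunk_tokens(token_ids, block_size=128):
    # Concatenate-then-split: the tokenized corpus is cut into
    # fixed-size blocks; a trailing partial block is dropped.
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]

def mask_tokens(token_ids, vocab_size, mask_id, mlm_prob=0.15, seed=0):
    # RoBERTa-style dynamic masking: each position is selected with
    # probability mlm_prob; of the selected positions, 80% become the
    # mask token, 10% a random token, and 10% stay unchanged.
    # Labels are -100 (ignored by the loss) for unselected positions.
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
    return inputs, labels

# Illustrative usage with dummy token IDs:
blocks = chunk_tokens(list(range(300)), block_size=128)
inputs, labels = mask_tokens(blocks[0], vocab_size=50265, mask_id=4)
```

In the actual training script the masking is re-applied every epoch (dynamic masking), which is one of RoBERTa's changes over BERT's static masking.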
The desired project output is a strong RoBERTa model in Portuguese.