RoBERTa in Indonesian
There are currently only a few Indonesian RoBERTa models that have been trained from scratch. We aim to pre-train a strong Indonesian RoBERTa language model that can then be fine-tuned on downstream tasks.
Language
The model will be trained in Indonesian (Bahasa Indonesia).
Model
A randomly initialized RoBERTa model, pre-trained from scratch with the masked language modeling (MLM) objective.
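As a minimal sketch, the model can be instantiated from a fresh config rather than from a pre-trained checkpoint; the hyperparameters below are illustrative RoBERTa-base defaults, and the vocabulary size is an assumption that must match the tokenizer we eventually train:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Illustrative RoBERTa-base-like configuration; vocab_size is an
# assumption and must match the tokenizer trained on the corpus.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
)

# Instantiating from a config (instead of from_pretrained) gives
# randomly initialized weights.
model = FlaxRobertaForMaskedLM(config, seed=0)
```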
Datasets
The OSCAR corpus includes an Indonesian subset, which is available through Hugging Face Datasets, as in the sketch below.
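A minimal loading sketch, assuming the deduplicated Indonesian config name used by OSCAR on the Hugging Face Hub:

```python
from datasets import load_dataset

# Deduplicated Indonesian subset of OSCAR on the Hugging Face Hub.
dataset = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

# Each example is a dict with a "text" field.
print(dataset[0]["text"][:200])
```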
Alternatively, there are publicly available datasets such as:
- Indo4B by IndoBenchmark
Training scripts
We can make use of the Flax masked language modeling script from the Hugging Face Transformers examples (examples/flax/language-modeling/run_mlm_flax.py) to pre-train the model. Since the model is trained from scratch, a byte-level BPE tokenizer also has to be trained on the corpus first; a sketch follows.
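A sketch of the tokenizer training step, assuming the OSCAR subset above and an illustrative output directory; the vocabulary size is an assumption that must agree with the model config:

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

# RoBERTa uses a byte-level BPE tokenizer, trained here from scratch.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (example["text"] for example in dataset),
    vocab_size=50265,  # assumption; must match RobertaConfig.vocab_size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Hypothetical output directory, reused later as the model directory.
os.makedirs("indonesian-roberta", exist_ok=True)
tokenizer.save_model("indonesian-roberta")  # writes vocab.json, merges.txt
```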
(Optional) Challenges
The deduplicated Indonesian OSCAR subset is roughly 16GB, which may not be the best option available. The Indo4B dataset, on the other hand, totals around 23GB.
(Optional) Desired project outcome
The desired project outcome is a strong Indonesian RoBERTa model. To benchmark the results, we can fine-tune the resulting model on the downstream Indonesian natural language understanding tasks provided by IndoBenchmark, as sketched below.
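A sketch of the benchmarking setup, assuming the IndoNLU datasets on the Hugging Face Hub ("smsa" is the sentiment analysis subset) and the hypothetical local checkpoint directory from pre-training:

```python
from datasets import load_dataset
from transformers import FlaxRobertaForSequenceClassification

# SmSA: Indonesian sentiment analysis task from IndoNLU.
smsa = load_dataset("indonlu", "smsa")

# Load the pre-trained checkpoint (hypothetical local path) with a
# fresh classification head; 3 labels is an assumption matching SmSA
# (positive / neutral / negative).
model = FlaxRobertaForSequenceClassification.from_pretrained(
    "indonesian-roberta", num_labels=3
)
```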
(Optional) Reads
The following links can be useful to better understand the project and what has previously been done.