RoBERTa in Indonesian
There are currently only a few Indonesian RoBERTa models trained from scratch. We aim to create a strong Indonesian RoBERTa language model, to be used for further fine-tuning.
The model will be trained in Indonesian (Bahasa Indonesia).
A randomly initialized RoBERTa model.
There is an Indonesian OSCAR dataset, which is also available through Hugging Face Datasets.
Alternatively, there are publicly available datasets such as Indo4B.
We can make use of the Flax MLM script from Hugging Face to train the model.
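An illustrative invocation is sketched below. The script lives in the `transformers` repository under `examples/flax/language-modeling/run_mlm_flax.py`; the hyperparameters here are placeholders for discussion, not tuned values from this project.

```shell
# Sketch: launching Flax MLM pre-training on the deduplicated Indonesian OSCAR subset.
python run_mlm_flax.py \
    --output_dir="./roberta-base-id" \
    --model_type="roberta" \
    --config_name="./roberta-base-id" \
    --tokenizer_name="./roberta-base-id" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_id" \
    --max_seq_length="128" \
    --per_device_train_batch_size="64" \
    --learning_rate="3e-4" \
    --warmup_steps="1000" \
    --overwrite_output_dir \
    --num_train_epochs="8"
```

The config and tokenizer directories would need to be prepared beforehand (e.g. a ByteLevel BPE tokenizer trained on the same corpus).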
The deduplicated Indonesian OSCAR dataset is about 16GB in size, which may not be the best option available. The Indo4B dataset, by comparison, totals around 23GB.
(Optional) Desired project outcome
The desired project output is a strong RoBERTa model in Indonesian. To benchmark the results, we can fine-tune the resulting model on downstream Indonesian NLU tasks provided by IndoBenchmark.
The following links can be useful to better understand the project and what has previously been done.
Great Idea, would be great to expand the Indonesian NLP community. Count me in.
Awesome, thanks for the suggestions!
Each team will only be provided one TPU v3 with 4 chips. Will that be enough to train the largest model from scratch?
Correct me if I'm wrong, but according to the official announcement we are getting a TPU v3-8 VM, which has 8 cores in total. I'm not entirely sure of its capacity for training a very large language model, but other proposals have similarly planned to pre-train language models from scratch, just in different languages. Let me know if I'm mistaken, though.
Yes, it is a TPU v3-8 with 4 chips (8 cores in total), roughly equivalent to 4 V100 GPUs.
Anyway, regardless of its capability, we can still give it a try; at the very least we'll gain some learning experience.
I am in for this project.
Awesome! Let’s officially define this project.
Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.