Pretrain RoBERTa from scratch in Indonesian

RoBERTa in Indonesian

There are currently only a few Indonesian RoBERTa models trained from scratch. We aim to create a strong Indonesian RoBERTa language model, to be used for further fine-tuning.

Language

The model will be trained in Indonesian (Bahasa Indonesia).

Model

A randomly initialized RoBERTa model.
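To get a feel for how many weights would be initialized randomly, here is a rough back-of-the-envelope count. The proposal does not state a model size, so the defaults below are an assumption: they mirror the commonly used RoBERTa-base shape, and the `EncoderShape` and `count_mlm_params` names are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class EncoderShape:
    # Assumed RoBERTa-base-like shape; the thread does not fix these values.
    vocab_size: int = 50265
    hidden_size: int = 768
    num_layers: int = 12
    intermediate_size: int = 3072
    max_positions: int = 514
    type_vocab_size: int = 1


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale plus bias of a LayerNorm."""
    return 2 * n


def count_mlm_params(s: EncoderShape) -> int:
    """Approximate parameter count of an encoder with a masked-LM head."""
    h = s.hidden_size
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (s.vocab_size + s.max_positions + s.type_vocab_size) * h + layer_norm(h)
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear(h, h) + layer_norm(h)
    # Two dense layers of the feed-forward block, followed by a LayerNorm.
    feed_forward = linear(h, s.intermediate_size) + linear(s.intermediate_size, h) + layer_norm(h)
    encoder = s.num_layers * (attention + feed_forward)
    # MLM head: dense + LayerNorm + output bias (output weights tied to the embeddings).
    mlm_head = linear(h, h) + layer_norm(h) + s.vocab_size
    return embeddings + encoder + mlm_head


if __name__ == "__main__":
    total = count_mlm_params(EncoderShape())
    print(f"{total / 1e6:.1f}M parameters")
```

With these assumed defaults the count comes out at roughly 125M parameters. A vocabulary trained on Indonesian text would likely have a different size, which would change the embedding share of that total.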

Datasets

There is an Indonesian OSCAR dataset, which is also available in Hugging Face’s Datasets.
Alternatively, there are publicly available datasets such as:

Training scripts

We can make use of the Flax MLM script from Hugging Face to train the model.
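For anyone new to the MLM objective, here is a minimal sketch of the masking idea in plain Python. It is not the code from the Hugging Face script, only an illustration of the usual recipe: pick about 15% of tokens, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged. The function name, the `mask_id` and `special_ids` values, and the `-100` "ignore" label are assumptions made for this example.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=frozenset(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence of token ids.

    labels hold the original id at positions the model must predict,
    and -100 (assumed "ignore" value) everywhere else.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; skip positions that are not sampled.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical ids: 0 = <s>, 2 = </s>, 4 = <mask>.
    sentence = [0, 101, 102, 103, 104, 105, 2]
    masked, targets = mask_tokens(sentence, mask_id=4, vocab_size=1000,
                                  special_ids={0, 2}, rng=random.Random(0))
    print(masked)
    print(targets)
```

Because the masking is sampled every time the function is called, the same sentence gets a different mask each time it is seen, which is the dynamic masking behaviour RoBERTa is known for.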

(Optional) Challenges

The deduplicated Indonesian OSCAR dataset is 16GB in size, which may not be the best option available. The Indo4B dataset, on the other hand, totals 23GB.

(Optional) Desired project outcome

The desired project outcome is a strong RoBERTa model in Indonesian. To benchmark the results, we can fine-tune the resulting model on the downstream Indonesian language tasks provided by IndoBenchmark.

(Optional) Reads

The following links can be useful to better understand the project and what has previously been done.


Great idea! It would be great to expand the Indonesian NLP community. Count me in.


we could use

as alternative


Awesome, thanks for the suggestions!

Each team will only be provided with one TPU-V3 with 4 TPUs. Will that be enough to train the largest model from scratch?

Correct me if I’m wrong, but what I read in the official announcement is that we are going to get a TPUv3-8 VM, which has 8 cores in total. I’m not entirely sure of its maximum capability for training a very large language model, but other proposals have similarly proposed pre-training language models from scratch, only in a different language.

Cheers!

Yes, it is a TPU V3-8 with 4 TPUs and 8 cores, equivalent to 4 V100 GPUs.
Anyway, regardless of its capability, we can still give it a try; at the very least we will gain some learning experience.
I am in for this project.

Awesome! Let’s officially define this project :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or on the sheet if you want to change something.
