RoBERTa in Indonesian
There are currently only a few Indonesian RoBERTa models that have been trained from scratch. We aim to pre-train a strong Indonesian RoBERTa language model that can then be fine-tuned on downstream tasks.
Language
The model will be trained in Indonesian (Bahasa Indonesia).
Model
A randomly initialized RoBERTa model, pre-trained from scratch with the masked language modeling (MLM) objective.
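As a minimal sketch, the model can be instantiated from a fresh config rather than from a pre-trained checkpoint; the hyperparameters below are illustrative RoBERTa-base defaults, and the vocabulary size is an assumption that must match the tokenizer we eventually train:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Illustrative RoBERTa-base-like configuration; vocab_size is an
# assumption and must match the tokenizer trained on the corpus.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
    intermediate_size=3072,
)

# Instantiating from a config (instead of from_pretrained) gives
# randomly initialized weights.
model = FlaxRobertaForMaskedLM(config, seed=0)
```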
Datasets
The OSCAR corpus includes an Indonesian subset, which is available through Hugging Face Datasets, as in the sketch below.
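A minimal loading sketch, assuming the deduplicated Indonesian config name used by OSCAR on the Hugging Face Hub:

```python
from datasets import load_dataset

# Deduplicated Indonesian subset of OSCAR on the Hugging Face Hub.
dataset = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

# Each example is a dict with a "text" field.
print(dataset[0]["text"][:200])
```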
Alternatively, there are publicly available datasets such as:
- Indo4B by IndoBenchmark
Training scripts
We can make use of the Flax masked language modeling script from the Hugging Face Transformers examples (examples/flax/language-modeling/run_mlm_flax.py) to pre-train the model. Since the model is trained from scratch, a byte-level BPE tokenizer also has to be trained on the corpus first; a sketch follows.
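A sketch of the tokenizer training step, assuming the OSCAR subset above and an illustrative output directory; the vocabulary size is an assumption that must agree with the model config:

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

# RoBERTa uses a byte-level BPE tokenizer, trained here from scratch.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (example["text"] for example in dataset),
    vocab_size=50265,  # assumption; must match RobertaConfig.vocab_size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Hypothetical output directory, reused later as the model directory.
os.makedirs("indonesian-roberta", exist_ok=True)
tokenizer.save_model("indonesian-roberta")  # writes vocab.json, merges.txt
```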
(Optional) Challenges
The deduplicated Indonesian OSCAR subset is roughly 16GB, which may not be the best option available. The Indo4B dataset, on the other hand, totals around 23GB.
(Optional) Desired project outcome
The desired project outcome is a strong Indonesian RoBERTa model. To benchmark the results, we can fine-tune the resulting model on the downstream Indonesian natural language understanding tasks provided by IndoBenchmark, as sketched below.
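A sketch of the benchmarking setup, assuming the IndoNLU datasets on the Hugging Face Hub ("smsa" is the sentiment analysis subset) and the hypothetical local checkpoint directory from pre-training:

```python
from datasets import load_dataset
from transformers import FlaxRobertaForSequenceClassification

# SmSA: Indonesian sentiment analysis task from IndoNLU.
smsa = load_dataset("indonlu", "smsa")

# Load the pre-trained checkpoint (hypothetical local path) with a
# fresh classification head; 3 labels is an assumption matching SmSA
# (positive / neutral / negative).
model = FlaxRobertaForSequenceClassification.from_pretrained(
    "indonesian-roberta", num_labels=3
)
```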
(Optional) Reads
The following links can be useful to better understand the project and what has previously been done.