Pretrain GPT2 from scratch in Indonesian

GPT2 for Indonesian

Currently, there are only a few GPT2 models on the hub that were trained from scratch for Indonesian, and they are small size only. For this project, the goal is to create a strong language generation model for Indonesian.


A randomly initialized GPT2 model
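
As a rough sketch of what "randomly initialized" means in practice, the model can be built from a configuration instead of from pretrained weights. This assumes the `transformers` library with Flax support is installed. The helper names and the default vocabulary size are illustrative assumptions, not values fixed by this proposal; the other numbers are the standard small GPT2 sizes.

```python
def gpt2_small_config_kwargs(vocab_size=50257):
    """Hyperparameters of the small GPT2 architecture (hypothetical helper)."""
    return {
        # Assumption: should be set to the size of the tokenizer
        # trained on the Indonesian corpus.
        "vocab_size": vocab_size,
        "n_positions": 1024,  # maximum sequence length
        "n_embd": 768,        # hidden size
        "n_layer": 12,        # number of transformer blocks
        "n_head": 12,         # attention heads per block
    }


def build_random_gpt2(seed=0, **overrides):
    """Create a GPT2 model with random weights; no checkpoint is downloaded."""
    # Imported lazily so the config helper above works without Flax installed.
    from transformers import GPT2Config, FlaxGPT2LMHeadModel

    config = GPT2Config(**{**gpt2_small_config_kwargs(), **overrides})
    # Passing only a config (no from_pretrained) gives random initial weights.
    return FlaxGPT2LMHeadModel(config, seed=seed)
```

Calling `build_random_gpt2(vocab_size=...)` with the tokenizer's vocabulary size would give a starting point for pretraining. Larger variants would need different `n_embd`, `n_layer`, and `n_head` values.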


One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.

A portion of mC4 [MC4] (The C4 Multilingual Dataset · Discussion #5265 · allenai/allennlp · GitHub) and
CC100 [CC-100: Monolingual Datasets from Web Crawl Data] can also be used for this task.

Available training scripts

A causal language modeling script for Flax is available here. It can be used pretty much without any required code changes.

(Optional) Desired project outcome

The desired project output is a GPT2 model that is able to generate Indonesian text.

(Optional) Challenges

The mC4 and CC100 datasets are quite large → around 350 GB. One might want to try out dataset streaming to solve this problem. Dataset streaming will be merged into datasets in a couple of days. See PR here and the docs here.

Since mC4 is not available in datasets yet, adjustments for dataset streaming could be required.
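
A minimal sketch of the streaming idea, shown on OSCAR because it is the dataset mentioned above as already available through the datasets library. It assumes a `datasets` release that includes the `streaming=True` option, and that `unshuffled_deduplicated_id` is the Indonesian OSCAR configuration. The helper names are hypothetical.

```python
from itertools import islice


def take(examples, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))


def stream_oscar_indonesian():
    """Open the Indonesian OSCAR subset as a stream instead of downloading it."""
    # Lazy import; assumes a `datasets` version that supports streaming.
    from datasets import load_dataset

    # Assumption: "unshuffled_deduplicated_id" is the Indonesian config.
    return load_dataset(
        "oscar",
        "unshuffled_deduplicated_id",
        split="train",
        streaming=True,  # examples are fetched lazily while iterating
    )
```

With network access, `take(stream_oscar_indonesian(), 3)` would fetch only the first three examples instead of the full dataset, so the 350 GB never has to sit on disk at once.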

(Optional) Links to read upon

The most important read would be the following colab:

Thanks for creating the proposal, I am in.

Thanks for the proposal, count me in!

Hi, thank you for the proposal, I am in.

Thanks for the proposal, I’m in!

Hopefully we can pretrain gpt2-large using the datasets mentioned above within only one week :slight_smile:
Btw, to make our discussion easier, we will use Telegram again.

Cool idea and great description - we can probably already fix this team on Monday :slight_smile:

We received an additional 1.2 GB dataset of “indonesian-journals”, which is a collection of journals/papers from several Indonesian universities.

Awesome, let’s officially define this project! :raised_hands:
Added everybody to the official sheet here. :slightly_smiling_face:

Thanks @Suzana. May I ask you to also add @munggok and @ayameRushia to the official sheet please (I can’t update it myself)? Thanks.

May I join too?

Of course :slight_smile:

Done! @cahya

Hi, thank you for the proposal, count me in.

Done! @muhammadfhadli

The tutorial on towardsdatascience specified above is outdated, so I have made an updated version in Colab :wink: Google Colaboratory