GPT2 for Indonesia
Currently, there are only a few GPT2 models trained from scratch for Indonesian on the Hub, and only in the small size: Hugging Face – The AI community building the future. For this project, the goal is to create a strong language generation model for Indonesian.
Model
A randomly initialized GPT2 model
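For illustration, such a model can be created by building a GPT2Config and instantiating the Flax model class directly; the hyperparameters below are the standard small-size values and are meant as a sketch, not a fixed choice:

```python
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# Small-size GPT2 hyperparameters; vocab_size should match the tokenizer
# trained on the Indonesian corpus.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Flax models in transformers initialize their weights at construction time,
# so this yields a randomly initialized (untrained) model.
model = FlaxGPT2LMHeadModel(config, seed=42)
```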
Datasets
One can make use of OSCAR; the dataset is also available through the datasets library here: oscar · Datasets at Hugging Face (a loading sketch follows this list).
A portion of mC4 [MC4] (The C4 Multilingual Dataset · Discussion #5265 · allenai/allennlp · GitHub) and
CC100 [CC-100: Monolingual Datasets from Web Crawl Data] can also be used for this task.
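As a minimal sketch, the Indonesian portion of OSCAR can be loaded through the datasets library; unshuffled_deduplicated_id is the deduplicated Indonesian config listed on the OSCAR dataset page:

```python
from datasets import load_dataset

# Deduplicated Indonesian subset of OSCAR.
oscar_id = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

print(oscar_id[0]["text"][:200])  # peek at the first document
```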
Available training scripts
A causal language modeling script for Flax is available here. It can be used pretty much without any code changes.
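The script expects an existing tokenizer, so one preparatory step is typically training one on the Indonesian corpus. Below is a hedged sketch using the tokenizers library; the vocabulary size, batch size, and output path are illustrative assumptions:

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

def batch_iterator(batch_size=1000):
    # Yield the raw text column in batches to avoid holding it all in memory.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Byte-level BPE, as used by the original GPT2.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

tokenizer.save("./gpt2-indonesian/tokenizer.json")  # hypothetical output path
```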
(Optional) Desired project outcome
The desired project output is a GPT2 model that is able to generate fluent Indonesian text.
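As a sketch of what that outcome could look like in use (the model id flax-community/gpt2-indonesian below is hypothetical, purely for illustration):

```python
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

model_id = "flax-community/gpt2-indonesian"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxGPT2LMHeadModel.from_pretrained(model_id)

# "Jakarta adalah" means "Jakarta is"; sample a short continuation.
inputs = tokenizer("Jakarta adalah", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```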
(Optional) Challenges
The mC4 and CC100 datasets are quite large → around 350 GB. One might want to try out dataset streaming to solve this problem (a streaming sketch is shown after this list). Dataset streaming will be merged into datasets
in a couple of days. See the PR here and the docs here.
Since mC4 is not available in datasets yet, some adjustments could be required to stream it.
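For OSCAR, which is already in datasets, a streaming setup could look roughly like the sketch below once the feature is merged; it assumes the streaming=True flag described in the linked docs:

```python
from datasets import load_dataset

# With streaming=True the corpus is read on the fly
# instead of being downloaded to disk first.
streamed = load_dataset(
    "oscar", "unshuffled_deduplicated_id", split="train", streaming=True
)

# A streamed dataset is iterated rather than indexed.
for i, example in enumerate(streamed):
    print(example["text"][:100])
    if i == 2:
        break
```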
(Optional) Links to read upon
The most important read would be the following Colab: