Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Bengali. The goal of this project is to create a strong Bengali language generation model by pretraining the GPT2 architecture from scratch.
The starting point is a randomly initialized GPT2 model.
One can make use of the OSCAR dataset, which is also available through the datasets library (oscar · Datasets at Hugging Face). The total Bengali resource in OSCAR is 11 GB.
Another source can be the mC4 dataset, which is made available by AllenAI. The Bengali resource size there is 29 GB.
A causal language modeling script for Flax is available here. It can be adapted for training GPT2.
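One core preprocessing step in such causal LM scripts is grouping: tokenized examples are concatenated and cut into fixed-size blocks so every training example is exactly block_size tokens long. A pure-Python sketch of that step (the actual script does the equivalent with batched dataset maps):

```python
def group_texts(token_lists, block_size):
    """Concatenate tokenized examples and split into fixed-size blocks."""
    concatenated = [tok for toks in token_lists for tok in toks]
    # Drop the trailing remainder that doesn't fill a full block.
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]

blocks = group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

For causal LM training, the labels are simply the inputs shifted by one position, so no extra label column is needed.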
- Settle on a good tokenizer that covers the Bengali vocabulary properly, and make sure that the LM does not degenerate into a character-level LM.
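A sketch of training a byte-level BPE tokenizer on Bengali text with the tokenizers library, plus a quick sanity check that it is not operating at the character (or byte) level. The tiny corpus and vocabulary size here are illustrative only.

```python
from tokenizers import ByteLevelBPETokenizer

# Toy corpus; repeated so BPE merges have enough pair frequency to learn from.
corpus = [
    "আমি বাংলায় গান গাই",
    "বাংলা ভাষা দক্ষিণ এশিয়ার একটি ভাষা",
] * 50

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=2)

enc = tokenizer.encode("আমি বাংলায় গান গাই")
# Bengali characters are 3 bytes each in UTF-8, so a byte-level tokenizer
# with no learned merges would emit roughly 3 tokens per character.
# Far fewer tokens than characters means the merges are doing their job.
print(len(enc.tokens))
```

On the real 11 GB corpus one would use a much larger vocabulary (e.g. in the tens of thousands) and could track this tokens-per-character ratio as the sanity check described above.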
The desired project output is a GPT2 model that is able to generate fluent Bengali text.
The most important read would be the following Colab notebook:
Other reads that might be interesting include: