Pretrain T5 from scratch in Bengali

1. T5 for Bengali

Currently, there is no T5 model on the Hub that was trained from scratch for Bengali. The goal of this project is to create a strong language generation model for Bengali using the T5 architecture.

2. Language

Bengali.

3. Model

A randomly initialized T5 model.
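
Such a model can be instantiated from a config rather than from a pretrained checkpoint. A minimal sketch, assuming the transformers Flax classes; the size and vocabulary values below are illustrative placeholders, not the project's final choices:

```python
# Minimal sketch: build a randomly initialized T5 model in Flax.
# All hyperparameters here are illustrative assumptions (t5-base-like sizes);
# vocab_size must match the Bengali tokenizer trained later.
from transformers import T5Config, FlaxT5ForConditionalGeneration

config = T5Config(
    vocab_size=32_000,
    d_model=768,
    d_ff=3072,
    num_layers=12,
    num_heads=12,
)

# Constructing from a config (instead of from_pretrained) gives random weights.
model = FlaxT5ForConditionalGeneration(config, seed=0)
```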

4. Datasets

One can make use of OSCAR; the dataset is also available through the datasets library here: oscar · Datasets at Hugging Face. The total Bengali resource in OSCAR is 11 GB.

Another source is the mC4 dataset, which is available from AllenAI. The Bengali resource size is 29 GB.
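
Both corpora can be pulled with the datasets library. A small sketch, where the config names ("unshuffled_deduplicated_bn" for OSCAR, "bn" for mC4) are assumptions based on the usual per-language naming and should be checked against the dataset cards; mC4 is large, so streaming may be preferable:

```python
# Sketch: load the Bengali portions of OSCAR and mC4.
from datasets import load_dataset

# OSCAR Bengali (~11 GB); config name is an assumption, see the dataset card.
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

# mC4 Bengali (~29 GB); stream to avoid downloading everything up front.
mc4_bn = load_dataset("mc4", "bn", split="train", streaming=True)

print(oscar_bn)
print(next(iter(mc4_bn)))
```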

5. Training scripts

A causal language modeling script for Flax is available here. It can be tweaked for training T5.
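
One thing to keep in mind when tweaking it: T5 is an encoder-decoder trained with a span-corruption (denoising) objective rather than next-token prediction, so both the model class and the batch format change. A rough sketch of the differences, with an illustrative (not prescriptive) example:

```python
# Sketch: what changes when moving from the causal LM script to T5.
# Model: swap the causal LM class for the seq2seq T5 class.
from transformers import T5Config, FlaxT5ForConditionalGeneration

config = T5Config(vocab_size=32_000)  # placeholder sizes
model = FlaxT5ForConditionalGeneration(config, seed=0)

# Data: the causal script feeds plain `input_ids` and predicts the next token.
# For T5, each example is split into a corrupted input and a target that
# reconstructs the dropped spans via sentinel tokens, e.g.:
#   input : "আমি <extra_id_0> ভালোবাসি"
#   target: "<extra_id_0> বাংলা <extra_id_1>"
# so the data collator must emit `input_ids`, `labels`, and `decoder_input_ids`.
```

Note that transformers also ships a dedicated T5 span-corruption example for Flax (run_t5_mlm_flax.py under examples/flax/language-modeling), which may be a better starting point than adapting the causal script.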

6. Challenges

  • Adapt the training script to T5
  • Build a good tokenizer that covers the Bengali vocabulary properly, and make sure the LM does not degenerate into a character-level LM (see the sketch after this list).
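
For the tokenizer point, a minimal sketch assuming a SentencePiece-style Unigram tokenizer trained with the tokenizers library; the file path, vocabulary size, and special tokens are assumptions:

```python
# Sketch: train a Unigram (SentencePiece-style) tokenizer on the Bengali corpus
# so common Bengali words map to whole subword tokens instead of characters.
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train(
    files=["oscar_bn.txt"],  # hypothetical plain-text dump of the corpus
    vocab_size=32_000,
    special_tokens=["<pad>", "</s>", "<unk>"]
    + [f"<extra_id_{i}>" for i in range(100)],
)
tokenizer.save("bengali-t5-tokenizer.json")

# Sanity check: a common sentence should not explode into single characters.
print(tokenizer.encode("আমি বাংলায় গান গাই").tokens)
```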

7. Desired project outcome

The desired project outcome is a T5 model that can generate fluent Bengali text.
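
To make that concrete, the finished model should support something like the following; the checkpoint name and prompt are placeholders:

```python
# Sketch: generate Bengali text from the (hypothetical) pretrained checkpoint.
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("bengali-t5")             # placeholder name
model = FlaxT5ForConditionalGeneration.from_pretrained("bengali-t5")

inputs = tokenizer("বাংলাদেশের রাজধানী <extra_id_0>", return_tensors="np")
outputs = model.generate(inputs["input_ids"], max_length=20)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```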

8. Reads

The most important read would be the following Colab notebook:

Apart from that, we may need to look at the seqio library and the source code of T5 here.


I am also a Bengali speaker. I am in!


Sounds great, let’s finalize it!

Any update on this? Is there a repo for this project?

@sbmaruf Any update on this? Were you able to use seqio for T5 in the Hugging Face training script?

We trained T5 with the Hugging Face Flax script, but the performance of the language model on downstream tasks was very poor.

With the same data, the causal model worked fine.

More on the T5 convergence issues I found later on:
https://github.com/huggingface/transformers/issues/13335