Currently, there is no T5 model on the Hub that was trained from scratch for Bengali. The goal of this project is to create a strong language generation model for Bengali using the T5 architecture.
The starting point is a randomly initialized T5 model.
One can make use of OSCAR; the dataset is also available through the `datasets` library (oscar · Datasets at Hugging Face). The total Bengali resource in OSCAR is 11 GB.
Another source can be the mC4 dataset, made available by AllenAI. Its Bengali portion is 29 GB.
A causal language modeling script for Flax is available here. It can be tweaked for training T5.
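The main tweak when adapting the causal-LM script is the training objective: T5 is pretrained with span-corruption denoising, where random spans of the input are replaced by sentinel tokens and the target reconstructs them. A self-contained sketch of the idea; the function and the sentinel convention here are illustrative, not the script's actual implementation:

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Replace random spans with sentinel markers, T5-style.

    Returns (inputs, targets): inputs keep unmasked tokens plus one
    sentinel per masked span; targets list each sentinel followed by
    the tokens it replaced.
    """
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, round(n * noise_density))
    masked = set()
    while len(masked) < num_to_mask:
        start = rng.randrange(n)
        for i in range(start, min(n, start + mean_span_len)):
            if len(masked) < num_to_mask:
                masked.add(i)
    inputs, targets = [], []
    sentinel = 0
    i = 0
    while i < n:
        if i in masked:
            # One sentinel per contiguous masked span.
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < n and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

toks = ["আমি", "বাংলায়", "গান", "গাই", "।"]
inp, tgt = span_corrupt(toks, noise_density=0.4)
```

In the real script this would operate on token IDs and be batched, but the input/target construction is the same.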
- Adapt the training script to T5
- Find a good tokenizer that covers the Bengali vocabulary properly, and make sure the LM does not degrade into a character-level LM.
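For the tokenizer challenge, one option is to train a SentencePiece-style unigram tokenizer (the variant T5 uses) on the Bengali corpus with the `tokenizers` library. A sketch on a toy in-memory corpus; the vocab size and special tokens are illustrative:

```python
from tokenizers import SentencePieceUnigramTokenizer

# Toy corpus; in practice, stream sentences from OSCAR/mC4.
corpus = [
    "আমার সোনার বাংলা, আমি তোমায় ভালোবাসি।",
    "চিরদিন তোমার আকাশ, তোমার বাতাস, আমার প্রাণে বাজায় বাঁশি।",
    "বাংলা ভাষা দক্ষিণ এশিয়ার একটি প্রধান ভাষা।",
    "আমি বাংলায় গান গাই, আমি বাংলার গান গাই।",
]

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=200,  # real runs would use ~32k so common words stay whole
    special_tokens=["<pad>", "</s>", "<unk>"],
)

enc = tokenizer.encode("আমি বাংলায় গান গাই।")
```

Inspecting `enc.tokens` on held-out Bengali text is a quick check: if most tokens are single characters, the vocabulary is too small or poorly fitted.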
The desired project output is a T5 model that is able to generate fluent Bengali text.
The most important read would be the following colab:
Apart from that, we may need to look at the seqio library and the source code of T5 here.