T5 for Bengali
Currently, the hub has no T5 model that was trained from scratch for Bengali. The goal of this project is to train a strong Bengali language generation model based on the T5 architecture.
2. Language
Bengali.
3. Model
A randomly initialized T5 model.
4. Datasets
One option is OSCAR, which is also available through the datasets library here: oscar · Datasets at Hugging Face. The total Bengali resource in OSCAR is 11 GB.
Another source is the mC4 dataset, which is available from AllenAI. Its resource size is 29 GB.
5. Training scripts
A causal language modeling script for Flax is available here. It can be tweaked for training T5.
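The main change when moving from a causal LM script to T5 is the training objective: T5 is pretrained with span corruption, where spans of the input are replaced by sentinel tokens and the target reproduces the dropped spans. Below is a minimal sketch of that transformation on plain token lists. The function name `span_corrupt` and the explicit `spans` argument are assumptions made for illustration; a real data collator would sample the spans randomly (the T5 paper uses a 15% corruption rate with a mean span length of 3) and work on token IDs. The `<extra_id_N>` naming follows the sentinel tokens of the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Apply T5-style span corruption to a list of tokens.

    `spans` is a sorted list of non-overlapping, half-open (start, end)
    index ranges to mask. Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        # Keep the untouched tokens before the span, then mark the gap.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target repeats the sentinel followed by the dropped tokens.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    # Keep whatever follows the last span.
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


# Example sentence from the T5 paper, split on whitespace for simplicity.
tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The same logic applies unchanged to Bengali token sequences, since it only operates on list positions.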
6. Challenges
- Adapt the training script to T5
- Build a good tokenizer that covers the Bengali vocabulary properly, and make sure that the LM doesn’t degenerate into a character-level LM.
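One rough way to check the tokenizer challenge above is to measure how many characters each token covers on a held-out sample, together with the share of unknown tokens. The sketch below is only an illustration: `tokenizer_stats` is a hypothetical helper, the `<unk>` string is an assumed unknown-token symbol, and the two lambda tokenizers are stand-ins for a real trained tokenizer's tokenize function.

```python
def tokenizer_stats(tokenize, texts, unk_token="<unk>"):
    """Return average characters per token and unknown-token rate.

    `tokenize` is any callable mapping a string to a list of token strings.
    Characters are counted as Unicode code points, excluding spaces, so
    Bengali vowel signs count as separate characters.
    """
    total_chars = total_tokens = total_unk = 0
    for text in texts:
        pieces = tokenize(text)
        total_chars += len(text.replace(" ", ""))
        total_tokens += len(pieces)
        total_unk += sum(1 for piece in pieces if piece == unk_token)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return {
        "chars_per_token": total_chars / total_tokens,
        "unk_rate": total_unk / total_tokens,
    }


sample = ["আমি বাংলায় গান গাই"]

# Stand-in tokenizers: one purely character-level, one whitespace-based.
char_stats = tokenizer_stats(lambda t: list(t.replace(" ", "")), sample)
word_stats = tokenizer_stats(lambda t: t.split(), sample)
print(char_stats)
print(word_stats)
```

A `chars_per_token` value close to 1 suggests the tokenizer is falling back to single characters, while a high `unk_rate` suggests the vocabulary does not cover the Bengali script well. What counts as acceptable for either number is a judgment call that depends on the vocabulary size.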
7. Desired project outcome
The desired project outcome is a T5 model that can generate Bengali text.
8. Reads
The most important read would be the following colab:
Apart from that, we may need to look at the seqio library and the source code of T5 here.