Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Bengali. The goal of this project is to create a strong Bengali language generation model by pretraining the GPT2 architecture from scratch.
The starting point is a randomly initialized GPT2 model.
One can make use of the OSCAR dataset, which is also available through the datasets library (oscar · Datasets at Hugging Face). The total Bengali resource in OSCAR is 11 GB.
Another source can be the mC4 dataset, which is made available by AllenAI. The Bengali resource size there is 29 GB.
A causal language modeling script for Flax is available here. It can be adapted for training GPT2.
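One core preprocessing step in such causal LM scripts is grouping: tokenized examples are concatenated and cut into fixed-size blocks so every training example is exactly block_size tokens long. A pure-Python sketch of that step (the actual script does the equivalent with batched dataset maps):

```python
def group_texts(token_lists, block_size):
    """Concatenate tokenized examples and split into fixed-size blocks."""
    concatenated = [tok for toks in token_lists for tok in toks]
    # Drop the trailing remainder that doesn't fill a full block.
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]

blocks = group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

For causal LM training, the labels are simply the inputs shifted by one position, so no extra label column is needed.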
- Settle on a good tokenizer that covers the Bengali vocabulary properly, and make sure that the LM does not degenerate into a character-level LM.
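A sketch of training a byte-level BPE tokenizer on Bengali text with the tokenizers library, plus a quick sanity check that it is not operating at the character (or byte) level. The tiny corpus and vocabulary size here are illustrative only.

```python
from tokenizers import ByteLevelBPETokenizer

# Toy corpus; repeated so BPE merges have enough pair frequency to learn from.
corpus = [
    "আমি বাংলায় গান গাই",
    "বাংলা ভাষা দক্ষিণ এশিয়ার একটি ভাষা",
] * 50

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=500, min_frequency=2)

enc = tokenizer.encode("আমি বাংলায় গান গাই")
# Bengali characters are 3 bytes each in UTF-8, so a byte-level tokenizer
# with no learned merges would emit roughly 3 tokens per character.
# Far fewer tokens than characters means the merges are doing their job.
print(len(enc.tokens))
```

On the real 11 GB corpus one would use a much larger vocabulary (e.g. in the tens of thousands) and could track this tokens-per-character ratio as the sanity check described above.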
The desired project output is a GPT2 model that is able to generate fluent Bengali text.
The most important read would be the following Colab notebook:
Other reads that might be interesting include: