GPT-2 for Bengali Poetry Generation
Description
In this project, I propose to generate Bengali poetry using a GPT-2 model trained from scratch.
Since a TPU v3-8 will be used, pretraining from scratch should not take long.
Once the base model has been pretrained on general Bengali text, I will fine-tune it on poems written in Bengali.
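As a rough sketch of the pretraining setup, assuming the HuggingFace `tokenizers` and `transformers` libraries (the corpus path, output directory, vocabulary size, and seed below are placeholders, not decided values):

```python
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# Train a byte-level BPE tokenizer on the raw Bengali corpus
# (the corpus file path is hypothetical).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["bangla_wiki_corpus.txt"],
    vocab_size=50_257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
os.makedirs("bengali-gpt2", exist_ok=True)
tokenizer.save_model("bengali-gpt2")

# Initialise a randomly weighted GPT-2 (small) to be pretrained on TPU via Flax.
config = GPT2Config(vocab_size=50_257, n_positions=1024)
model = FlaxGPT2LMHeadModel(config, seed=42)
```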
Datasets
The model can leverage multiple datasets:
- The Bangla Wikipedia Corpus dataset available on Kaggle
- Public-domain poems by poets such as Jibanananda Das, Rabindranath Tagore, Michael Madhusudan Dutt, etc.
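Both sources could be loaded with the HuggingFace `datasets` library once downloaded; a minimal sketch, assuming the corpora have been saved locally as plain-text files (the paths are hypothetical):

```python
from datasets import load_dataset

# Hypothetical local paths after downloading the Kaggle corpus and
# scraping the public-domain poems.
wiki = load_dataset("text", data_files={"train": "data/bangla_wiki/*.txt"})
poems = load_dataset("text", data_files={"train": "data/poems/*.txt"})

print(wiki["train"][0]["text"])  # inspect one line of the Wikipedia corpus
```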
Available Training Scripts
I believe this training script can be used to train the model; a hypothetical launch is sketched below.
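Assuming the script in question is the Flax causal language modeling example from `transformers` (`examples/flax/language-modeling/run_clm_flax.py`), a run might look like the following; the paths and hyperparameters are placeholders, not tuned values:

```python
import subprocess

# Hypothetical launch of the Flax causal-LM example script; the config and
# tokenizer directory come from the tokenizer sketch above, and the data
# files are placeholders.
subprocess.run([
    "python", "run_clm_flax.py",
    "--model_type", "gpt2",
    "--config_name", "bengali-gpt2",
    "--tokenizer_name", "bengali-gpt2",
    "--train_file", "data/bangla_wiki_train.txt",
    "--validation_file", "data/bangla_wiki_valid.txt",
    "--do_train",
    "--do_eval",
    "--block_size", "512",
    "--per_device_train_batch_size", "16",
    "--learning_rate", "3e-4",
    "--num_train_epochs", "10",
    "--output_dir", "bengali-gpt2",
    "--overwrite_output_dir",
], check=True)
```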
Desired project outcome
The goal of the project is to generate Bengali poetry that reads naturally enough to fool readers into thinking a human poet wrote it.
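For illustration, generating from the fine-tuned checkpoint could look like this; the checkpoint name and sampling parameters are hypothetical, and the prompt is simply the opening of a well-known Tagore song:

```python
from transformers import GPT2TokenizerFast, FlaxGPT2LMHeadModel

# Hypothetical checkpoint directory produced by poetry fine-tuning.
tokenizer = GPT2TokenizerFast.from_pretrained("bengali-gpt2-poetry")
model = FlaxGPT2LMHeadModel.from_pretrained("bengali-gpt2-poetry")

inputs = tokenizer("আমার সোনার বাংলা", return_tensors="np")
outputs = model.generate(
    inputs["input_ids"],
    max_length=64,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```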
Challenges
- As no pretrained GPT-2 model for Bengali exists, training one from scratch and then fine-tuning it on poetry will take longer than fine-tuning alone.
- The poems are formatted differently on different source websites, so the scraped data will need to be normalized into a consistent format (see the sketch after this list).
- Some authors (e.g., Tagore) contributed far more voluminously to the literature than others (e.g., Madhusudan Dutt), so their writing style will disproportionately influence the model unless the fine-tuning corpus is rebalanced (also sketched below).
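One possible mitigation for the last two challenges is sketched below; the normalization rules, the per-author cap, and the function names are illustrative choices of mine, not part of any existing pipeline:

```python
import random
import re
from collections import defaultdict

def normalize(text):
    """Collapse whitespace differences left over from different websites."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep stanza breaks, drop extras
    return text.strip()

def rebalance(poems, max_per_author=500, seed=42):
    """poems: iterable of (author, text) pairs.

    Caps each author's contribution so prolific authors do not dominate
    the fine-tuning corpus; returns a shuffled list of cleaned poems.
    """
    by_author = defaultdict(list)
    for author, text in poems:
        by_author[author].append(normalize(text))
    rng = random.Random(seed)
    balanced = []
    for texts in by_author.values():
        rng.shuffle(texts)
        balanced.extend(texts[:max_per_author])
    rng.shuffle(balanced)
    return balanced
```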
Suggested Reading
- Bengali Poem Generation Using Deep Learning Approach
- Poetic Machine: Computational Creativity for Automatic Poetry Generation in Bengali
- Sequence-to-sequence Bangla Sentence Generation with LSTM Recurrent Neural Networks
- Context-driven Bengali Text Generation using Conditional Language Model
- Training an AI to create poetry (NLP Zero to Hero - Part 6)
- NLP with Bangla: Generating Text and Sentiment Analysis