Model for Bengali Poetry Generation with GPT-2


Description

In this project, I propose to generate Bengali poetry using a GPT-2 model trained from scratch.

As a TPU v3-8 will be used, pretraining should not take too long.

After the base model has been pretrained, I will fine-tune it on poems written in Bengali.

Datasets

The model can leverage multiple datasets:

  1. The Bangla Wikipedia Corpus dataset available on Kaggle
  2. Poems in the public domain by poets such as Jibanananda Das, Rabindranath Tagore, Michael Madhusudan Dutt, etc.
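Since the poems will come from several differently formatted sources, some normalization will be needed before training. A minimal sketch (the function and its rules are my own assumptions, not an existing pipeline): NFC-normalize the Unicode, collapse horizontal whitespace while preserving line breaks (they carry the verse structure), and space the Bengali danda (।) consistently.

```python
import re
import unicodedata

def normalize_poem(text: str) -> str:
    """Normalize a scraped Bengali poem (assumed rules; adjust per source).

    - NFC-normalize Unicode so composed/decomposed characters compare equal
    - collapse runs of spaces/tabs, but keep line breaks (they carry meter)
    - no space before the Bengali danda (।), one space after it mid-line
    - collapse runs of blank lines down to a single stanza break
    """
    text = unicodedata.normalize("NFC", text)
    lines = []
    for line in text.splitlines():
        line = re.sub(r"[ \t]+", " ", line).strip()
        line = re.sub(r"\s*।\s*", "। ", line).rstrip()
        lines.append(line)
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip("\n")
```

Each source site would then only need its own scraper; everything downstream sees one canonical text format.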

Available Training Scripts

I believe this training script can be used to train the model.

Desired project outcome

The goal of the project is to generate Bengali poetry coherent enough that readers could mistake it for the work of a human poet.

Challenges

  • As no pretrained Bengali GPT-2 model exists, training it from scratch and then fine-tuning it on poetry may take considerable time.
  • The format of the data differs from website to website.
  • Some authors (e.g. Tagore) have far larger bodies of work than others (e.g. Madhusudan Dutt), so their writing style will disproportionately influence the model.
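One way to soften the author-imbalance problem is temperature-style reweighting when sampling poems for the fine-tuning batches. This is only a sketch of my own (the function, parameter names, and corpus sizes below are illustrative assumptions, not measured figures): weight each author proportionally to `size**alpha`, so `alpha=1` keeps the natural distribution, `alpha=0` samples authors uniformly, and values in between interpolate.

```python
import random

def sampling_weights(sizes: dict, alpha: float = 0.5) -> dict:
    """Temperature-style reweighting: weight ∝ size**alpha, normalized.

    alpha=1.0 keeps the natural (size-proportional) distribution;
    alpha=0.0 samples each author equally; 0 < alpha < 1 interpolates.
    """
    raw = {author: n ** alpha for author, n in sizes.items()}
    total = sum(raw.values())
    return {author: w / total for author, w in raw.items()}

# Hypothetical corpus sizes in lines of verse (illustrative numbers only)
corpus_sizes = {"Tagore": 50_000, "Jibanananda Das": 8_000, "Madhusudan Dutt": 3_000}
weights = sampling_weights(corpus_sizes)

# draw a batch of authors whose poems to sample from
authors = list(weights)
batch = random.choices(authors, weights=[weights[a] for a in authors], k=32)
```

With `alpha=0.5`, Tagore still dominates but far less than his raw share of lines would dictate.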

Suggested Reading


@patrickvonplaten any update on this project being greenlit?

Let’s define it!


You might want to combine it with the Bengali GPT-2 pretraining project too. Discord channel: #bengali-GPT2


I was thinking about reaching out to them, actually.

It would be great to collaborate and share knowledge on common tasks, viz. training GPT-2 for Bengali.

Thanks. I will drop by later.

@Tahsin-Mayeesha