Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Sinhala. The goal of this project is to create a strong language generation model for Sinhala.
A randomly initialized GPT2 model will be used as the starting point.
A causal language modeling script for Flax is available here. It can be used essentially without any code changes.
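As a rough illustration, a training run with that script might be launched along these lines. The local tokenizer/config directory, the OSCAR config name for Sinhala, and all hyperparameters are assumptions to be adjusted; the flag names follow the Flax `run_clm_flax.py` example in the transformers repository:

```shell
# Sketch of a training launch. Assumes a GPT2 config and a Sinhala tokenizer
# have already been trained and saved to ./sinhala-gpt2; the dataset config
# name and hyperparameters below are illustrative, not tuned.
python run_clm_flax.py \
    --model_type gpt2 \
    --config_name ./sinhala-gpt2 \
    --tokenizer_name ./sinhala-gpt2 \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_si \
    --output_dir ./sinhala-gpt2 \
    --do_train --do_eval \
    --block_size 512 \
    --per_device_train_batch_size 16 \
    --learning_rate 3e-4 \
    --num_train_epochs 10
```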
The desired project output is a GPT2 model that is able to generate Sinhala text. A nice generation demo can be built with Streamlit.
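Such a demo could be sketched as follows. The model id `flax-community/sinhala-gpt2` is a hypothetical placeholder for wherever the trained model ends up on the Hub, and the generation parameters are illustrative:

```python
# Minimal sketch of a Streamlit generation demo. The model name below is a
# placeholder, not a confirmed Hub repo id.
import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_generator():
    # Loads the (hypothetical) fine-tuned Sinhala GPT2 once per session.
    return pipeline("text-generation", model="flax-community/sinhala-gpt2")

st.title("Sinhala GPT2 demo")
prompt = st.text_area("Prompt")
if st.button("Generate") and prompt:
    generator = load_generator()
    # Sampling parameters are illustrative defaults, not tuned values.
    outputs = generator(prompt, max_length=100, do_sample=True, top_p=0.95)
    st.write(outputs[0]["generated_text"])
```

Run with `streamlit run app.py`; the app shows a text box and prints one sampled continuation per click.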
The Sinhala portion of OSCAR is comparatively small (~850 MB). There is an option to supplement it with the mC4 dataset.
The most important read is the following Colab: