Pretrain GPT2 for Sinhala from scratch

GPT2 for Sinhala

Currently, there is no GPT2 model on the Hugging Face Hub that was trained from scratch for Sinhala. The goal of this project is to create a good language generation model for Sinhala.

Model

A randomly initialized GPT2 model

Datasets

OSCAR has a ~850 MB Sinhala dataset, which is also available through the datasets library as the oscar dataset on the Hugging Face Hub.
mC4 by AllenAI has a ~3 GB Sinhala subset, which could serve as an additional dataset.
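
For reference, both corpora can be pulled directly with the datasets library. A minimal sketch, assuming the Sinhala configs are named unshuffled_deduplicated_si (OSCAR) and si (mC4), based on Sinhala's ISO 639-1 code:

```python
from datasets import load_dataset

# Assumed config names for the Sinhala subsets; verify them on the Hub first.
oscar_si = load_dataset("oscar", "unshuffled_deduplicated_si", split="train")
mc4_si = load_dataset("mc4", "si", split="train")

print(oscar_si)             # row count and columns of the OSCAR subset
print(oscar_si[0]["text"])  # peek at one raw Sinhala document
print(mc4_si)               # row count and columns of the mC4 subset
```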

Available training scripts

A causal language modeling script for Flax is available in the transformers repository (run_clm_flax.py). It can be used with essentially no code changes.
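
Because the model is trained from scratch, a Sinhala tokenizer and a GPT2 config have to be prepared before launching the script. A minimal sketch of that preparation, assuming the OSCAR config name from above and a local output directory ./gpt2-sinhala (both are placeholder choices):

```python
from pathlib import Path

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2Config

Path("./gpt2-sinhala").mkdir(parents=True, exist_ok=True)

# Raw Sinhala text to learn a byte-level BPE vocabulary from.
dataset = load_dataset("oscar", "unshuffled_deduplicated_si", split="train")

def batch_iterator(batch_size=1_000):
    # Feed the corpus to the tokenizer trainer in batches instead of one huge list.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50_257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("./gpt2-sinhala/tokenizer.json")

# Keep the standard GPT2-small architecture and only swap in the new vocabulary size.
config = GPT2Config.from_pretrained("gpt2", vocab_size=50_257)
config.save_pretrained("./gpt2-sinhala")
```

The script can then be pointed at these artifacts, roughly via arguments like --model_type gpt2, --config_name ./gpt2-sinhala, --tokenizer_name ./gpt2-sinhala, --dataset_name oscar and --dataset_config_name unshuffled_deduplicated_si.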

(Optional) Desired project outcome

The desired project outcome is a GPT2 model that can generate fluent Sinhala text. A nice generation demo can be built for it with Streamlit.
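
As a sketch of what such a demo could look like (your-username/gpt2-sinhala is a placeholder for whatever the trained checkpoint ends up being called on the Hub):

```python
# demo.py, run with: streamlit run demo.py
import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_generator():
    # Load the trained checkpoint once and reuse it across reruns of the app.
    return pipeline("text-generation", model="your-username/gpt2-sinhala")

st.title("Sinhala GPT2 demo")
prompt = st.text_input("Prompt (Sinhala)")

if prompt:
    generator = load_generator()
    output = generator(prompt, max_length=100, do_sample=True, top_p=0.95)
    st.write(output[0]["generated_text"])
```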

(Optional) Challenges

The Sinhala dataset on OSCAR is comparatively small (~850 MB). One option is to supplement it with the mC4 dataset.
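
If both corpora are used, they can be merged with the datasets library. A rough sketch, with the same assumed config names as above; the two datasets carry different metadata columns, so only the shared text column is kept before concatenating:

```python
from datasets import concatenate_datasets, load_dataset

oscar_si = load_dataset("oscar", "unshuffled_deduplicated_si", split="train")
mc4_si = load_dataset("mc4", "si", split="train")

def keep_text_only(ds):
    # Drop dataset-specific metadata (e.g. id, url, timestamp) so the schemas match.
    return ds.remove_columns([c for c in ds.column_names if c != "text"])

# Concatenate the two corpora and shuffle so the sources are mixed during training.
combined = concatenate_datasets([keep_text_only(oscar_si), keep_text_only(mc4_si)])
combined = combined.shuffle(seed=42)
print(combined)
```

The combined dataset could then be saved to disk and fed to the training script in place of a single Hub dataset.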

(Optional) Links to read upon

The most important read would be the following colab:

Let’s give it a try :slight_smile:
