Pretrained GPT2 for Tamil

GPT2 for Tamil

We aim to create a GPT2 model for the Tamil language.

Model

A randomly initialized GPT2 model
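
For a sense of scale, the parameter count of a GPT2 model that would be randomly initialized can be worked out from its configuration. The sketch below assumes the standard GPT-2 small hyperparameters (50257-token vocabulary, 1024 positions, 768-dimensional embeddings, 12 layers) and tied input/output embeddings. None of these values are stated in this proposal, and a Tamil tokenizer would likely use a different vocabulary size.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT2-style decoder (tied output embeddings)."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # two feed-forward layers
    layer_norms = 2 * (2 * d)                       # two LayerNorms per block (scale + bias)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_layer + final_layer_norm


# Assumed GPT-2 small settings; not taken from this proposal.
print(gpt2_param_count())  # 124439808
```

Changing `vocab_size` to match a Tamil tokenizer only affects the token embedding term, so the count shifts by `n_embd` per added or removed token.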

Datasets

The OSCAR dataset (https://oscar-corpus.com/) has about 9 GB of training data, and IndicCorp (AI4Bharat IndicNLP) has a 500-million-token dataset.
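
Causal language modeling pipelines commonly concatenate the tokenized documents and cut the resulting stream into fixed-length blocks before training. Here is a minimal sketch of that step in plain Python. The function name `group_into_blocks` and the block size are hypothetical, chosen for illustration, and the token ids are toy values rather than anything drawn from either dataset.

```python
from itertools import chain


def group_into_blocks(tokenized_docs, block_size):
    """Concatenate token id lists and split them into equal-sized blocks.

    The trailing remainder that does not fill a whole block is dropped.
    """
    stream = list(chain.from_iterable(tokenized_docs))
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Toy token ids standing in for tokenized Tamil documents.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(group_into_blocks(docs, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

At an assumed block size of 512, 500 million tokens would give roughly 976,000 training blocks.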

Available training scripts

A causal language modeling script for Flax is available here. It can be used with almost no code changes.
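
For readers new to the objective: causal language modeling trains the model to predict each token from the tokens before it, scored with cross-entropy. Below is a minimal sketch in plain Python, not the Flax script's actual code. `causal_lm_loss` is a hypothetical helper, and the logits are made-up toy values.

```python
import math


def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy (needs at least two tokens).

    logits[t] holds the model's scores over the vocabulary after reading
    token_ids[: t + 1]; it is scored against token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        target = token_ids[t + 1]
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)


# Uniform scores over a 4-token vocabulary: the loss equals log(4).
toy_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
print(causal_lm_loss(toy_logits, [2, 0, 3]))
```

A randomly initialized model starts near this uniform-guess loss, log of the vocabulary size, and training should push it down from there.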

(Optional) Desired project outcome

The desired project outcome is a GPT2 model that is able to generate Tamil text.

@rays2pix: Hi Deepak. I was hoping that someone would propose a project for Tamil. I would like to collaborate on this!

That’s nice, @abinayam. We can do this if it gets picked. What do you think, @patrickvonplaten?

I cannot help much, but I would be excited to see this happen!

Awesome! Let’s officially define it :slight_smile:

Great. @abinayam, let’s discuss on Discord in #gpt-tamil @100worte.

Link for Discord: Flax-HuggingFace-Community-Week

I cannot contribute to it. But wish you all the best! :innocent: :100: :partying_face:

Thanks, @rays2pix, for introducing me to this post. I would like to collaborate if there is room.

@adithya1111: You’re welcome to join the group. Please join the Discord channel.

Hi @abinayam, though I may not be able to contribute, I would like to join the discussion in the Discord channel. Could you please share the link?

Hi @cishwarya: Here is the channel name on Discord: #gpt-tamil.

Hi @abinayam, thanks for your reply. I do not have the invite for the Discord group. Could you please send me an invite? Thank you so much.

Here is the invite for the discord group: Flax-HuggingFace-Community-Week