GPT2 for Tamil
We aim to create a GPT2 model for the Tamil language.
Model
A randomly initialized GPT2 model
Datasets
The OSCAR dataset (https://oscar-corpus.com/) has about 9 GB of training data, and Indic-NLP (IndicCorp, AI4Bharat IndicNLP) has a dataset of about 500 million tokens.
Available training scripts
A causal language modeling script for Flax is available [here]. It can be used pretty much without any code changes.
(Optional) Desired project outcome
The desired project output is a GPT2 model that is able to generate Tamil text.
@rays2pix: Hi Deepak. I was hoping that someone would propose a project for Tamil. I would like to collaborate on this!
That’s nice, @abinayam. We can do this if it gets picked. What do you think, @patrickvonplaten?
I cannot help much, but I would be excited to see this happen!
Awesome! Let’s officially define it
Great. @abinayam, let’s discuss on Discord in #gpt-tamil @100worte.
Thanks @rays2pix for introducing me to this post. I would like to collaborate if there is room.
@adithya1111: You’re welcome to join the group. Please join the Discord channel.
Hi @abinayam, though I may not be able to contribute, I would like to join the discussion in the Discord channel. Could you please share the link?
Hi @cishwarya: The channel name on Discord is #gpt-tamil.
Hi @abinayam, thanks for your reply. I do not have an invite for the Discord group. Could you please send me one? Thank you so much.
Here is the invite for the Discord group: Flax-HuggingFace-Community-Week