Pre-train GPT-2 from scratch in Persian

Pre-training GPT-2 for Persian:
Description
We want to pre-train GPT-2 from scratch, as described in the OpenAI paper, on high-quality data suited to text generation in Persian. There are two versions of GPT-2 available, both of which are relatively small models, but we want to go a step further and train GPT-2 on a larger dataset with more diverse, higher-quality text.
Dataset
As part of this research effort, we are preparing a Persian mega dataset of roughly 50 to 70 GB of text; this work is still in progress. We may also use the OSCAR dataset, which is already available on Hugging Face Datasets.
Challenges
Besides creating the dataset, the main challenge is choosing a tokenizer that suits Persian and does not distort the written text. We are still researching this, but so far our pick is the BPE tokenizer, which is GPT’s default tokenizer.
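To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules are learned and applied. It is only an illustration, not the tokenizer we will actually train: it works on characters rather than bytes (GPT-2 itself uses byte-level BPE), and the function names `learn_bpe`, `merge_pair`, and `tokenize`, as well as the tiny sample corpus, are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy, character-level)."""
    # Each whitespace-separated word becomes a tuple of single characters with a frequency.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical sample corpus, only for illustration.
sample_merges = learn_bpe(["کتاب کتاب کتابخانه"], 5)
print(tokenize("کتابخانه", sample_merges))
```

Because merges are learned from frequency alone, frequent Persian words end up as single tokens while rare ones fall back to smaller pieces. Whether the resulting pieces respect Persian orthography is exactly the open question mentioned above.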
Training scripts
We will build on ideas from existing scripts (here and here) and modify them to suit our purpose.
Insight
Besides generating high-quality text, a large GPT model can serve as a base that we fine-tune on several tasks, such as chatbots (our team will publish the data for this task soon).

@patrickvonplaten
First of all, thank you and Hugging Face so much for this opportunity.

As I mentioned, we have already prepared some data, and the rest is still in progress. I was wondering whether it is possible to add additional data to the OSCAR dataset, and what your thoughts are on doing so.

Thanks again :slightly_smiling_face:

@saied I’m interested!

@ironcladgeek Welcome,
thank you so much for teaming up with me, lad. :slightly_smiling_face:

Awesome - 2 is definitely enough for now :slight_smile: Finalizing this project!

@patrickvonplaten Great, thank you …

Sounds like a good project! In fact, I have some experience in this area that would help! I’d like to participate.

Wonderful, @m3hrdadfi
Great to have you here :slightly_smiling_face:

Added you :slight_smile:

I’ve created a channel (#gpt-persian) on Discord to talk about this awesome project! Please join us!

@mazy1998
The link to the channel:

@mazy1998 can you try this link?

Hi guys! I speak Farsi too and would like to contribute to this project. I am a data science lead at Bayer and work on both vision and NLP.

Adding you :slight_smile:

Hey guys,
As a Persian guy and a big fan of the Hugging Face pipeline,
I’m very interested in joining this project.

If there is an empty spot, I would also like to join!