1. Pre-train Albert from Scratch for the Persian Language
Several open-source language models already exist for Farsi. We crawled new text samples to build a dataset that we want to use to pre-train ALBERT for Farsi from scratch.
2. Language
Persian/Farsi
3. Model
ALBERT xLarge
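A minimal sketch of the target configuration, assuming the hyperparameters of the published ALBERT-xlarge model (hidden size 2048, 24 layers, factorized embedding size 128); these values should be verified against the official `albert-xlarge-v2` config before training. The vocabulary size is a placeholder for a Persian SentencePiece vocabulary.

```python
# Assumed ALBERT-xlarge hyperparameters (verify against the official
# albert-xlarge-v2 config on the Hugging Face Hub before use).
albert_xlarge_config = {
    "vocab_size": 32000,           # placeholder: size of the Persian SentencePiece vocab
    "embedding_size": 128,         # factorized embedding dimension
    "hidden_size": 2048,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 8192,
    "max_position_embeddings": 512,
}

# Sanity check: hidden size must divide evenly across attention heads.
assert albert_xlarge_config["hidden_size"] % albert_xlarge_config["num_attention_heads"] == 0
```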
4. Datasets
Wikipedia dump
Common Crawl dump
Randomly crawled web pages
5. Training scripts
There are already Flax scripts to pre-train ALBERT that we can use directly:
huggingface/transformers, `examples/flax/language-modeling` (master branch) on GitHub
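An illustrative invocation of the `run_mlm_flax.py` script from that directory, assuming a tokenizer and ALBERT config have already been saved locally and the crawled corpus has been merged into a single text file; all paths and hyperparameter values below are placeholders, not tested settings:

```shell
python run_mlm_flax.py \
    --model_type albert \
    --config_name ./albert-fa-xlarge \
    --tokenizer_name ./albert-fa-xlarge \
    --train_file ./data/fa_corpus.txt \
    --max_seq_length 512 \
    --per_device_train_batch_size 8 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --output_dir ./albert-fa-xlarge \
    --overwrite_output_dir
```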
6. Challenges
Building a dataset pipeline (ETL approach: extract, transform, load) to clean and prepare the newly crawled txt files for training
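As a starting point for that pipeline, here is a minimal sketch of the transform step, assuming the crawled pages are plain UTF-8 text with one document per line (the function names and the character mapping below are illustrative choices, not a fixed specification): it maps Arabic-script variants to their standard Persian forms, collapses whitespace, and drops near-empty and duplicate lines.

```python
import re

# Map Arabic-script code points to their standard Persian equivalents.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh  -> Farsi yeh
    "\u0643": "\u06a9",  # Arabic kaf  -> Farsi keheh
    "\u0629": "\u0647",  # teh marbuta -> heh (a common normalization choice)
})

def clean_line(line: str) -> str:
    """Normalize one crawled line: map characters, collapse whitespace."""
    line = line.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", line).strip()

def transform(raw_lines):
    """Transform step of the ETL pipeline: clean, then drop short and duplicate lines."""
    seen = set()
    for raw in raw_lines:
        text = clean_line(raw)
        if len(text) < 10 or text in seen:  # skip near-empty and duplicate lines
            continue
        seen.add(text)
        yield text
```

In a full pipeline, the extract step would read the crawled txt files, this transform step would clean them, and the load step would write the result into whatever format the training script expects (e.g. one cleaned document per line).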