Pre-train ALBERT from scratch for the Persian/Farsi language

1. Pre-train ALBERT from Scratch for the Persian Language

Currently, there are a few open-source language models for the Farsi language. We crawled some new samples as a new dataset, which we want to use for pre-training ALBERT for Farsi from scratch.

2. Language

Persian (Farsi)

3. Model

ALBERT

4. Datasets

Wikipedia dump
Common Crawl dump
Random web scrapes
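
Crawled Farsi text usually needs cleaning before pre-training. Below is a minimal sketch of one common step, mapping Arabic code points to their Persian equivalents and collapsing whitespace; the exact character set to normalize is an assumption and would need to be extended per corpus.

```python
# Sketch of Farsi text normalization for crawled data. The mappings below are
# standard for Persian preprocessing, but the full rule set is an assumption.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # Arabic Yeh  -> Persian Yeh
    "\u0643": "\u06A9",  # Arabic Kaf  -> Persian Keheh
}

def normalize_farsi(text: str) -> str:
    """Map Arabic-script variants to Persian forms and collapse whitespace."""
    for src, dst in ARABIC_TO_PERSIAN.items():
        text = text.replace(src, dst)
    return " ".join(text.split())

# Example: Arabic Yeh is replaced by Persian Yeh, extra spaces are collapsed.
print(normalize_farsi("\u0643\u064A   test"))
```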

5. Training scripts

There are already Flax scripts to pre-train ALBERT that we can easily use:

transformers/examples/flax/language-modeling at master · huggingface/transformers · GitHub
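
As a rough sketch, a pre-training run with the Flax masked-LM example script might look like the following. All flag values, paths, and the dataset name are placeholders, and the flag names should be verified against the script's argument parser before use.

```shell
# Hypothetical invocation of the Flax MLM example script; paths, dataset name,
# and hyperparameters are placeholders, not a tested configuration.
python run_mlm_flax.py \
    --output_dir ./albert-fa \
    --model_type albert \
    --config_name ./albert-fa \
    --tokenizer_name ./albert-fa \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_fa \
    --max_seq_length 512 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-4 \
    --num_train_epochs 3
```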

6. Challenges

Building a Datasets pipeline (ETL approach) for the newly crawled text files
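
The ETL challenge above could be structured as three small stages. This is a sketch under the assumption of one document per line; the thresholds are placeholders, and the real "load" step would write shards or build a `datasets.Dataset` rather than return a list.

```python
# Rough ETL sketch for crawled .txt files: extract raw lines, transform by
# filtering short lines and exact duplicates, then load the cleaned result.
import hashlib
from typing import Iterable, Iterator, List

def extract(lines: Iterable[str]) -> Iterator[str]:
    """Extract: yield stripped raw lines (in practice, read from crawl files)."""
    for line in lines:
        yield line.strip()

def transform(lines: Iterable[str], min_chars: int = 10) -> Iterator[str]:
    """Transform: drop too-short lines and exact duplicates (hash-based dedup)."""
    seen = set()
    for line in lines:
        if len(line) < min_chars:
            continue
        digest = hashlib.md5(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line

def load(lines: Iterable[str]) -> List[str]:
    """Load: collect cleaned lines (in practice, write shards to disk)."""
    return list(lines)

raw = ["short", "this line is long enough to keep",
       "this line is long enough to keep"]
cleaned = load(transform(extract(raw)))
print(cleaned)  # short lines and duplicates removed
```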


Sounds amazing, I’m also interested in participating.


Thanks, sure! We are waiting for approval.

Alright, finalizing it!

I’m interested in joining.

I’d be happy to help you with this project!

@patrickvonplaten, do I need to request TPU access, or can I collaborate using the credentials of one of the team members, since I have access to the hf-flax group?