Several open-source language models already exist for Farsi. We have crawled new samples into a new dataset, which we want to use to pre-train ALBERT for Farsi from scratch.
Common Crawl dump
Random web scrapes
There are already Flax scripts for pre-training ALBERT that we can easily use.
Using a Dataset pipeline (ETL approach) for the newly crawled txt files
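As a rough illustration of what such an ETL step could look like for one crawled txt file, here is a minimal sketch using only the Python standard library. The function names (`extract`, `transform`, `load`, `run_etl`), the `min_chars` threshold, and the specific cleaning rules (mapping Arabic yeh and kaf to their Persian forms, collapsing whitespace, dropping short and duplicate lines) are assumptions made for illustration, not part of the original plan.

```python
import re
import unicodedata
from pathlib import Path
from typing import Iterable, Iterator

# Assumption: crawled Farsi text often contains Arabic yeh/kaf code points;
# map them to the Persian forms so the tokenizer sees one spelling.
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def extract(path: Path) -> Iterator[str]:
    """Extract: stream raw lines from a crawled txt file."""
    with path.open(encoding="utf-8", errors="ignore") as f:
        for line in f:
            yield line


def transform(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Transform: normalize, collapse whitespace, drop short or duplicate lines."""
    seen = set()
    for line in lines:
        text = unicodedata.normalize("NFC", line).translate(ARABIC_TO_PERSIAN)
        # \s does not match the zero-width non-joiner, so Farsi ZWNJs survive.
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        yield text


def load(lines: Iterable[str], out_path: Path) -> int:
    """Load: write one cleaned sample per line and return the sample count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for text in lines:
            f.write(text + "\n")
            count += 1
    return count


def run_etl(in_path: str, out_path: str, min_chars: int = 20) -> int:
    """Run the three stages lazily, so large files are never fully in memory."""
    return load(transform(extract(Path(in_path)), min_chars), Path(out_path))
```

Calling `run_etl("raw.txt", "clean.txt")` (hypothetical file names) writes a one-sample-per-line file. A file in that shape can then be read with the Hugging Face `datasets` text loader, for example `load_dataset("text", data_files="clean.txt")`, before tokenizer training and pre-training. Note that the duplicate check keeps every unique line in memory, which may need replacing with a hash-based or on-disk approach for a full Common Crawl dump.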