1. Pre-train Albert from Scratch for the Persian Language
Several open-source language models already exist for Farsi. We crawled new text samples to build a dataset that we want to use to pre-train ALBERT for Farsi from scratch.
2. Language
Persian/Farsi
3. Model
ALBERT xLarge
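A minimal sketch of the target configuration, assuming the hyperparameters of the published ALBERT-xlarge model (hidden size 2048, 24 layers, factorized embedding size 128); these values should be verified against the official `albert-xlarge-v2` config before training. The vocabulary size is a placeholder for a Persian SentencePiece vocabulary.

```python
# Assumed ALBERT-xlarge hyperparameters (verify against the official
# albert-xlarge-v2 config on the Hugging Face Hub before use).
albert_xlarge_config = {
    "vocab_size": 32000,           # placeholder: size of the Persian SentencePiece vocab
    "embedding_size": 128,         # factorized embedding dimension
    "hidden_size": 2048,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 8192,
    "max_position_embeddings": 512,
}

# Sanity check: hidden size must divide evenly across attention heads.
assert albert_xlarge_config["hidden_size"] % albert_xlarge_config["num_attention_heads"] == 0
```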
4. Datasets
Wikipedia dump
Common Crawl dump
Randomly crawled web pages
5. Training scripts
There are already Flax scripts to pre-train ALBERT that we can use directly:
huggingface/transformers, `examples/flax/language-modeling` (master branch) on GitHub
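An illustrative invocation of the `run_mlm_flax.py` script from that directory, assuming a tokenizer and ALBERT config have already been saved locally and the crawled corpus has been merged into a single text file; all paths and hyperparameter values below are placeholders, not tested settings:

```shell
python run_mlm_flax.py \
    --model_type albert \
    --config_name ./albert-fa-xlarge \
    --tokenizer_name ./albert-fa-xlarge \
    --train_file ./data/fa_corpus.txt \
    --max_seq_length 512 \
    --per_device_train_batch_size 8 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --output_dir ./albert-fa-xlarge \
    --overwrite_output_dir
```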
6. Challenges
Building a dataset pipeline (ETL approach: extract, transform, load) to clean and prepare the newly crawled txt files for training
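As a starting point for that pipeline, here is a minimal sketch of the transform step, assuming the crawled pages are plain UTF-8 text with one document per line (the function names and the character mapping below are illustrative choices, not a fixed specification): it maps Arabic-script variants to their standard Persian forms, collapses whitespace, and drops near-empty and duplicate lines.

```python
import re

# Map Arabic-script code points to their standard Persian equivalents.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh  -> Farsi yeh
    "\u0643": "\u06a9",  # Arabic kaf  -> Farsi keheh
    "\u0629": "\u0647",  # teh marbuta -> heh (a common normalization choice)
})

def clean_line(line: str) -> str:
    """Normalize one crawled line: map characters, collapse whitespace."""
    line = line.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", line).strip()

def transform(raw_lines):
    """Transform step of the ETL pipeline: clean, then drop short and duplicate lines."""
    seen = set()
    for raw in raw_lines:
        text = clean_line(raw)
        if len(text) < 10 or text in seen:  # skip near-empty and duplicate lines
            continue
        seen.add(text)
        yield text
```

In a full pipeline, the extract step would read the crawled txt files, this transform step would clean them, and the load step would write the result into whatever format the training script expects (e.g. one cleaned document per line).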