Currently, there are no BERT-like models for Marathi on the hub. The goal of this project is to create a RoBERTa/BERT model for the Marathi language, trained from scratch starting from a randomly initialized RoBERTa/BERT model.
One can make use of the Marathi portion of mC4, a colossal, cleaned version of Common Crawl's web crawl corpus. The Marathi corpus is ~70GB, which should be a great start for training a new language model.
A masked language modeling script for Flax is available in the huggingface/transformers repo at examples/flax/language-modeling/run_mlm_flax.py. It can be used with essentially no code changes.
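An invocation of the script might look like the sketch below. All flag values are illustrative assumptions (the dataset name/config in particular), and the local directory is assumed to already contain a saved config and tokenizer; consult the script's `--help` output for the authoritative set of arguments.

```shell
# Sketch invocation of run_mlm_flax.py; all values are assumptions.
python run_mlm_flax.py \
    --output_dir="./roberta-base-mr" \
    --model_type="roberta" \
    --config_name="./roberta-base-mr" \
    --tokenizer_name="./roberta-base-mr" \
    --dataset_name="allenai/c4" \
    --dataset_config_name="mr" \
    --max_seq_length="128" \
    --per_device_train_batch_size="64" \
    --learning_rate="3e-4" \
    --num_train_epochs="8"
```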
The desired project output is a strong RoBERTa/BERT model in Marathi.
The most important read would be the following discussion:
- Download C4 dataset - allenai/allennlp/discussions/5056