RoBERTa/BERT for Marathi
Currently, there are no BERT-like models for Marathi on the hub. The goal of this project is to create a RoBERTa/BERT model for the Marathi language only.
Model
A randomly initialized RoBERTa/BERT model
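As a minimal sketch, such a model can be instantiated in Flax directly from a configuration, which yields randomly initialized weights. The configuration values below are placeholders; in particular, `vocab_size` would need to match the tokenizer trained on the Marathi corpus:

```python
from transformers import FlaxRobertaForMaskedLM, RobertaConfig

# Hypothetical configuration; vocab_size must match the tokenizer
# that will be trained on the Marathi corpus.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    type_vocab_size=1,
)

# Building the Flax model from a config alone initializes the weights
# randomly; no pretrained checkpoint is loaded.
model = FlaxRobertaForMaskedLM(config, seed=0)
```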
Datasets
One can make use of the Marathi subset of C4, a colossal, cleaned version of Common Crawl’s web crawl corpus. The Marathi corpus is ~70GB, which should be a great starting point for training a new language model.
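As a sketch, the Marathi data can be loaded with the `datasets` library, assuming the multilingual `mc4` dataset on the hub exposes a Marathi config under the language code `mr`. Streaming avoids downloading the full ~70GB corpus up front:

```python
from datasets import load_dataset

# Stream the Marathi ("mr") config of mC4 so the corpus does not have
# to be fully downloaded before training can start.
dataset = load_dataset("mc4", "mr", split="train", streaming=True)

# Peek at a single document.
print(next(iter(dataset))["text"])
```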
Available training scripts
A masked language modeling script for Flax is available in the huggingface/transformers repository under examples/flax/language-modeling/run_mlm_flax.py. It can be used largely as-is, without any required code changes.
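Before launching the script, a tokenizer and a matching model config need to be saved to the model directory so run_mlm_flax.py can pick them up. A minimal sketch, where the output directory, training file path, and vocabulary size are placeholders:

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig

Path("marathi-roberta").mkdir(exist_ok=True)

# Train a byte-level BPE tokenizer on the Marathi text
# ("data/mr_train.txt" is a placeholder path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/mr_train.txt"],
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("marathi-roberta/tokenizer.json")

# Save a RoBERTa config with a matching vocabulary size.
config = RobertaConfig.from_pretrained("roberta-base", vocab_size=50265)
config.save_pretrained("marathi-roberta")
```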
(Optional) Desired project outcome
The desired project outcome is a strong RoBERTa/BERT model for Marathi.
(Optional) Links to read up on
The most important read is the following discussion on obtaining the C4 dataset:
- Download C4 dataset - allenai/allennlp/discussions/5056