PreTrain RoBERTa from scratch in Marathi

RoBERTa/BERT for Marathi

Currently, there are no BERT-like models for Marathi on the hub. The goal of this project is to create a RoBERTa/BERT model for the Marathi language alone.

Model

A randomly initialized RoBERTa/BERT model

Datasets

One can make use of the mC4 dataset (a colossal, cleaned version of Common Crawl’s web crawl corpus), specifically its Marathi subset. The Marathi corpus is ~70GB, which should be a great starting point for training a new language model.

Available training scripts

A masked language modeling script for Flax is available in the huggingface/transformers repo: examples/flax/language-modeling/run_mlm_flax.py. It can be used with essentially no code changes.
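The one thing the script expects to already exist is a tokenizer and a model config to load. A minimal sketch of that preparation step, assuming hypothetical plain-text corpus shards named `mr_part_0.txt` and a hypothetical output directory `./marathi-roberta-base`:

```python
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig

model_dir = "./marathi-roberta-base"  # hypothetical output directory
os.makedirs(model_dir, exist_ok=True)

# Train a RoBERTa-style byte-level BPE tokenizer on the Marathi corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["mr_part_0.txt"],  # hypothetical corpus shard(s)
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save(f"{model_dir}/tokenizer.json")

# Training from scratch starts from a config rather than pretrained weights;
# RobertaConfig defaults to the roberta-base architecture.
config = RobertaConfig(vocab_size=tokenizer.get_vocab_size())
config.save_pretrained(model_dir)
```

The directory can then be passed to run_mlm_flax.py via --config_name and --tokenizer_name (together with --model_type roberta), so the model weights are randomly initialized rather than loaded from a checkpoint.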

(Optional) Desired project outcome

The desired project output is a strong RoBERTa/BERT model in Marathi.

(Optional) Links to read up on

The most important read would be the following:

  • Download C4 dataset - allenai/allennlp/discussions/5056

We are pre-training RoBERTa for Hindi… maybe I can help…

How are we gonna evaluate this? Any benchmark datasets available for Marathi?

Found some relevant work on the Marathi language:

Let’s define it

It would be a good start, I guess. I will help 🙂

@nipunsadvilkar, mc4 is now available in the HF datasets API as well… We can directly use it.
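A minimal sketch of loading it (assuming the Marathi config is exposed under the language code "mr"); streaming avoids downloading the full ~70GB corpus before anything can be tried:

```python
from datasets import load_dataset

# Stream the Marathi ("mr") split of mC4 instead of downloading
# the whole ~70GB corpus up front.
dataset = load_dataset("mc4", "mr", split="train", streaming=True)

# Peek at the first document to sanity-check the text field.
print(next(iter(dataset))["text"][:200])
```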

And we got access to a TPU VM as well. I have created the channel #roberta-pretraining-marathi on Discord. Let’s talk more there.


Great! Would you mind talking over a Slack channel instead? I will create #roberta-pretraining-marathi on Slack then.

That’s fine as well :))