PreTrain RoBERTa from scratch in Marathi

RoBERTa/BERT for Marathi

Currently, there are no BERT-like models for Marathi on the hub. The goal of this project is to create a RoBERTa/BERT model for the Marathi language alone.

Model

A randomly initialized RoBERTa/BERT model

Datasets

One can make use of the mC4 dataset (a colossal, cleaned version of Common Crawl’s web crawl corpus), specifically its Marathi subset. The Marathi corpus is ~70GB, which should be a great starting point for training a new language model.

Available training scripts

A masked language modeling script for Flax is available in the huggingface/transformers repo: examples/flax/language-modeling/run_mlm_flax.py. It can be used with essentially no code changes.
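The one thing the script expects to already exist is a tokenizer and a model config to load. A minimal sketch of that preparation step, assuming hypothetical plain-text corpus shards named `mr_part_0.txt` and a hypothetical output directory `./marathi-roberta-base`:

```python
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig

model_dir = "./marathi-roberta-base"  # hypothetical output directory
os.makedirs(model_dir, exist_ok=True)

# Train a RoBERTa-style byte-level BPE tokenizer on the Marathi corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["mr_part_0.txt"],  # hypothetical corpus shard(s)
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save(f"{model_dir}/tokenizer.json")

# Training from scratch starts from a config rather than pretrained weights;
# RobertaConfig defaults to the roberta-base architecture.
config = RobertaConfig(vocab_size=tokenizer.get_vocab_size())
config.save_pretrained(model_dir)
```

The directory can then be passed to run_mlm_flax.py via --config_name and --tokenizer_name (together with --model_type roberta), so the model weights are randomly initialized rather than loaded from a checkpoint.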

(Optional) Desired project outcome

The desired project output is a strong RoBERTa/BERT model in Marathi.

(Optional) Links to read up on

The most important read would be the following:

  • Download C4 dataset - allenai/allennlp/discussions/5056

We are pre-training RoBERTa for Hindi… maybe I can help…

How are we gonna evaluate this? Any benchmark datasets available for Marathi?

Found some relevant work on the Marathi language:

Let’s define it

It would be a good start, I guess. I will help 🙂

@nipunsadvilkar, mc4 is now available in the HF datasets API as well… We can directly use it.
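A minimal sketch of loading it (assuming the Marathi config is exposed under the language code "mr"); streaming avoids downloading the full ~70GB corpus before anything can be tried:

```python
from datasets import load_dataset

# Stream the Marathi ("mr") split of mC4 instead of downloading
# the whole ~70GB corpus up front.
dataset = load_dataset("mc4", "mr", split="train", streaming=True)

# Peek at the first document to sanity-check the text field.
print(next(iter(dataset))["text"][:200])
```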

And we got access to a TPU VM as well. I have created the channel #roberta-pretraining-marathi on Discord. Let’s talk more there.


Great! Would you mind talking over a Slack channel instead? I will create #roberta-pretraining-marathi on Slack then.

That’s fine as well :))