PreTrain RoBERTa from scratch in Hindi

RoBERTa/BERT for Hindi

Currently, there are only a few BERT-like models for Hindi on the hub: Hugging Face – The AI community building the future. For this project, the goal is to create a RoBERTa/BERT model for just the Hindi language.


A randomly initialized RoBERTa/BERT model


One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.
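As a rough sketch of what loading the Hindi portion might look like: the snippet below assumes the `oscar` dataset on the hub exposes per-language configurations named like `unshuffled_deduplicated_hi`, and that the installed `datasets` version still supports this script-based dataset (newer versions may not). The helper names `oscar_config_name` and `stream_oscar` are made up for illustration.

```python
def oscar_config_name(lang_code, deduplicated=True):
    """Build an OSCAR configuration name for a language code such as "hi".

    Assumption: configurations follow the "unshuffled_<variant>_<lang>" pattern.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang_code}"


def stream_oscar(lang_code="hi", n=3):
    """Return the first n documents without downloading the whole corpus."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(
        "oscar",
        oscar_config_name(lang_code),
        split="train",
        streaming=True,  # iterate over the corpus instead of downloading it all
    )
    # Assumption: each record carries its document under the "text" key.
    return [example["text"] for _, example in zip(range(n), dataset)]


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for document in stream_oscar("hi", n=3):
        print(document[:200])
```

Streaming is used here only so that a quick look at the data does not require downloading several gigabytes first.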

Available training scripts

A masked language modeling script for Flax is available here. It can be used pretty much without any required code changes.
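For anyone new to the objective, here is a minimal pure-Python sketch of the masking step that masked language modeling relies on. It follows the commonly used 80/10/10 scheme (replace with the mask token, replace with a random token, keep unchanged); it is an illustration under that assumption, not the code of the script itself. The function name `mask_tokens` and all token ids in the usage lines are hypothetical.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size,
                special_ids=(), mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one tokenized sequence.

    labels holds the original id at positions the model must predict
    and -100 (a conventional "ignore" value) everywhere else.
    """
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Never mask special tokens; otherwise select with mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


if __name__ == "__main__":
    ids = [0, 11, 12, 13, 2]  # hypothetical ids; 0 and 2 act as special tokens
    print(mask_tokens(ids, mask_token_id=4, vocab_size=100,
                      special_ids=(0, 2), rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the masking reproducible, which helps when debugging a data pipeline.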

(Optional) Desired project outcome

The desired project output is a strong RoBERTa/BERT model in Hindi.

(Optional) Challenges

The OSCAR dataset might be too small (it has < 10GB of data for Hindi). It will also be important to find datasets on which the BERT-like model can be evaluated after pretraining in Hindi. Once such a dataset is found, one can fine-tune the pretrained BERT-like model on it with the text-classification script here

(Optional) Links to read upon

The most important read would be the following colab:


I am planning to work on this project.

So, here is my first question

Since the current dataset seems too small to train BERT effectively, I am planning to build my own dataset by scraping websites that host content in Hindi.
The first thing that comes to mind is scraping news articles, which are a perpetual source of data. We could also tap into social networks that have more content in Hindi, such as ShareChat, and into Hindi/Hindustani literature websites.

Does this approach sound reasonable, or is it too weak because Common Crawl may already have covered such websites?
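If scraped text does end up being used, it would probably need filtering before pretraining. The following is a minimal sketch, assuming that a simple Devanagari-script ratio and exact-duplicate removal are acceptable first-pass filters. The function names and the 0.5 threshold are made up for illustration, and a script check cannot tell Hindi apart from other Devanagari-script languages such as Marathi or Nepali.

```python
import hashlib


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return in_block / len(chars)


def is_mostly_devanagari(text, threshold=0.5):
    """Heuristic script filter; the threshold is an arbitrary starting point."""
    return devanagari_ratio(text) >= threshold


def dedupe(texts):
    """Drop exact duplicates (after whitespace normalization), keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    scraped = ["नमस्ते दुनिया", "hello world", "नमस्ते  दुनिया"]
    kept = [t for t in dedupe(scraped) if is_mostly_devanagari(t)]
    print(kept)
```

Exact-match hashing only catches verbatim repeats; overlap with Common Crawl derived corpora would likely need near-duplicate detection, which this sketch does not attempt.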


Hi @amankhandelia

For Hindi, the OSCAR corpus has almost 8GB, which should be good enough for a BERT-like model. IndicCorp also has a Hindi corpus.

And then there’s mC4 which has almost 104GB of Hindi data which should be more than enough for this project :slight_smile:

Also, here’s an interesting insight from the CamemBERT paper


Some additional resources for Hindi:

  1. mC4 - 105 GB
  2. CC-100 : 2.5 GB compressed.

I don’t think data will be a problem.


In that case, I stand corrected.

What are some downstream tasks with corresponding Hindi datasets on which we can test such a model’s efficacy and see whether increasing the data volume affects performance on those tasks?

I can recall at least 6 datasets with Hindi evaluation.
See the XTREME benchmark here.


There’s IndicGLUE benchmark here which has quite a few downstream tasks.


I am interested in working on this project.

I am interested in working on this project. Can I take part?

yes, anyone can participate :slight_smile:

Great - let’s officially define this project then :slight_smile:

Putting everybody in the official sheet here. @sbmaruf - I’ve put you in the group for now as well, feel free to leave a comment in the google sheet to be removed if you don’t want to be in the group. I wasn’t sure :slight_smile:

Hi! I will be doing this for Bangla. Probably it’s better I focus more there.


Hi, I am interested in working on this project. Can I participate if slots are open? By the way, I am new to JAX.

Hey everyone who is participating in this project, you can join the #roberta-pretraining-hindi channel over on Discord. There we can start working out how to go about it: which datasets we will use, and which downstream tasks we might want to check our trained model's performance on.

Yes! Added you to the team!

And don’t worry about JAX, we have an awesome speaker line-up that will cover JAX and we and the community will be there to help with any JAX-related questions :slight_smile:


Hey. Which Discord? Can you share an invite with me?

Yeah sure, Flax-HuggingFace-Community-Week

Thanks man :slight_smile:

Hey, I updated the link. Earlier I had shared the generic link; this one will land you directly in the group.

I am interested. Pls share the link to join. Thanks.
