PreTrain RoBERTa from scratch in Hindi

patrickvonplaten · June 23, 2021, 11:32am

RoBERTa/BERT for Hindi

Currently, there are only a very limited number of BERT-like models for Hindi on the hub: Hugging Face – The AI community building the future. For this project, the goal is to create a RoBERTa/BERT model for just the Hindi language.

Model

A randomly initialized RoBERTa/BERT model
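As a rough sketch of what "randomly initialized" means in practice: the model is built from a configuration alone instead of from a pretrained checkpoint. The example assumes the transformers library with Flax installed. The sizes are the usual RoBERTa-base values and are an assumption, not something fixed by this project; the vocabulary size in particular would depend on the tokenizer trained on Hindi text. The helper names are hypothetical.

```python
# Sketch: build a RoBERTa model with random weights (no pretrained checkpoint).
# The sizes below are RoBERTa-base values, assumed here for illustration only.


def roberta_base_kwargs(vocab_size: int = 50265) -> dict:
    """Return config arguments for a RoBERTa-base sized model."""
    return {
        "vocab_size": vocab_size,        # should match the Hindi tokenizer
        "hidden_size": 768,
        "num_hidden_layers": 12,
        "num_attention_heads": 12,
        "intermediate_size": 3072,
        "max_position_embeddings": 514,  # 512 tokens plus RoBERTa's position offset
        "type_vocab_size": 1,
    }


def init_random_model(seed: int = 0):
    """Create a FlaxRobertaForMaskedLM with randomly initialized parameters."""
    # Imported lazily so the config helper works without transformers installed.
    from transformers import FlaxRobertaForMaskedLM, RobertaConfig

    config = RobertaConfig(**roberta_base_kwargs())
    # Constructing from a config (not from_pretrained) gives random weights.
    return FlaxRobertaForMaskedLM(config, seed=seed)
```

Calling init_random_model() returns a model whose parameters come only from the random seed, which is the starting point for pretraining from scratch.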

Datasets

One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.
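A minimal sketch of loading the Hindi portion of OSCAR through the datasets library. It assumes the "oscar" dataset with its "unshuffled_deduplicated_<lang>" configuration naming, and a datasets version that still supports that loader; the helper names are hypothetical. Streaming is used so that nothing has to be downloaded in full up front.

```python
# Sketch: stream the Hindi split of OSCAR with the datasets library.
# Assumes the "oscar" dataset and its "unshuffled_*_<lang>" config names.


def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build the OSCAR configuration name for a language code such as 'hi'."""
    prefix = "unshuffled_deduplicated" if deduplicated else "unshuffled_original"
    return f"{prefix}_{lang}"


def stream_oscar_hindi():
    """Return an iterable over Hindi OSCAR examples without a full download."""
    # Imported lazily so the name helper works without datasets installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar",
        oscar_config_name("hi"),
        split="train",
        streaming=True,
    )
```

Iterating over the result of stream_oscar_hindi() should yield one dictionary per document, with the raw text expected under the "text" key.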

Available training scripts

A masked language modeling script for Flax is available here. It can be used pretty much without any required code changes.
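For readers new to the objective, here is a small sketch of the masking step that masked language modeling relies on. It follows the standard BERT recipe (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). It is written in plain Python for clarity and is not the linked script's actual code; the function name and parameters are hypothetical.

```python
import random

# Sketch of BERT-style token masking, in plain Python for illustration.
# Labels use -100 for positions that should not contribute to the loss,
# a common convention in Hugging Face data collators.


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; skip tokens not selected for prediction.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok                  # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id          # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels
```

Passing a seeded random.Random instance as rng makes the masking reproducible, which helps when debugging a data pipeline.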

(Optional) Desired project outcome

The desired project output is a strong RoBERTa/BERT model in Hindi.

(Optional) Challenges

The OSCAR dataset might be too small (it has < 10GB of data for Hindi). Also, it might be important to find datasets the BERT-like model can be evaluated on after pretraining in Hindi. Having found a dataset to fine-tune the pretrained BERT-like model on, one can make use of the text-classification script here.

(Optional) Links to read up on

The most important read would be the following colab:

Google Colaboratory

amankhandelia · June 23, 2021, 5:15pm

I am planning to work on this project.

So, here is my first question

The current dataset seems too small to train BERT effectively, so I am planning to build my own dataset by scraping websites that host content in Hindi.
The first thing that comes to mind is scraping news articles, which are a perpetual source of data. We can also tap into social networks that have more content in Hindi, such as ShareChat, and into Hindi/Hindustani literature websites.

Does this approach sound reasonable, or is it too weak because Common Crawl could have already covered such websites?
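If scraped text were added, two cheap checks would help with the overlap concern: keeping only paragraphs that are mostly Devanagari, and dropping exact duplicates. The sketch below uses only the standard library; the function names and the 0.5 threshold are assumptions for illustration, and exact hashing would not catch near-duplicates.

```python
import hashlib

# Sketch: keep mostly-Devanagari paragraphs and drop exact duplicates.
# The Devanagari Unicode block is U+0900 to U+097F.


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    deva = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return deva / len(chars)


def filter_and_dedupe(paragraphs, min_ratio: float = 0.5):
    """Return whitespace-normalized Hindi paragraphs, first occurrence only."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        norm = " ".join(paragraph.split())  # collapse runs of whitespace
        if devanagari_ratio(norm) < min_ratio:
            continue                        # mostly non-Hindi text
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                        # exact duplicate already kept
        seen.add(digest)
        kept.append(norm)
    return kept
```

The same hash set could in principle be built from an existing corpus first, so that scraped paragraphs already present there are skipped.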

valhalla · June 23, 2021, 5:23pm

Hi @amankhandelia

For Hindi, the OSCAR corpus has almost 8GB, which will be good enough for a BERT-like model. IndicCorp also has a Hindi corpus.

And then there’s mC4, which has almost 104GB of Hindi data. That should be more than enough for this project.

Also, here’s an interesting insight from the CamemBERT paper:
[image: camem]

sbmaruf · June 23, 2021, 5:26pm

Some additional resources for Hindi:

mC4: 105 GB
CC-100: 2.5 GB compressed

I don’t think data will be a problem.

amankhandelia · June 23, 2021, 11:55pm

In that case, I stand corrected.

What are some downstream tasks with corresponding datasets in Hindi on which we can test such a model’s efficacy and see whether increasing the data volume has an effect on performance?

sbmaruf · June 24, 2021, 12:13am

I can recall at least 6 datasets with Hindi evaluation.
See the XTREME benchmark: https://arxiv.org/pdf/2003.11080.pdf

valhalla · June 24, 2021, 5:17am

There’s the IndicGLUE benchmark here, which has quite a few downstream tasks.

mlkorra · June 25, 2021, 4:04pm

I am interested in working on this project.

hassiahk · June 28, 2021, 11:02am

I am interested in working on this project. Can I take part?

valhalla · June 28, 2021, 11:12am

Yes, anyone can participate.

patrickvonplaten · June 28, 2021, 3:55pm

Great, let’s officially define this project then.

Putting everybody in the official sheet here. @sbmaruf - I’ve put you in the group for now as well; feel free to leave a comment in the Google sheet to be removed if you don’t want to be in the group. I wasn’t sure.

sbmaruf · June 29, 2021, 8:53am

Hi! I will be doing this for Bangla. Probably it’s better I focus more there.

ramjaju · June 29, 2021, 6:23pm

Hi, I am interested in working on this project. Can I participate if slots are open? BTW, I am new to JAX.

amankhandelia · June 30, 2021, 5:35am

Hey everyone who is participating in this project, you can join the #roberta-pretraining-hindi channel over on Discord. There we can start working out how to go about it, which datasets we will use, and any downstream tasks on which we might wish to check the performance of our trained model.

valhalla · June 30, 2021, 8:09am

Yes! Added you to the team!

And don’t worry about JAX: we have an awesome speaker line-up that will cover JAX, and we and the community will be there to help with any JAX-related questions.

hassiahk · June 30, 2021, 1:44pm

Hey. Which Discord? Can you share an invite with me?

amankhandelia · June 30, 2021, 2:03pm

Yeah sure, Flax-HuggingFace-Community-Week

hassiahk · June 30, 2021, 2:06pm

Thanks man

amankhandelia · June 30, 2021, 2:07pm

Hey, I updated the link. I had shared the generic link before; this one will land you directly in the group.

skylord · June 30, 2021, 2:21pm

I am interested. Pls share the link to join. Thanks.
