PreTrain RoBERTa from scratch in Thai

patrickvonplaten · June 23, 2021, 11:41am

RoBERTa/BERT for Thai

Currently, there are only a few BERT-like models for Thai on the hub: Hugging Face – The AI community building the future. For this project, the goal is to create a RoBERTa/BERT model for just the Thai language.

Model

A randomly initialized RoBERTa/BERT model
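
A minimal sketch of what "randomly initialized" could look like with the Flax classes in `transformers`, so that no pretrained weights are downloaded. The helper names `roberta_base_config_kwargs` and `build_random_roberta` are hypothetical, the sizes are assumed to mirror a base-sized RoBERTa, and the default `vocab_size` is only a placeholder that should be replaced by the size of whatever Thai tokenizer gets trained.

```python
def roberta_base_config_kwargs(vocab_size: int) -> dict:
    """Hyperparameters assumed to match a base-sized RoBERTa (an assumption, not a project decision)."""
    return {
        "vocab_size": vocab_size,          # must equal the tokenizer's vocabulary size
        "hidden_size": 768,
        "num_hidden_layers": 12,
        "num_attention_heads": 12,
        "intermediate_size": 3072,
        "max_position_embeddings": 514,    # 512 tokens plus RoBERTa's position offset
        "type_vocab_size": 1,
        "layer_norm_eps": 1e-5,
    }


def build_random_roberta(vocab_size: int = 50265, seed: int = 0):
    """Create a masked-LM RoBERTa with freshly sampled (untrained) weights."""
    # Imported lazily so the config helper above works without transformers/flax installed.
    from transformers import FlaxRobertaForMaskedLM, RobertaConfig

    config = RobertaConfig(**roberta_base_config_kwargs(vocab_size))
    # Constructing the model from a config (rather than from_pretrained)
    # initializes the parameters randomly from the given seed.
    return FlaxRobertaForMaskedLM(config, seed=seed)
```

Calling `build_random_roberta(vocab_size=len(tokenizer))` with a Thai tokenizer would then give a starting point for pretraining.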

Datasets

One can make use of OSCAR. The dataset is also available through the datasets library here: oscar · Datasets at Hugging Face.
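
As a rough sketch, loading the Thai portion of OSCAR through the `datasets` library might look like the following. The helper names `oscar_config_name` and `load_thai_oscar` are hypothetical, and choosing the deduplicated variant is an assumption. Depending on the installed `datasets` version, loading this dataset may need additional arguments.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build the OSCAR configuration name for a language code such as "th"."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def load_thai_oscar(streaming: bool = False):
    """Load the Thai training split of OSCAR (downloads data on first use)."""
    # Imported lazily so the name helper above works without datasets installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar",
        oscar_config_name("th"),
        split="train",
        streaming=streaming,  # stream to avoid downloading everything up front
    )
```

Each example is expected to carry a `text` field, which is what a masked language modeling script would tokenize.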

Available training scripts

A masked language modeling script for Flax is available here. It can be used pretty much without any required code changes.

(Optional) Desired project outcome

The desired project output is a strong RoBERTa/BERT model in Thai.

(Optional) Challenges

The OSCAR dataset might be too small (it has < 20GB of data for Thai). It might also be important to find datasets that the BERT-like model can be evaluated on after pretraining in Thai. Once a dataset for fine-tuning the pretrained BERT-like model has been found, one can make use of the text-classification script here.

(Optional) Links to read up on

The most important read would be the following colab:

Google Colaboratory

sakares · June 30, 2021, 3:00am

Hi there. I have tried this with PyTorch before, following this blog, and would love to do it in Flax/JAX too.

I will take a look and report back on the status soon.

patrickvonplaten · July 2, 2021, 3:36pm

Let’s define it

sakares · July 2, 2021, 4:55pm

Alright. As far as I know, once the LM is ready, we will start by checking it on a downstream task against an existing benchmark like this one.
