RoBERTa model for Sinhala Language

Anne · May 25, 2021, 3:35am

Hi,
I am trying to build a RoBERTa model for the Sinhala language.

My final training dataset is as follows:

Number of words = 64,129,561

Number of sentences = 5,134,347

Size of the file = 938,019 KB

I already created the BERT tokenizer using my training dataset. (The tokenizer files are 1644 KB + 1299 KB.)

Now I'm trying to train the model using Google Colab.

Since the dataset is large, I divided it into 10 subsets.
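For reference, a split like this can be done without loading the whole file into memory. The following is a minimal sketch, not the code from the referenced tutorial: it assumes the corpus is a plain UTF-8 text file with one sentence per line, and the function name `split_into_shards` and the `shard_XX.txt` file names are hypothetical.

```python
import os


def split_into_shards(input_path, out_dir, num_shards=10):
    """Write the lines of input_path round-robin into num_shards files.

    Only one line is held in memory at a time, so the size of the
    corpus does not matter. Returns the list of shard file paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = [
        os.path.join(out_dir, f"shard_{i:02d}.txt") for i in range(num_shards)
    ]
    shards = [open(p, "w", encoding="utf-8") for p in shard_paths]
    try:
        with open(input_path, encoding="utf-8") as src:
            for line_no, line in enumerate(src):
                # Line 0 goes to shard 0, line 1 to shard 1, and so on.
                shards[line_no % num_shards].write(line)
    finally:
        for f in shards:
            f.close()
    return shard_paths
```

Round-robin assignment keeps the shards within one line of each other in size, at the cost of interleaving the original sentence order across files.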

But the estimated training time for 1 epoch is really long, around 270 h. (The remaining time seems to decrease quickly, though: within 20 minutes it dropped from 290 h to 250 h.)
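To put the estimate in perspective, the number of optimizer steps per epoch and the implied time per step can be worked out from the sentence count. This is only a rough sketch: it assumes one sentence is one training example, and the batch size of 64 is a placeholder, since the post does not state the actual value.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def seconds_per_step(epoch_hours, num_examples, batch_size):
    """Average time per step implied by a given epoch duration."""
    return epoch_hours * 3600 / steps_per_epoch(num_examples, batch_size)


NUM_SENTENCES = 5_134_347  # from the dataset statistics above
BATCH_SIZE = 64            # assumed value, not taken from the post

steps = steps_per_epoch(NUM_SENTENCES, BATCH_SIZE)
print("steps per epoch:", steps)
print("seconds per step at 270 h:",
      round(seconds_per_step(270, NUM_SENTENCES, BATCH_SIZE), 1))
```

Comparing the implied seconds per step with what the progress bar reports after the first few minutes shows whether the 270 h figure is a stable estimate or just an early, pessimistic one.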

Sometimes the program crashes after 10 or 20 minutes.

I created the model by following the link below. (I am using exactly the same code; in the reference, it runs in about 3 h for 1 epoch.)

Is this happening because the tokenizer I created is big?

Is there a better way to do this?

Reference code
