Hi
I am trying to build a RoBERTa model for the Sinhala language.
My final training dataset is as follows:
Number of words = 64,129,561
Number of sentences = 5,134,347
File size = 938,019 KB
I have already created the BERT tokenizer from my training dataset. (The tokenizer files are 1644 KB + 1299 KB.)
Now I'm trying to train the model using Google Colab.
Since the dataset is large, I divided it into 10 subsets.
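For anyone who wants to reproduce the split, here is a minimal sketch of one way to do it. It is not necessarily the exact code I ran. It assumes the corpus is a plain-text file with one sentence per line; the function name `split_into_shards` and the `shard_XX.txt` file names are made up for illustration.

```python
from contextlib import ExitStack
from pathlib import Path


def split_into_shards(src_path, out_dir, n_shards=10):
    """Stream a one-sentence-per-line text file into n_shards files.

    Lines are distributed round-robin, so the file is read once and
    never loaded into memory as a whole.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: shard_00.txt ... shard_09.txt
    shard_paths = [out_dir / f"shard_{i:02d}.txt" for i in range(n_shards)]

    with ExitStack() as stack:
        src = stack.enter_context(open(src_path, encoding="utf-8"))
        outs = [
            stack.enter_context(open(p, "w", encoding="utf-8"))
            for p in shard_paths
        ]
        for i, line in enumerate(src):
            outs[i % n_shards].write(line)

    return shard_paths
```

Round-robin keeps the shards roughly equal in size without counting lines first; a contiguous split would need a second pass over the file.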
However, the estimated training time for 1 epoch is very long, around 270 h. (The remaining-time estimate does decrease quickly, though: within 20 minutes it dropped from 290 h to 250 h.)
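To put that estimate in context, here is a rough sketch of how a progress bar typically arrives at a remaining-time figure: average time per step so far, multiplied by the steps left. The batch size of 64 and the one-sentence-per-example setup are assumptions for illustration, not values I have confirmed from the training script.

```python
import math


def steps_per_epoch(n_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(n_examples / batch_size)


def eta_hours(steps_done, elapsed_seconds, total_steps):
    """Remaining time in hours, extrapolated from the average step time so far."""
    seconds_per_step = elapsed_seconds / steps_done
    return (total_steps - steps_done) * seconds_per_step / 3600


N_SENTENCES = 5_134_347  # from the dataset statistics above
BATCH_SIZE = 64          # assumption: not taken from the actual script

total_steps = steps_per_epoch(N_SENTENCES, BATCH_SIZE)
print("steps per epoch:", total_steps)
```

Because the average includes slow warm-up steps (data loading, caching, GPU initialisation), an estimate taken in the first few minutes can be much higher than the eventual figure, which might explain part of the quick drop.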
Sometimes the program crashes after 10 or 20 minutes.
I created the model by following the link below. (I am using exactly the same code; that code runs in about 3 h for 1 epoch.)
Does this happen because the tokenizer I created is big?
Is there a better way to do this?
Reference code