How to determine optimal batch & chunk size for MLM?

Hi there,

I am currently trying to optimise a pre-trained language model (BERT) by extending its pre-training with additional texts I have access to. Concretely, I download an existing BERT model using the transformers library, leave the tokenizer unchanged, and then continue training with masked language modelling (MLM), in the hope that the embeddings of the domain-specific terms occurring in my additional texts adapt accordingly.
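To make this concrete, here is roughly what my training script does (a minimal sketch; the checkpoint name, file name, and hyperparameter values are just placeholders, not my exact setup):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Pre-trained checkpoint; the tokenizer stays unchanged.
model_name = "bert-base-uncased"  # placeholder, not necessarily my model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "my_texts.txt" stands in for my additional domain texts.
dataset = load_dataset("text", data_files={"train": "my_texts.txt"})

def tokenize(batch):
    # Truncation only guards against lines longer than the model's maximum.
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM: randomly mask 15% of the tokens on the fly.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-adapted",  # placeholder
    per_device_train_batch_size=8,     # one of the two knobs I am asking about
    num_train_epochs=3,                # placeholder value
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],  # in my runs: the chunked set sketched below
    data_collator=collator,
)
trainer.train()
```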

As I am still quite new to language modelling, I am currently a bit confused about determining the optimal chunk and batch sizes.
The texts that are available to me vary greatly in length. As a first step, I wrote a script of my own that rewrites the many lines of varying length into lines of equal length of, for example, 100 or 200 tokens (-> chunk size 100 or 200). I then load this dataset into my training script.
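My chunking script is a separate, standalone thing, but it does essentially the following (a minimal sketch expressed with datasets.map for brevity; the chunk_size value is just one of the ones I tried, and the columns are whatever the tokenizer above produces):

```python
chunk_size = 100  # or 200, the values under discussion

def group_texts(examples):
    # Concatenate all tokenized lines in the batch into one long sequence.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the last partial chunk so every line has exactly chunk_size tokens.
    total_length = (total_length // chunk_size) * chunk_size
    # Split the long sequence into fixed-length chunks.
    return {
        k: [v[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, v in concatenated.items()
    }

chunked = tokenized.map(group_texts, batched=True)
```

chunked["train"] is then what goes into the Trainer as train_dataset, instead of the variable-length tokenized["train"].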

My question is now as follows:
How do I find out which chunk length is optimal for training? And how do I determine which batch size I should use?

Initial tests have shown that increasing the batch size, for example from 8 to 128, while keeping the dataset the same, considerably reduces the computing time but apparently has no significant influence on the loss. The larger the batch size I choose, the higher the GPU utilisation. Is that really the only thing a larger batch size does: reducing computation time by using the GPUs more efficiently?

In experiments with longer chunk sizes (e.g. going from 100 to 500) at a constant batch size, the computing time also decreases significantly, which is due to the resulting lower number of lines in the training set. However, performance seems to suffer, because the loss is higher. This confuses me, as I always thought that being able to feed longer samples into the model increases its accuracy, but it seems to be the other way round: the longer the lines of the training set, the worse the language model.

I would really appreciate an assessment of my problem. Thank you very much!

Don’t hesitate to ask me about my task if I didn’t explain it well enough.

Hey there.

I am also currently running similar experiments and want to find the optimal batch and chunk sizes.

> However, it seems that the performance suffers because the loss is greater.

Regarding this, I didn’t understand what you meant. Did you see the loss increase when longer sentences, i.e. a bigger chunk size, are chosen?