How to determine optimal batch & chunk size for MLM?

nicir · August 15, 2022, 5:11pm

Hi there,

I am currently trying to optimise a pre-trained language model (BERT) by extending the pre-training with additional texts I have access to. This means that I download an existing BERT model using the transformers library, leave the tokenizer unchanged and then use masked language modelling (MLM) to train the already existing embeddings in the hope that the embeddings of the terms from the domain from which my additional texts come will adapt accordingly.

As I am still quite new to language modelling, I am currently a bit confused about determining the optimal chunk and batch sizes.
The texts available to me vary widely in length. As a first step, I wrote my own script that regroups them into lines of equal length, for example 100 or 200 tokens each (i.e. a chunk size of 100 or 200). I then load this dataset into my training script.
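For illustration, here is a minimal sketch of such a chunking step. It is not the script from the post: it assumes the texts are already tokenized into integer ids, that all texts are concatenated into one stream, and that a trailing remainder shorter than the chunk size is dropped. The name `chunk_token_ids` and the toy data are hypothetical.

```python
from typing import Iterable, List


def chunk_token_ids(docs: Iterable[List[int]], chunk_size: int) -> List[List[int]]:
    """Concatenate tokenized texts and cut the stream into fixed-length chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
    # Assumption: drop the trailing remainder so every chunk has exactly chunk_size tokens.
    usable = (len(stream) // chunk_size) * chunk_size
    return [stream[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Toy input: three "texts" of very different lengths (placeholder ids, not real tokenizer output).
docs = [list(range(250)), list(range(30)), list(range(125))]
chunks = chunk_token_ids(docs, chunk_size=100)
print(len(chunks), {len(c) for c in chunks})  # 405 tokens in total -> 4 chunks of 100
```

With a real tokenizer, the special tokens added to each row would also count toward the sequence length the model sees.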

My question is now as follows:
How do I find out which length of the chunks is optimal for the training? And how do I find out which batch size I should use?

Initial tests have shown that increasing the batch size, for example from 8 to 128, while keeping the dataset the same considerably reduces the computing time but apparently has no significant influence on the loss. The larger the batch size I choose, the more the GPUs are utilised. Is that all a larger batch size does: reduce computation time by using the GPUs more efficiently?
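One effect that can be checked with simple arithmetic: with the dataset fixed, a larger batch size also means fewer optimizer updates per epoch. The sketch below only shows that arithmetic; the row count is a hypothetical placeholder, not a figure from the post.

```python
import math


def steps_per_epoch(num_rows: int, batch_size: int) -> int:
    """Number of optimizer updates needed to see every row once."""
    return math.ceil(num_rows / batch_size)


num_rows = 100_000  # hypothetical dataset size
for batch_size in (8, 128):
    print(batch_size, steps_per_epoch(num_rows, batch_size))
# 8   -> 12500 updates per epoch
# 128 -> 782 updates per epoch
```

Because each update then averages over more rows while the number of updates shrinks, the learning rate is commonly retuned when the batch size changes this much.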

In experiments with longer chunks (e.g. going from 100 to 500 tokens) but a constant batch size, the computing time also decreases significantly, which is due to the resulting lower number of lines in the training set. However, it seems that the performance suffers because the loss is greater. This confuses me, as I always thought that feeding longer samples into the model would increase its accuracy, but it seems to be the other way round: the longer the rows of the training set, the worse the language model.
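A rough sketch of what changes when only the chunk size is varied: the number of rows shrinks, and every batch carries more tokens. The total token count and the batch size below are hypothetical placeholders, and the helper name is made up for this example.

```python
from typing import Tuple


def rows_and_tokens_per_batch(total_tokens: int, chunk_size: int, batch_size: int) -> Tuple[int, int]:
    """Return (number of rows after chunking, tokens contained in one batch)."""
    num_rows = total_tokens // chunk_size  # remainder dropped, as an assumption
    tokens_per_batch = chunk_size * batch_size
    return num_rows, tokens_per_batch


total_tokens = 10_000_000  # hypothetical corpus size
batch_size = 32            # hypothetical, held constant
for chunk_size in (100, 500):
    rows, tokens = rows_and_tokens_per_batch(total_tokens, chunk_size, batch_size)
    print(chunk_size, rows, tokens)
# 100 -> 100000 rows, 3200 tokens per batch
# 500 -> 20000 rows, 16000 tokens per batch
```

This does not by itself explain the difference in loss, but it shows that the two runs differ in more than chunk length, so their losses after one pass over the data are not directly comparable.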

I would really appreciate an assessment of my problem. Thank you very much!

Don’t hesitate to ask about my task if I didn’t explain it well enough.

Sirius8458234243 · January 5, 2023, 12:11am

Hey there.

I am currently running a similar experiment and also want to find the optimal batch and chunk size.

"However, it seems that the performance suffers because the loss is greater."

Regarding this, I didn’t understand what you meant. Did you see the loss increase when longer sequences, i.e. a bigger chunk size, are chosen?
