BART generation with shorter input sequences on pre-training task

meliksahturker · January 25, 2023, 10:12am

I pre-trained a BART-large model on the sentence shuffling + text infilling task, as in the original paper.
It was trained on my own dataset for 400K steps with a batch size of 256.
Training went smoothly, the loss curve looks fine, and the model does a good job of reconstructing corrupted inputs.
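
For readers unfamiliar with the task, here is a rough sketch of the two corruptions. It is only an illustration, not the code used for this run: it works on whitespace tokens instead of subword ids, the helper names (`permute_sentences`, `infill_text`, `sample_poisson`) are made up for this example, and the masking procedure only approximates the target ratio. The 30% mask ratio and Poisson(λ=3) span lengths follow the BART paper.

```python
import math
import random
from typing import List

MASK = "<mask>"  # placeholder mask symbol for this sketch


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def permute_sentences(sentences: List[str], rng: random.Random) -> List[str]:
    """Sentence shuffling: return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_text(tokens: List[str], mask_ratio: float, lam: float,
                rng: random.Random) -> List[str]:
    """Text infilling: replace each sampled span with a single MASK token.

    A span starts at each position with probability mask_ratio / lam, so
    roughly mask_ratio of the tokens end up masked (an approximation).
    A zero-length span simply inserts a MASK without removing anything.
    """
    start_prob = mask_ratio / lam
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span = sample_poisson(lam, rng)
            out.append(MASK)  # one MASK no matter how long the span is
            i += span         # skip the tokens covered by the span
        else:
            out.append(tokens[i])
            i += 1
    return out


rng = random.Random(0)
sentences = ["a b c .", "d e .", "f g h i ."]
shuffled = permute_sentences(sentences, rng)
tokens = " ".join(shuffled).split()
corrupted = infill_text(tokens, mask_ratio=0.3, lam=3.0, rng=rng)
print(corrupted)
```

The model is then trained to reproduce the original, unshuffled text from `corrupted`.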

However, when I corrupt a shorter input sequence, e.g. 5 to 10 sentences, the model output is garbage.

Following the approach described in the T5 paper (Section 3.1.2, "Training", second paragraph), I pack shorter sequences together during training to form nearly full-length sequences, which saves computation on padding tokens.
I suspect this is why the model performs poorly on shorter inputs: it was only ever trained on nearly full-length sequences.
Even so, this was quite unexpected to me.
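
To make the packing step concrete, here is a minimal sketch of what I mean. It is an illustration under assumptions, not the exact training code: the function name `pack_sequences`, the greedy in-order policy, and the choice to terminate every example with an EOS id are all hypothetical details picked for this example.

```python
from typing import Iterable, List


def pack_sequences(examples: Iterable[List[int]], max_len: int,
                   eos_id: int) -> List[List[int]]:
    """Greedily concatenate token-id lists into sequences of at most max_len."""
    packed: List[List[int]] = []
    current: List[int] = []
    for ids in examples:
        # Truncate overlong examples, then append EOS so boundaries stay visible.
        piece = list(ids[: max_len - 1]) + [eos_id]
        # Start a new packed sequence when this piece would not fit.
        if current and len(current) + len(piece) > max_len:
            packed.append(current)
            current = []
        current.extend(piece)
    if current:
        packed.append(current)
    return packed


examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13], [14]]
packed = pack_sequences(examples, max_len=8, eos_id=2)
print(packed)
# [[5, 6, 7, 2, 8, 9, 2], [10, 11, 12, 13, 2, 14, 2]]
```

With this kind of packing almost every training sequence ends up close to `max_len`, so the model rarely, if ever, sees a short input followed by padding.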

So the things I want to discuss are:

Is this behavior normal/expected?
Has anyone come across something similar in other models, e.g. T5 or Pegasus?
Does it even matter that the model performs poorly on shorter inputs on the pre-training task? After all, the point of pre-training is to fine-tune the model later.
