I pre-trained a BART-large model on the sentence shuffling + text infilling task, as in the original paper.
It was trained on my own dataset for 400K steps with a batch size of 256.
Training went smoothly, the loss curve looks fine, and the model does a good job of reconstructing corrupted inputs.
However, when I corrupt a shorter input sequence, e.g. 5 to 10 sentences, the model output is garbage.
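To make "corrupt" concrete, here is a minimal sketch of this kind of noising. It is an illustration under assumptions, not the actual preprocessing code: it works on whitespace-split words rather than subword tokens, and the names `corrupt`, `sample_poisson`, and the `<mask>` string are made up for the example. The 30% mask ratio and Poisson(λ=3) span lengths are the settings described in the BART paper.

```python
import math
import random

MASK = "<mask>"  # placeholder mask string, assumed for this example


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (stdlib only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def corrupt(sentences, mask_ratio=0.3, lam=3.0, seed=0):
    """Shuffle sentences, then replace random word spans with a single mask."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)                      # sentence shuffling
    tokens = " ".join(shuffled).split()
    budget = int(round(len(tokens) * mask_ratio))  # max words to remove
    out, i = [], 0
    while i < len(tokens):
        # Start a span with probability mask_ratio / lam, so that on average
        # roughly mask_ratio of the words end up inside masked spans.
        if budget > 0 and rng.random() < mask_ratio / lam:
            span = min(sample_poisson(lam, rng), budget, len(tokens) - i)
            out.append(MASK)   # a 0-length span inserts a mask, removes nothing
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    short_doc = ["The cat sat.", "It was warm.", "Then it slept."]
    print(corrupt(short_doc, seed=1))
```

With only a handful of sentences, a single masked span can remove a large share of the input, which is worth keeping in mind when judging outputs on short sequences.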
Following the approach proposed in the T5 paper (Section 3.1.2, "Training", second paragraph), during training I merged shorter sequences to form almost full-size sequences, to save computation on padding tokens.
I suspect this is the reason the model performs poorly on shorter input sequences as it was always trained on almost full size sequences.
Yet this was quite unexpected for me.
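The merging of shorter sequences described above can be sketched roughly as follows. This is a simplified greedy version under assumptions, not the exact training pipeline: `pack_sequences`, `max_len`, and `eos_id` are hypothetical names, inputs are already tokenized into ids, and whether attention is masked across the packed segments is a separate choice not shown here.

```python
from typing import List


def pack_sequences(seqs: List[List[int]], max_len: int, eos_id: int) -> List[List[int]]:
    """Greedily concatenate token-id sequences into chunks of at most max_len."""
    packed: List[List[int]] = []
    current: List[int] = []
    for seq in seqs:
        # Truncate overly long examples, then mark the example boundary.
        seq = seq[: max_len - 1] + [eos_id]
        # Start a new chunk if this example would overflow the current one.
        if current and len(current) + len(seq) > max_len:
            packed.append(current)
            current = []
        current.extend(seq)
    if current:
        packed.append(current)
    return packed


if __name__ == "__main__":
    examples = [[1, 2], [3], [4, 5, 6], [7]]
    for chunk in pack_sequences(examples, max_len=6, eos_id=0):
        print(len(chunk), chunk)
```

With packing like this, nearly every training example has length close to `max_len`, so the model rarely, if ever, sees an input that is short as a whole.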
So the things I want to discuss are:
- Is this behavior normal/expected?
- Has anyone come across something similar in other, similar models, e.g. T5, Pegasus, etc.?
- Does it even matter that the model performs poorly on shorter inputs on the pre-training task? After all, the point of pre-training is to fine-tune the model later.