Hi all!
I'm trying to instruct-finetune a T5 model for harmful text classification.
Unfortunately I'm a bit confused about which padding strategy one should use and whether choosing a different one makes any difference.
Right now I'm training with padding=True (i.e. "longest") set in the tokenizer.
To my understanding, padding=True pads each batch to the longest sequence in that batch.
I perform the padding on the fly in the collate function.
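Roughly, my collate looks like this (simplified sketch; the model name and the "text"/"label_text" column names are just placeholders for my actual setup):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

def collate_fn(batch):
    # dynamic padding: pad to the longest sequence in this batch only
    inputs = tokenizer(
        [ex["text"] for ex in batch],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    targets = tokenizer(
        [ex["label_text"] for ex in batch],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    labels = targets["input_ids"]
    # replace pad token ids with -100 so they are ignored by the loss
    labels[labels == tokenizer.pad_token_id] = -100
    inputs["labels"] = labels
    return inputs
```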
In a lot of T5 finetuning implementations, though, the training dataset is tokenized as a preprocessing step with padding="max_length", which pads every sample to the same fixed maximum length.
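As far as I can tell, that variant looks roughly like this (again just a sketch; the max_length values and column names are made up):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

def preprocess(example):
    # pad every example to the same fixed length, independent of the batch
    model_inputs = tokenizer(
        example["text"],
        padding="max_length",
        max_length=512,
        truncation=True,
    )
    labels = tokenizer(
        example["label_text"],
        padding="max_length",
        max_length=16,
        truncation=True,
    )["input_ids"]
    # replace pad token ids in the labels with -100 so they are ignored by the loss
    model_inputs["labels"] = [
        (l if l != tokenizer.pad_token_id else -100) for l in labels
    ]
    return model_inputs

# dataset = dataset.map(preprocess)  # applied once, before training
```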
Now I'm wondering: is there a right or wrong here, and does it make any practical difference which of the two methods I use?
One thing I did notice is that max_length padding increases training time quite a lot compared to longest padding.
Thanks in advance for any advice,
Cheers,
M