I am working on multiclass text classification, currently using XLM-Roberta as a classifier. I have a question about padding strategies.
My first intuition was to tokenize my training and validation sets separately (as if they were two distinct super-batches) using padding=True. As a result, all training examples were padded to length l1 (the length of the longest sequence in the training set), and all validation examples were padded to a different length l2.
An alternative approach (the one that seems to be used in GlueDataset and related methods) is to use padding="max_length", so that all examples are padded to the same provided length (possibly 512, which is the maximum sequence length allowed for this model).
Would you mind sharing your thoughts on which strategy works best and makes more sense from a “theoretical” point of view?
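To make the two strategies concrete, here is a toy sketch in plain Python. It does not call the real tokenizer; pad_batch, the token ids, and the pad_id value are all made up for illustration, and the fixed length is 8 rather than 512 to keep the output readable.

```python
def pad_batch(sequences, pad_id=1, max_length=None):
    """Pad lists of token ids to a common length and build the attention mask.

    max_length=None pads to the longest sequence in the batch (the l1 / l2 case);
    an integer pads, and truncates if needed, to that fixed length.
    pad_id is an arbitrary placeholder, not read from any real tokenizer.
    """
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]                      # truncate if longer than target
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = pad
    return input_ids, attention_mask


# Hypothetical token ids standing in for tokenized text.
train = [[0, 11, 12, 2], [0, 21, 2]]
val = [[0, 31, 32, 33, 34, 2]]

train_ids, train_mask = pad_batch(train)               # padded to l1 = 4
val_ids, val_mask = pad_batch(val)                     # padded to l2 = 6
fixed_ids, fixed_mask = pad_batch(train, max_length=8) # padded to a fixed length
```

In both cases the mask marks exactly the same real tokens; only the number of trailing zeros changes.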
My understanding is that so long as you have your padding mask correctly implemented, the model will not “pay attention” to the pad tokens and so the predictions should be consistent across models (regardless of padding length).
If you do not use a padding mask, the predictions could differ, because the attention weights for the pad tokens will have some effect on your predicted outcome. By having an effect, I mean that the pad tokens receive non-zero attention weight, so they contribute to the output and therefore to the loss.
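A minimal sketch of that argument, for a single query position and scalar values, in plain Python. The attend function and all the numbers are hypothetical; real implementations work on tensors and typically add a large negative number to the masked scores rather than using -inf, but the idea is the same.

```python
import math


def attend(scores, values, mask=None):
    """Softmax-weighted average of values; positions with mask 0 are excluded."""
    if mask is not None:
        scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


# Made-up attention scores and values for three real tokens.
real_scores, real_values = [2.0, 1.0, 0.5], [1.0, 2.0, 3.0]


def padded(n_pad):
    """Append n_pad pad positions with arbitrary scores and values."""
    scores = real_scores + [0.3] * n_pad
    values = real_values + [9.0] * n_pad
    mask = [1] * len(real_scores) + [0] * n_pad
    return scores, values, mask


s2, v2, m2 = padded(2)
s5, v5, m5 = padded(5)

masked_short = attend(s2, v2, m2)   # pads ignored
masked_long = attend(s5, v5, m5)    # pads ignored, same result
unmasked_short = attend(s2, v2)     # pads leak into the output
unmasked_long = attend(s5, v5)      # more pads, more leakage
```

With the mask, the result matches the unpadded computation for any amount of padding; without it, the output drifts as more pad positions are added.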
In terms of time complexity, my understanding is somewhat murkier. I suspect that the first method you suggest is faster, but it depends on the implementation. If we calculate all attention scores and then set those relating to padding to zero, the max_length version could be much slower. However, I think it is reasonable to assume that the implementation performs a check first (i.e. “should I calculate attention here?”), in which case it would be only slightly slower than the l1, l2 padding. This point is open to correction!
I could be incorrect in my approach, but I prefer to pad to max_length for the sheer convenience of it. With the l1, l2 approach, for example, methods such as cross-validation require the length calculation to be repeated for every split.
Thank you all for your answers. While I agree that the specific padding strategy should not affect the results, in my implementation (actually, implementations, as I am testing alternative versions of the code) the training results do differ. The differences are not huge, but I can still see them (even though I am keeping everything else fixed, seeds included).
So I was wondering whether my implementations are wrong, or whether inputs padded to different lengths can somehow have a small numerical effect on the gradients…
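On the numerical side, here is a sketch of why a small effect is at least plausible. Floating-point addition is not associative, so the same numbers summed in a different order can give a slightly different result. Whether different padded lengths actually change the accumulation order in a given backend is an assumption I have not verified for this model; the snippet only shows the underlying arithmetic.

```python
# The same three numbers, added in two different orders.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c
right_to_left = a + (b + c)

# The two results differ in the last bits of the float.
print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

A discrepancy of this size is harmless in one addition, but over many training steps such rounding differences can accumulate into visibly different training curves.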