I can’t figure out why, in the source code for `DataCollatorForSeq2Seq`, `feature['label']` is being overwritten. Doesn’t this break the code when the training samples are shuffled?
I guess you are talking about this line (in the future, please link to the code you are talking about so that we can easily look it up).
It is not changing the given labels but it is padding them to ensure that all items in the batch have the same length. Pad tokens are ignored when the loss is calculated.
It is a collator, so it happens after the shuffling process of the dataloader. It receives a number of items from the dataloader (possibly shuffled) and then collates them (prepares them for the model).
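To make the padding step concrete, here is a minimal sketch of what a seq2seq collator does with labels. This is not the actual Hugging Face implementation (the helper name `collate_labels` is made up for illustration); it only shows the idea: every item in the batch gets its labels padded to the batch’s longest label sequence with `-100`, the value that loss functions such as `torch.nn.CrossEntropyLoss(ignore_index=-100)` skip.

```python
def collate_labels(features, label_pad_token_id=-100):
    # Pad each item's labels to the longest label sequence in this batch.
    # -100 is the conventional "ignore this position" value for the loss.
    max_len = max(len(f["labels"]) for f in features)
    for f in features:
        f["labels"] = f["labels"] + [label_pad_token_id] * (max_len - len(f["labels"]))
    return features

batch = [{"labels": [5, 6, 7]}, {"labels": [9]}]
collate_labels(batch)
# batch[1]["labels"] is now [9, -100, -100]
```

Note that the padding length depends only on the items that happen to land in the same batch, which is why it is done in the collator rather than ahead of time.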
Thanks for your response, and sure, I will link to the line in the future.
Regarding your response: yes, I understand that we are doing padding. The problem is this (and correct me if I’m wrong). You are overwriting the label of the same sample by adding pad tokens, which will be ignored. But consider this case: say the effective length of one sequence is 100, the longest in the batch, while the others are around 20. When such a sample ends up in a batch, you elongate all the other samples to length 100. Since we are randomly shuffling, after a while the majority of samples will have been severely prolonged because of outliers. For my use case, this has huge implications. You may ask why not truncate those outliers; well, that does not solve the issue in the code. That’s why I asked why you are overwriting. Any thoughts?
I still don’t understand what you mean. How would this break anything? It might add some overhead in terms of speed if your batches are shuffled in such a way that a lot of padding needs to happen, but that is an issue you will always encounter with random shuffling.