Why is the code for DataCollatorForSeq2Seq overwriting the labels?

hassanzadeh · August 20, 2021, 5:11pm

Hey Guys,
I can’t figure out why, in the source code for DataCollatorForSeq2Seq, feature["labels"] is being overwritten. Doesn’t that break things when the training samples are shuffled?

BramVanroy · August 21, 2021, 2:08pm

I guess you are talking about this line (in the future please link to the code that you are talking about so that we can easily look it up).

github.com

huggingface/transformers/blob/143738214cb83e471f3a43652617c8881370342c/src/transformers/data/data_collator.py#L281-L287

    
      
          if labels is not None:
              max_label_length = max(len(l) for l in labels)
              padding_side = self.tokenizer.padding_side
              for feature in features:
                  remainder = [self.label_pad_token_id] * (max_label_length - len(feature["labels"]))
                  feature["labels"] = (
                      feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]

It is not changing the given labels but it is padding them to ensure that all items in the batch have the same length. Pad tokens are ignored when the loss is calculated.
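A minimal sketch of that padding step in plain Python, outside the library. The helper name `pad_labels` and the sample values are made up for illustration. The pad id -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss, which is why padded positions do not contribute to the loss.

```python
LABEL_PAD_TOKEN_ID = -100  # assumption: same value as the default ignore_index of the loss


def pad_labels(features, padding_side="right", pad_id=LABEL_PAD_TOKEN_ID):
    """Pad every feature's "labels" list to the longest one in the batch (in place)."""
    max_label_length = max(len(f["labels"]) for f in features)
    for feature in features:
        remainder = [pad_id] * (max_label_length - len(feature["labels"]))
        feature["labels"] = (
            feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]
        )
    return features


batch = [{"labels": [5, 6, 7]}, {"labels": [8]}]
padded = pad_labels(batch)
print(padded)  # [{'labels': [5, 6, 7]}, {'labels': [8, -100, -100]}]
```

The original label tokens are all still there; only the shorter sequence gains trailing pad ids.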

It is a collator, so it happens after the shuffling process of the dataloader. It receives a number of items from the dataloader (possibly shuffled) and then collates them (prepares them for the model).
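The order of operations can be sketched without any library: shuffle first, cut into batches, then collate each batch. Both `collate` and `iterate_batches` are hypothetical helpers written for this illustration, not the dataloader's actual code, and this `collate` returns padded copies rather than modifying its inputs.

```python
import random


def collate(batch, pad_id=-100):
    """Hypothetical collator: return padded copies of the label lists in one batch."""
    longest = max(len(labels) for labels in batch)
    return [labels + [pad_id] * (longest - len(labels)) for labels in batch]


def iterate_batches(dataset, batch_size, seed):
    """Shuffle the indices, cut them into batches, then collate each batch."""
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield collate([dataset[i] for i in order[start:start + batch_size]])


dataset = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4]]
for batch in iterate_batches(dataset, batch_size=2, seed=0):
    # Rows within one batch share a length; that length can differ between batches.
    print([len(row) for row in batch])
```

The padded length is decided per batch, so it depends on which samples the shuffle happened to put together.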

hassanzadeh · August 23, 2021, 2:46pm

Hello Bram,
Thanks for your response, and sure, will mention the line in future.
Regarding your response: yes, I understand that we are padding. The problem is this (and correct me if I’m wrong). You are overwriting the labels of the sample itself by adding pad tokens, which will be ignored. But say the effective length of one sequence is 100, the longest in its batch, while the others are around 20. When such a sample ends up in a batch, you lengthen all the other samples to 100. Since we are shuffling randomly, after a while the majority of samples will have been severely lengthened because of the outliers. For my use case, this has huge implications. You may ask why I don’t just truncate those outliers; well, that does not solve the issue in the code. That’s why I asked why you are overwriting. Any thoughts?
best
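Whether padding can accumulate across epochs like this depends on whether the collator receives the stored example objects themselves or fresh copies. The plain-Python sketch below shows both cases; `pad_in_place` is a hypothetical stand-in for the collator loop. It does not show what any particular dataset class hands out, which would have to be checked for the dataset in use.

```python
import copy


def pad_in_place(features, pad_id=-100):
    """Stand-in for the collator loop: overwrite feature["labels"] with a padded list."""
    longest = max(len(f["labels"]) for f in features)
    for f in features:
        f["labels"] = f["labels"] + [pad_id] * (longest - len(f["labels"]))


# Case 1: the batch holds the very same dict objects as the dataset.
dataset = [{"labels": [1]}, {"labels": [2, 2, 2]}]
pad_in_place([dataset[0], dataset[1]])
shared_len = len(dataset[0]["labels"])  # the stored example was lengthened

# Case 2: each access hands out a fresh copy, so the stored example is untouched.
dataset = [{"labels": [1]}, {"labels": [2, 2, 2]}]
pad_in_place([copy.deepcopy(dataset[0]), copy.deepcopy(dataset[1])])
copied_len = len(dataset[0]["labels"])

print(shared_len, copied_len)  # 3 1
```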

BramVanroy · August 24, 2021, 7:46pm

I still don’t understand what you mean. How would this break anything? It might lead to some overhead in terms of speed if your batches are shuffled in such a way that a lot of padding needs to happen, but that is an issue you will always encounter with random shuffling.
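To put a rough number on that overhead for the lengths mentioned earlier in the thread (one label sequence of 100 tokens, the rest around 20), here is a small plain-Python sketch. `padding_fraction` is an illustrative helper, and the batch size of 8 is an assumption.

```python
def padding_fraction(lengths):
    """Share of positions in a padded batch that are pad tokens."""
    total_slots = max(lengths) * len(lengths)
    return (total_slots - sum(lengths)) / total_slots


# One outlier of length 100 in a batch of 8, the rest of length 20.
frac = padding_fraction([100] + [20] * 7)
print(frac)  # 0.7
```

In that batch, 560 of the 800 label positions would be padding. That costs compute for the batch, but the padded positions are still ignored by the loss.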
