I implemented my own collate_fn to pad both my input and target sequences. My program hangs during evaluation when calling gather(). I checked and found that the batches distributed to each GPU were not consistently padded: each batch is padded to the longest sequence within that batch, so the padded lengths differ across GPUs.
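For reference, my collate_fn looks roughly like this (a simplified sketch; PAD_ID and the (input, target) tuple format stand in for my actual setup):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # placeholder for my actual pad token id

def collate_fn(batch):
    # batch: list of (input_ids, target_ids) 1-D tensors of varying length
    inputs, targets = zip(*batch)
    # pad_sequence pads to the longest sequence *within this batch*,
    # so the padded length can differ from batch to batch (and GPU to GPU)
    inputs = pad_sequence(inputs, batch_first=True, padding_value=PAD_ID)
    targets = pad_sequence(targets, batch_first=True, padding_value=PAD_ID)
    return inputs, targets
```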
I think this must be a common situation when training seq2seq models with distributed data parallelism, but I can't find a solution. Is there a way to ensure consistent padding across GPUs?
Thanks!