Hi all,
I have noticed that DataCollatorForLanguageModeling behaves differently with respect to padding depending on whether dicts or plain token sequences are passed to it:
- If a sequence of Mapping objects is passed, the collator calls the tokenizer's pad method (via the pad_without_fast_tokenizer_warning function) without checking whether padding is actually needed.
- Otherwise, it calls _torch_collate_batch, which first checks whether padding is necessary at all (roughly sketched below).
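If I read the code correctly, the dispatch boils down to something like this (my own simplified paraphrase, not the actual transformers implementation; the exact code differs across versions):

```python
from collections.abc import Mapping

import torch


def collate_like_the_lm_collator(examples, tokenizer):
    """Simplified paraphrase of DataCollatorForLanguageModeling's two code paths."""
    if isinstance(examples[0], Mapping):
        # Dict path: tokenizer.pad() is called unconditionally, and it raises
        # "Asking to pad but the tokenizer does not have a padding token"
        # whenever pad_token is unset, even if no padding would be needed.
        return tokenizer.pad(examples, return_tensors="pt")

    # Plain-tokens path (_torch_collate_batch): pad only when lengths actually differ,
    # so equal-length batches never need a pad token.
    tensors = [torch.tensor(e, dtype=torch.long) for e in examples]
    if all(t.size(0) == tensors[0].size(0) for t in tensors):
        return {"input_ids": torch.stack(tensors)}
    if tokenizer.pad_token is None:
        raise ValueError("Attempting to pad samples but the tokenizer has no pad token.")
    max_len = max(t.size(0) for t in tensors)
    padded = torch.full((len(tensors), max_len), tokenizer.pad_token_id, dtype=torch.long)
    for i, t in enumerate(tensors):
        padded[i, : t.size(0)] = t  # right-padding here; the real code honors tokenizer.padding_side
    return {"input_ids": padded}
```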
So my question is: is this difference intentional?
The reason I am asking is that for my LM training I already have packed sequences of tokens (so no padding is required), but some tokenizers (LLaMA-3, for example) do not have a padding token, and the code fails in one case but not in the other.
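To make it concrete, here is a minimal reproduction sketch (the model id is just an example of a tokenizer whose pad_token is None and assumes access to that repo; any such tokenizer shows the same asymmetry):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
assert tokenizer.pad_token is None

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Packed sequences: every example already has the same length, so no padding is needed.
packed = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Plain token lists -> _torch_collate_batch sees equal lengths and just stacks them.
batch = collator(packed)
print(batch["input_ids"].shape)  # torch.Size([2, 4]) -- works without a pad token

# The same data wrapped in dicts -> tokenizer.pad() is called unconditionally and raises
# "Asking to pad but the tokenizer does not have a padding token".
batch = collator([{"input_ids": ids} for ids in packed])  # ValueError
```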