I have a few quick questions about how the HuggingFace DataCollator classes (DataCollatorForSeq2Seq in particular) process pad tokens.
Data Format:
```
{
    "input_ids": [1, 2, 3, 4, 5, 6, ..., 2],
    "attention_mask": [1, 1, 1, 1, 1, 1, ..., 1],
    "labels": [-100, -100, -100, -100, -100, -100, ..., 2]
}
```
- When data in the above format is passed to DataCollatorForSeq2Seq and padding is performed, does the collator also automatically append 0s to the attention mask to indicate that the padded positions should be ignored by the model's attention mechanism?
- If the data collator does indeed perform the above, does this remain true for models where the `pad_token` is not defined (e.g. LLaMA-3.1-8B)? My current fallback for such models is sketched below.
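For reference, this is the workaround I've been using when `pad_token` is undefined. Part of what I'm asking is whether it is actually required for the collator to produce correct attention masks (reusing eos as pad is my own assumption, not something I've confirmed is the intended approach):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# LLaMA-3.1-8B ships without a pad_token, so I fall back to eos.
# Is this step necessary for DataCollatorForSeq2Seq to pad correctly?
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
```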