I have a few quick questions about how the HuggingFace DataCollator classes (DataCollatorForSeq2Seq in particular) process pad tokens.
Data Format:
```
{
    "input_ids": [1, 2, 3, 4, 5, 6, ..., 2],
    "attention_mask": [1, 1, 1, 1, 1, 1, ..., 1],
    "labels": [-100, -100, -100, -100, -100, -100, ..., 2]
}
```
- When data in the above format is passed to DataCollatorForSeq2Seq and padding is performed, does the collator also automatically append 0s to the attention mask to indicate that the padded positions should be ignored by the model's attention mechanism?
- If the data collator does indeed perform the above, does this remain true for models where the `pad_token` is not defined (e.g. LLaMA-3.1-8B)? My current fallback for such models is sketched below.
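For reference, this is the workaround I've been using when `pad_token` is undefined. Part of what I'm asking is whether it is actually required for the collator to produce correct attention masks (reusing eos as pad is my own assumption, not something I've confirmed is the intended approach):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# LLaMA-3.1-8B ships without a pad_token, so I fall back to eos.
# Is this step necessary for DataCollatorForSeq2Seq to pad correctly?
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
```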