Hello,
I believe I have found a minor error in the documentation on the following page: Causal language modeling.
In the tutorial section about the collator, it is written as follows:
"Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)"
However, it seems to me that DataCollatorForLanguageModeling does not shift the labels relative to input_ids. Instead, it keeps both values the same.
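To make the point concrete, here is a hypothetical minimal sketch of the behavior I am describing (not the actual transformers implementation): with mlm=False, the collator copies input_ids into labels unchanged, only replacing padding positions with -100 so they are ignored by the loss. The pad id and helper name are assumptions for illustration.

```python
PAD_ID = 50256  # assumed pad id, e.g. GPT-2's eos token reused as pad


def collate_causal_lm(batch, pad_id=PAD_ID):
    """Sketch of causal-LM collation: labels are an UNSHIFTED copy of input_ids."""
    max_len = max(len(seq) for seq in batch)
    input_ids, labels = [], []
    for seq in batch:
        padded = seq + [pad_id] * (max_len - len(seq))
        input_ids.append(padded)
        # same tokens as input_ids; pad positions become -100 (ignored by loss)
        labels.append([tok if i < len(seq) else -100
                       for i, tok in enumerate(padded)])
    return {"input_ids": input_ids, "labels": labels}


batch = collate_causal_lm([[10, 11, 12], [20, 21]])
# for the full-length sequence, labels and input_ids are identical
```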
Could there be something I am misunderstanding?
Thank you for reading my message and for considering this potential correction.
I believe this is handled by the forward method of the XXXFormerLMHeadModel classes. For example, GPT2LMHeadModel.forward in transformers.models.gpt2.modeling_gpt2 contains the following snippet:
loss = None
if labels is not None:
    # move labels to correct device to enable model parallelism
    labels = labels.to(lm_logits.device)
    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
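Here is a hypothetical plain-Python illustration of what that shift achieves (just the indexing, no real logits or loss): the logits at position t are paired with the token at position t+1, so the model learns to predict the next token even though labels and input_ids start out identical.

```python
# tokens fed to the model; with mlm=False, labels == input_ids
labels = [5, 6, 7, 8]

# stand-in for lm_logits: one entry per input position
logits = ["logits_pos0", "logits_pos1", "logits_pos2", "logits_pos3"]

shift_logits = logits[:-1]  # predictions from positions 0 .. n-2
shift_labels = labels[1:]   # targets are the NEXT tokens, 1 .. n-1

# each position's prediction is scored against the following token
pairs = list(zip(shift_logits, shift_labels))
```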
You are correct in saying that the label shifting is handled within the model.
Therefore, the part pointed out in the tutorial is indeed an error.
Thank you for providing an example with the code.