In the "Training a causal language model from scratch" tutorial, the model receives the same sequence as both the inputs and the labels, and outputs the logits and the loss, for example:
loss, logits = model(batch["input_ids"], labels=batch["input_ids"])
Is it implied that, before the loss is calculated, the target sequence is the label sequence shifted by one token for us? Said another way: an autoregressive model is supposed to predict the next token, so the target should not be identical to the source; it should be the source shifted by one position, with the first token dropped, so that each position is trained to predict the token that follows it. This is made clearer in the "custom loss" section of the same tutorial:
from torch.nn import CrossEntropyLoss

def keytoken_weighted_loss(inputs, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict n
    shift_labels = inputs[..., 1:].contiguous()
    shift_logits = logits[..., :-1, :].contiguous()
    # Calculate per-token loss (reduction="none" is the current spelling
    # of the deprecated reduce=False)
    loss_fct = CrossEntropyLoss(reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
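To check my understanding of what that shift does, here is a toy sketch of my own (not from the tutorial) showing how the sliced logits and labels line up, so that the prediction at each position is scored against the *next* token:

```python
import torch

# One batch of one sequence: think of the ids as tokens A B C D
inputs = torch.tensor([[10, 11, 12, 13]])
vocab_size = 20
logits = torch.randn(1, 4, vocab_size)  # what a model would output

# Same slicing as in keytoken_weighted_loss
shift_labels = inputs[..., 1:].contiguous()      # targets: B C D
shift_logits = logits[..., :-1, :].contiguous()  # predictions made at A, B, C

print(shift_labels)        # tensor([[11, 12, 13]])
print(shift_logits.shape)  # torch.Size([1, 3, 20])
```

So position 0's logits are compared against token 1, position 1's against token 2, and so on, which is exactly next-token prediction even though one sequence went in.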
So I assume this shifting is also being done for me when I pass the same sequence into model(), or into the Trainer's data_collator:
data_collator=lambda data: {
    'input_ids': torch.stack([f[0] for f in data]),
    'attention_mask': torch.stack([f[1] for f in data]),
    'labels': torch.stack([f[0] for f in data])  # labels are the *unshifted* input_ids
}
It's just confusing to me because, the way that is written, it looks as if we are doing an autoencoder-like task by predicting the input itself, but I know from the definition of autoregressive models that this can't be what we are doing. Please help me sort out this confusion, and point me to where the target sequence gets shifted from the labels, if this does indeed happen. Thanks!
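For reference, my rough mental model of what happens inside the model's forward pass when labels are passed is something like the following. This is a paraphrased sketch based on my reading of the Transformers source for causal LM heads (e.g. GPT2LMHeadModel), not the exact code, and the function name is mine:

```python
import torch
from torch.nn import CrossEntropyLoss

def lm_loss_from_same_sequence(logits, labels):
    """Sketch of the internal shift-and-loss step when labels=input_ids
    is passed to a causal LM head (paraphrased, not the exact source)."""
    # Drop the prediction at the last position (nothing follows it)...
    shift_logits = logits[..., :-1, :].contiguous()
    # ...and drop the first label, so targets are the "next" tokens.
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = CrossEntropyLoss()
    return loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                    shift_labels.view(-1))

# Dummy usage: same tensor plays both roles, yet the loss is next-token loss
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
loss = lm_loss_from_same_sequence(logits, labels)
print(loss.item())
```

If that reading is right, passing labels=input_ids is safe because the model itself performs the shift before computing cross-entropy.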