Reason for discrepancy between loss calculation in XLNetLMHeadModel and GPT2LMHeadModel

Hi,

I am looking at the code used to calculate the loss in these two cases: GPT2LMHeadModel and XLNetLMHeadModel. As far as I understand, both of these heads serve the same purpose: causal language modeling. Given that, the loss calculation in the GPT2 case makes sense; inside the forward method it goes as follows (taken from the code linked earlier):

# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

This makes sense because it’s a causal LM: at every position it predicts the next token based on the previous ones, as described in the comment (# Shift so that tokens < n predict n).
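To make sure I understand the shift, here is a small self-contained sketch with random tensors (the shapes are made up and not taken from the actual models): after the shift, the logits at position i are scored against the label at position i + 1, i.e. tokens < n predict token n.

import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab = 2, 5, 10
lm_logits = torch.randn(batch, seq_len, vocab)      # stand-in for the model's output
labels = torch.randint(0, vocab, (batch, seq_len))  # stand-in for the input ids reused as labels

shift_logits = lm_logits[..., :-1, :].contiguous()  # logits for positions 0 .. n-2
shift_labels = labels[..., 1:].contiguous()         # tokens at positions 1 .. n-1

loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, vocab), shift_labels.view(-1))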

The code for XLNet, however, is different:

# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
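For comparison, here is a minimal sketch of what the unshifted version computes, assuming the logits and labels cover the same sequence positions (again with made-up shapes): position i is scored directly against the label at position i, so any next-token alignment would have to happen before the labels are passed in. This is just my reading of the code, not a statement of how XLNet is meant to be used.

import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab = 2, 5, 10
logits = torch.randn(batch, seq_len, vocab)         # stand-in for the model's output
labels = torch.randint(0, vocab, (batch, seq_len))  # labels for the same positions

# No shift: logits at position i are compared with labels at position i
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, vocab), labels.view(-1))

# If GPT2-style next-token prediction were wanted here, the caller would
# (presumably) have to shift the labels themselves before passing them in:
caller_shifted_loss = loss_fct(
    logits[..., :-1, :].contiguous().view(-1, vocab),
    labels[..., 1:].contiguous().view(-1),
)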

For GPT2, this correction was added in this commit. Does the same shift need to be applied for XLNet, or is XLNet’s loss calculation correct as it is?

Thanks in advance.