Reason for discrepancy between loss calculation in XLNetLMHeadModel and GPT2LMHeadModel

Hi,

I am looking at the code used to calculate the loss in these two cases: GPT2LMHeadModel and XLNetLMHeadModel. As far as I understand, both of these heads serve the same purpose: causal language modeling. Given that, the loss calculation in the GPT2 case makes sense; inside the forward method it goes as follows (taken from the code linked earlier):

# Shift so that tokens < n predict n
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

This makes sense because it’s a causal LM: at every position it predicts the next token based on the previous ones, as described in the comment (# Shift so that tokens < n predict n).
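To make sure I understand the shift, here is a small self-contained sketch with random tensors (the shapes are made up and not taken from the actual models): after the shift, the logits at position i are scored against the label at position i + 1, i.e. tokens < n predict token n.

import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab = 2, 5, 10
lm_logits = torch.randn(batch, seq_len, vocab)      # stand-in for the model's output
labels = torch.randint(0, vocab, (batch, seq_len))  # stand-in for the input ids reused as labels

shift_logits = lm_logits[..., :-1, :].contiguous()  # logits for positions 0 .. n-2
shift_labels = labels[..., 1:].contiguous()         # tokens at positions 1 .. n-1

loss_fct = CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, vocab), shift_labels.view(-1))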

The code for XLNet, however, is different:

# Flatten the tokens
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
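For comparison, here is a minimal sketch of what the unshifted version computes, assuming the logits and labels cover the same sequence positions (again with made-up shapes): position i is scored directly against the label at position i, so any next-token alignment would have to happen before the labels are passed in. This is just my reading of the code, not a statement of how XLNet is meant to be used.

import torch
from torch.nn import CrossEntropyLoss

batch, seq_len, vocab = 2, 5, 10
logits = torch.randn(batch, seq_len, vocab)         # stand-in for the model's output
labels = torch.randint(0, vocab, (batch, seq_len))  # labels for the same positions

# No shift: logits at position i are compared with labels at position i
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, vocab), labels.view(-1))

# If GPT2-style next-token prediction were wanted here, the caller would
# (presumably) have to shift the labels themselves before passing them in:
caller_shifted_loss = loss_fct(
    logits[..., :-1, :].contiguous().view(-1, vocab),
    labels[..., 1:].contiguous().view(-1),
)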

For GPT2, this correction was added in this commit. Does the same shift need to be applied for XLNet, or is XLNet’s loss calculation correct as it is?

Thanks in advance.