CrossEntropyLoss and the loss from Hugging Face T5ForConditionalGeneration do not match

Hello, I am fine-tuning T5ForConditionalGeneration for a question-answering model. In the training step, the loss returned by Hugging Face and the loss I compute myself do not match. I need them to match for an experiment.

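import pytorch_lightning as pl
from torch.nn import CrossEntropyLoss
from transformers import T5ForConditionalGeneration

class UQAFineTuneModel(pl.LightningModule):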
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(
            "allenai/unifiedqa-t5-small", return_dict=True
        )
        self.model.train()
    def forward(
        self,
        source_text_input_ids,
        source_text_attention_mask,
        target_text_input_ids=None,
    ):
        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        source_text_input_ids = batch["source_text_input_ids"]
        source_text_attention_mask = batch["source_text_attention_mask"]
        target_text_input_ids = batch["target_text_input_ids"]
        # labels_attention_mask = batch["target_text_attention_mask"]
        loss, outputs = self(
            source_text_input_ids, source_text_attention_mask, target_text_input_ids
        ) 
        loss_mine = None  

        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        ) 
        labels = batch["target_text_input_ids"].clone() 
        labels[labels == 0] = -100 
        if target_text_input_ids is not None:  
            loss_fct = CrossEntropyLoss(ignore_index=-100) 
            loss_mine = loss_fct(output.logits.view(-1, outputs.size(-1)), labels.view(-1)) 
            print(f"loss_huggingface: {loss.item()}, loss_mine : {loss_mine.item()}") 
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": outputs}

But my loss is different from the Hugging Face loss, even though both use cross-entropy.

@valhalla, @BramVanroy please have a look at this.

Why would you expect this to be identical in the first place? There are too many factors here that are not accounted for (and using a model INSIDE a PL model seems very bad practice to begin with).

  • Your code might not be deterministic. Even for a single model, this will lead to different results in two separate runs.
  • Read the PyTorch documentation on Reproducibility and torch.use_deterministic_algorithms
  • Even when deterministic, the models won’t necessarily be identical
  • Different model implementations will not necessarily produce exactly the same loss at every step, but they should converge in a similar manner
  • Even if the models were identical and you fixed the initial random seed, the outputs would STILL be different because you are training two models at the same time. That means the random state at each step is NEVER THE SAME (or it is highly unlikely to be). You first initialise model 1 and then model 2, but the second is initialised differently because it draws the next numbers from the RNG rather than starting again from the seed.

In sum: you should not expect that two models (different or not) output the exact same loss when run consecutively in the same script.
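The point about consecutive initialisation can be checked with a minimal sketch. It assumes plain PyTorch and uses two small nn.Linear layers as hypothetical stand-ins for "model 1" and "model 2"; it is not the code from the question.

```python
import torch
from torch import nn

# One seed, two layers built back to back: the second layer draws the *next*
# numbers from the generator, so its weights differ from the first.
torch.manual_seed(0)
first = nn.Linear(4, 4)
second = nn.Linear(4, 4)
same_without_reseed = torch.equal(first.weight, second.weight)

# Re-seeding immediately before each construction restarts the stream,
# so both layers receive identical initial weights.
torch.manual_seed(0)
third = nn.Linear(4, 4)
torch.manual_seed(0)
fourth = nn.Linear(4, 4)
same_with_reseed = torch.equal(third.weight, fourth.weight)

print(same_without_reseed, same_with_reseed)
```

Setting the seed once at the top of a script therefore fixes the whole sequence of random draws, not the values each individual model receives.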

Also, for the future: please have more patience and do not tag people so urgently. 24h have not even passed since you posted this question. We are all willing to help, but if every new question was tagging us, it would be very tiresome.

Hey Bram, we’re not comparing the loss between two runs. We’re calculating the loss twice from the same logit scores, in the same step at exactly the same time. Could you check the code and let me know where the issue arises?

I only now understand that you are mixing and matching self() and self.model(). This seems to be a question about PyTorch Lightning rather than Transformers. It is better to ask the people over there, specifically about how their loss function is implemented and, importantly, how they reshape the labels and outputs.

Hey, I asked, but they told me to ask Hugging Face.

My first thought would be that you only add the ignore index when you calculate your own loss. Before that point, the loss calculated within the HF model is based on labels that do not contain -100 yet. AFAIK forward does not modify the given labels. So this should come before both model calls, I think:

labels = batch["target_text_input_ids"].clone() 
labels[labels == 0] = -100 

Can you please elaborate a little bit?

See my comments below. Only where you calculate loss manually you replace 0 with -100. This replacement does not happen in the built-in T5ForConditionalGeneration method so you have to do the replacement beforehand.

# Here you get loss based on "target_text_input_ids" as-is (no ignored index)
loss, outputs = self(
    source_text_input_ids, source_text_attention_mask, target_text_input_ids
)
loss_mine = None

output = self.model(
    input_ids=source_text_input_ids,
    attention_mask=source_text_attention_mask,
    labels=target_text_input_ids,
)

# Here you first set the padding IDs to -100 so that CE will ignore them...
labels = batch["target_text_input_ids"].clone()
labels[labels == 0] = -100
if target_text_input_ids is not None:
    loss_fct = CrossEntropyLoss(ignore_index=-100)
    # ... and THEN you calculate loss
    loss_mine = loss_fct(output.logits.view(-1, outputs.size(-1)), labels.view(-1))
    print(f"loss_huggingface: {loss.item()}, loss_mine : {loss_mine.item()}")

Hi Bram, I tried this but it did not work. I then debugged it and solved the issue; it was a very silly mistake. I was computing my loss on the logits from a different forward pass than the one that produced the Hugging Face loss.

class UQAFineTuneModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(
            "allenai/unifiedqa-t5-small", return_dict=True
        )
        self.model.train()
    def forward(
        self,
        source_text_input_ids,
        source_text_attention_mask,
        target_text_input_ids=None,
    ):
        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        source_text_input_ids = batch["source_text_input_ids"]
        source_text_attention_mask = batch["source_text_attention_mask"]
        target_text_input_ids = batch["target_text_input_ids"]
        # labels_attention_mask = batch["target_text_attention_mask"]
        loss, outputs = self(
            source_text_input_ids, source_text_attention_mask, target_text_input_ids
        ) 
        loss_mine = None  

        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        ) 
        labels = batch["target_text_input_ids"].clone() 
        labels[labels == 0] = -100 
        if target_text_input_ids is not None:  
            loss_fct = CrossEntropyLoss(ignore_index=-100)
            # Previously I mixed output.logits (second forward pass) with outputs
            # (first forward pass). Both must come from the same pass, so use outputs:
            # loss_mine = loss_fct(output.logits.view(-1, outputs.size(-1)), labels.view(-1))
            # It should be:
            loss_mine = loss_fct(outputs.view(-1, outputs.size(-1)), labels.view(-1))
            print(f"loss_huggingface: {loss.item()}, loss_mine : {loss_mine.item()}")
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": outputs}

So it was a PL issue after all. :wink:
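A plausible reason why logits from two forward passes of the same model disagree is dropout: the model is put in training mode with self.model.train(), and T5 configurations normally have a non-zero dropout rate, so each call draws a fresh dropout mask. The thread does not confirm this, so treat it as an assumption. The sketch below uses a hypothetical toy network (not T5) to show the effect.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the real network; only the dropout layer matters here.
toy_model = nn.Sequential(nn.Linear(8, 16), nn.Dropout(p=0.5), nn.Linear(16, 5))
inputs = torch.randn(4, 8)
labels = torch.tensor([1, 0, 4, 2])
loss_fct = nn.CrossEntropyLoss()

toy_model.train()  # dropout active, as after self.model.train() in the thread
logits_first = toy_model(inputs)
logits_second = toy_model(inputs)  # same weights and inputs, new dropout mask

loss_reference = loss_fct(logits_first, labels)
loss_same_pass = loss_fct(logits_first, labels)    # reuses the same logits
loss_other_pass = loss_fct(logits_second, labels)  # logits from a second pass

train_logits_equal = torch.equal(logits_first, logits_second)
same_pass_matches = torch.equal(loss_reference, loss_same_pass)

toy_model.eval()  # dropout disabled: repeated passes become identical
with torch.no_grad():
    eval_logits_equal = torch.equal(toy_model(inputs), toy_model(inputs))

print(train_logits_equal, same_pass_matches, eval_logits_equal)
```

When comparing a hand-written loss against a library loss, reusing the logits from the pass that produced the library loss (or switching to eval mode for the comparison) removes this source of mismatch.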