Cross Entropy Loss and loss of HuggingFace T5ForConditionalGeneration do not match

ayush488 · August 25, 2021, 7:07pm

Hello, I am fine-tuning T5ForConditionalGeneration for a question-answering model, but in the training step the Hugging Face loss and my own loss do not match. I need them to match for an experiment.

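import pytorch_lightning as pl
from torch.nn import CrossEntropyLoss
from transformers import T5ForConditionalGeneration

class UQAFineTuneModel(pl.LightningModule):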
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(
            "allenai/unifiedqa-t5-small", return_dict=True
        )
        self.model.train()
    def forward(
        self,
        source_text_input_ids,
        source_text_attention_mask,
        target_text_input_ids=None,
    ):
        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        source_text_input_ids = batch["source_text_input_ids"]
        source_text_attention_mask = batch["source_text_attention_mask"]
        target_text_input_ids = batch["target_text_input_ids"]
        # labels_attention_mask = batch["target_text_attention_mask"]
        loss, outputs = self(
            source_text_input_ids, source_text_attention_mask, target_text_input_ids
        ) 
        loss_mine = None  

        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        ) 
        labels = batch["target_text_input_ids"].clone() 
        labels[labels == 0] = -100 
        if target_text_input_ids is not None:  
            loss_fct = CrossEntropyLoss(ignore_index=-100) 
            loss_mine = loss_fct(output.logits.view(-1, outputs.size(-1)), labels.view(-1)) 
            print(f"loss_huggingface: {loss.item()}, loss_mine : {loss_mine.item()}") 
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": outputs}

But my loss is different from the Hugging Face loss, even though both use cross-entropy.

ayush488 · August 26, 2021, 7:52am

@valhalla, @BramVanroy Please have a look at this.

BramVanroy · August 26, 2021, 10:11am

Why would you expect this to be identical in the first place? There are too many factors here that are not accounted for (and using a model INSIDE a PL model seems very bad practice to begin with).

Your code might not be deterministic. Even for a single model, this will lead to different results in two separate runs.
Read the PyTorch documentation on Reproducibility and torch.use_deterministic_algorithms.
Even when deterministic, the models won’t necessarily be identical.
Different model implementations will not necessarily have the exact same loss per step, but should converge in a similar manner.
Even if the models were identical, and you fixed the initial random seed, the outputs would STILL be different because you are training two models at the same time. That means that the random numbers drawn at every step are NEVER THE SAME (or that is highly unlikely). So first you init model 1 and then model 2, but those will be initialised differently because the RNG state has already advanced by the time model 2 is initialised.

In sum: you should not expect that two models (different or not) output the exact same loss when run consecutively in the same script.
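
The point about the RNG advancing between two initialisations can be sketched with a toy example. This is only an illustration using Python's standard `random` module as a stand-in for the framework RNG; `init_weights` is a hypothetical helper, not part of PyTorch or Transformers.

```python
import random

# Toy stand-in for a framework-wide RNG (assumption: illustration only).
rng = random.Random(0)

def init_weights(n, rng):
    """Draw n 'weights' from the shared generator, advancing its state."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

model_1 = init_weights(3, rng)  # consumes the first 3 draws
model_2 = init_weights(3, rng)  # consumes the next 3 draws, so it differs

# Re-seeding right before the second initialisation reproduces the first one.
rng.seed(0)
model_2_reseeded = init_weights(3, rng)

print(model_1 == model_2)           # False
print(model_1 == model_2_reseeded)  # True
```

In PyTorch the analogous step would be calling `torch.manual_seed` again before the second initialisation, although that alone does not make every operation deterministic.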

Also, for the future: please have more patience and do not tag people so urgently. 24h have not even passed since you posted this question. We are all willing to help, but if every new question was tagging us, it would be very tiresome.

ayush488 · August 26, 2021, 4:30pm

Hey Bram, we’re not comparing the loss between two runs. We’re calculating the loss twice using the same logit scores, in the same step at exactly the same time. Can you check the code and let me know where the issue arises?

BramVanroy · August 26, 2021, 10:20pm

I only now understand that you are mix-and-matching self() and self.model(). This seems a question that pertains to PyTorch Lightning and not Transformers. Better to ask the people over there, specifically how their loss function is implemented and - importantly - how they reshape the labels and outputs.

ayush488 · August 27, 2021, 9:11am

Hey, I asked, but they told me to ask Hugging Face.

BramVanroy · August 27, 2021, 9:56am

My first thought would be that you only add the ignore index when you calculate your own loss. But before that, the loss that is calculated within the HF model does not have -100 in its labels yet. AFAIK forward does not modify the given labels. So this should be before both model calls, I think:

labels = batch["target_text_input_ids"].clone() 
labels[labels == 0] = -100

ayush488 · August 30, 2021, 1:21pm

Can you please elaborate a little bit?

BramVanroy · August 30, 2021, 2:51pm

See my comments below. You replace 0 with -100 only where you calculate the loss manually. This replacement does not happen in the built-in T5ForConditionalGeneration method, so you have to do the replacement beforehand.

# Here you get loss based on "target_text_input_ids" as-is (no ignored index)
loss, outputs = self(
    source_text_input_ids, source_text_attention_mask, target_text_input_ids
)
loss_mine = None

output = self.model(
    input_ids=source_text_input_ids,
    attention_mask=source_text_attention_mask,
    labels=target_text_input_ids,
)

# Here you first set the padding IDs to -100 so that CE will ignore them...
labels = batch["target_text_input_ids"].clone()
labels[labels == 0] = -100
if target_text_input_ids is not None:
    loss_fct = CrossEntropyLoss(ignore_index=-100)
    # ... and THEN you calculate loss
    loss_mine = loss_fct(output.logits.view(-1, outputs.size(-1)), labels.view(-1))
    print(f"loss_huggingface: {loss.item()}, loss_mine : {loss_mine.item()}")
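
To make the effect of the ignored index concrete, here is a minimal pure-Python sketch of a mean cross-entropy that skips positions labelled -100. It is a toy stand-in for `CrossEntropyLoss(ignore_index=-100)`, not the library implementation; the `cross_entropy` helper, the logits, and the assumption that 0 is the padding id are all made up for illustration.

```python
import math

IGNORE_INDEX = -100  # same sentinel value used in the thread
PAD_ID = 0           # assumed padding id, as in `labels[labels == 0] = -100`

def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the mean
        log_z = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)

# Three target positions over a 3-token vocabulary; the last position is padding.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0]]
raw_labels = [1, 2, PAD_ID]
masked_labels = [IGNORE_INDEX if t == PAD_ID else t for t in raw_labels]

loss_raw = cross_entropy(logits, raw_labels)        # padding counted as a real target
loss_masked = cross_entropy(logits, masked_labels)  # padding skipped
print(f"raw: {loss_raw:.4f}, masked: {loss_masked:.4f}")
```

The two values differ because the unmasked version also averages in the loss at the padding position, which is why the same labels (with -100 already applied) should be passed to both computations.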

ayush488 · September 1, 2021, 3:24am

Hi Bram, I tried this but it did not work. I debugged it and solved the issue; it was a very silly mistake: I was computing the two losses from the logits of two different model calls.

class UQAFineTuneModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(
            "allenai/unifiedqa-t5-small", return_dict=True
        )
        self.model.train()
    def forward(
        self,
        source_text_input_ids,
        source_text_attention_mask,
        target_text_input_ids=None,
    ):
        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        source_text_input_ids = batch["source_text_input_ids"]
        source_text_attention_mask = batch["source_text_attention_mask"]
        target_text_input_ids = batch["target_text_input_ids"]
        # labels_attention_mask = batch["target_text_attention_mask"]
        loss, outputs = self(
            source_text_input_ids, source_text_attention_mask, target_text_input_ids
        ) 
        loss_mine = None  

        output = self.model(
            input_ids=source_text_input_ids,
            attention_mask=source_text_attention_mask,
            labels=target_text_input_ids,
        ) 
        labels = batch["target_text_input_ids"].clone() 
        labels[labels == 0] = -100 
        if target_text_input_ids is not None:
            loss_fct = CrossEntropyLoss(ignore_index=-100)
            # Previously I mixed output.logits (second call) with outputs (first call):
            # loss_mine = loss_fct(output.logits.view(-1, outputs.size(-1)), labels.view(-1))
            # Both should come from the same call, so use outputs throughout:
            loss_mine = loss_fct(outputs.view(-1, outputs.size(-1)), labels.view(-1))
            print(f"loss_huggingface: {loss.item()}, loss_mine : {loss_mine.item()}")
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": outputs}
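
One plausible reason the logits of the two calls disagreed, assuming dropout is active because of `self.model.train()`, is that each forward pass draws a fresh dropout mask. The sketch below is a toy illustration in plain Python; `toy_forward` is a hypothetical function and not the T5 forward pass.

```python
import random

def toy_forward(x, rng, training, p=0.5):
    """Toy 'model': applies inverted dropout in training mode, identity otherwise."""
    if not training:
        return list(x)
    # Each element is zeroed with probability p; survivors are rescaled by 1 / (1 - p).
    return [v / (1.0 - p) if rng.random() >= p else 0.0 for v in x]

rng = random.Random(0)
x = [float(i) for i in range(1, 33)]

train_a = toy_forward(x, rng, training=True)
train_b = toy_forward(x, rng, training=True)  # second pass draws a new dropout mask
eval_a = toy_forward(x, rng, training=False)
eval_b = toy_forward(x, rng, training=False)

print("training passes identical:", train_a == train_b)
print("eval passes identical:", eval_a == eval_b)
```

Under that assumption, reusing the logits from a single call (as in the corrected code above) or switching the model to eval mode for the comparison would both avoid the mismatch.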

BramVanroy · September 1, 2021, 6:29am

So it was a PL issue after all.

sharbel · November 29, 2023, 1:43am

@ayush488 can you please share how you tokenize the Q&A task using T5ForConditionalGeneration? I’m trying to train a model that does Summarization + Q&A and everything trains, but only the Summarizer gives results. I am positive that I am tokenizing the Q&A samples wrong (eg: where does the [start-index] go etc?). When using AutoModelForQuestionAnswering it’s pretty straightforward because there are a lot of examples but I can’t find any for tokenizing Q&A when using T5ForConditionalGeneration. Any hints would be greatly appreciated!
