T5 user-defined loss function

I am fine-tuning T5 for paraphrase generation and want to add a diversity measure for the generated sentences in the loss function. After reading the source code, I still have no clue how to add that.

I know I can generate multiple sentences using:
outs = model.generate(input_ids=batch['source_ids'].cuda(),
max_length=maxlen, do_sample=True, top_k=120,
and I know how to calculate my metrics based on this ‘outs’.
However, I don’t know how to find these outputs in the return value of the ‘forward’ function of ‘T5ForConditionalGeneration’.

Also, I couldn’t find the definition for this ‘generate’ function.

Hi @mengyahu, T5ForConditionalGeneration won’t return the generations; you need to call generate yourself to get the outs. When you pass labels, it calculates the standard cross-entropy loss, here

generate is defined here
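
To make the cross-entropy remark above concrete, here is a minimal pure-Python sketch of a token-level loss that skips positions labelled -100, which is why the example code in this thread replaces pad tokens with -100. It assumes mean reduction over the non-ignored positions, as in PyTorch's CrossEntropyLoss with ignore_index=-100. The names masked_cross_entropy and log_softmax are made up for illustration and are not part of transformers.

```python
import math

IGNORE_INDEX = -100  # label value that is skipped when averaging the loss


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        total -= log_softmax(row)[label]
        count += 1
    # Assumption for this sketch: return 0.0 when every position is ignored.
    return total / count if count else 0.0


# Toy sequence: 3 positions, vocabulary of 4 tokens; the last position is padding.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 3.0, 0.0],
          [1.0, 1.0, 1.0, 1.0]]
labels = [0, 2, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
```

Changing the logits at the padded position leaves the loss unchanged, because that position is never counted.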

Thanks, @valhalla Suraj! This is very helpful!

Could you help me understand the difference between ‘forward’ and ‘_step’ in your example code:

def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
    return self.model(

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(

    loss = outputs[0]

    return loss

My understanding is ‘self(xxxxxx)’ in the ‘_step’ is running the ‘forward’ function defined above and the ‘self.model(xxxxxx)’ in the ‘forward’ function above is running the ‘forward’ function of T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path).

So, to define my own loss function, I need to define it in ‘_step’ like this:

def _step(self, batch):
    labels = batch["target_ids"]
    labels[labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(

    loss1 = outputs[0]

    beam_outputs = self.generate( xxxxxx )
    loss2 = my_metrics(beam_outputs)
    loss = loss1 + loss2
    return loss

Here I use self.generate( xxxxxx ) rather than self.model.generate(xx) because self.model is that pretrained model in the input, right?


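Yes, you are right about the _step method. However, .generate can’t be called on self, because here self is an instance of the LightningModule. Call self.model.generate(...) instead, since self.model is the T5ForConditionalGeneration instance.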
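
As a rough illustration of the call chain discussed here, the sketch below uses plain Python stand-ins rather than the real libraries. TinyModule, FakeT5 and FineTuner are hypothetical classes that only mimic how calling a module object dispatches to its forward method; they are not the actual torch, transformers, or Lightning classes.

```python
class TinyModule:
    """Hypothetical stand-in for torch.nn.Module: calling the object runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class FakeT5(TinyModule):
    """Hypothetical stand-in for T5ForConditionalGeneration."""

    def forward(self, input_ids, labels=None):
        return ("loss-from-inner-forward",)

    def generate(self, input_ids, **kwargs):
        return [[0, 1, 2]]  # pretend token ids


class FineTuner(TinyModule):
    """Hypothetical stand-in for the LightningModule wrapper in the example code."""

    def __init__(self):
        self.model = FakeT5()

    def forward(self, input_ids, labels=None):
        # self.model(...) runs the inner model's forward()
        return self.model(input_ids, labels=labels)

    def _step(self, batch):
        # self(...) runs the wrapper's own forward() defined above
        outputs = self(batch["source_ids"], labels=batch["target_ids"])
        # generate() only exists on the wrapped model, not on the wrapper
        generated = self.model.generate(batch["source_ids"])
        return outputs[0], generated


tuner = FineTuner()
loss, generated = tuner._step({"source_ids": [[5, 6]], "target_ids": [[7, 8]]})
```

In this sketch the wrapper has no generate attribute of its own, which mirrors why self.generate fails inside the LightningModule.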


Just a tip to save you some hassle in the event that you did not already know what I’m about to say.

You’re going to hit a snag in your idea here if you try to pass gradients from this new loss, but of course it is fine for a logging metric.

Gradients cannot flow through a sampling method such as arg max, beam search, or nucleus sampling because the function is non-differentiable. If you train your model with this loss, it will have no bearing on your results.

loss = diversity_loss + lm_loss
loss.backward() # gradients for diversity_loss will all be zero, but your model will still train, so be careful, it is not impacting your training whatsoever!
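
A small numerical illustration of why nothing flows back through decoding, written in plain Python so it runs without any framework. It assumes a toy metric (toy_diversity_score, a made-up function) computed on the token chosen by arg max. Because the chosen token does not change under a tiny change of the logits, the metric is piecewise constant and a finite-difference estimate of its gradient is zero.

```python
def greedy_token(logits):
    """Pick the index of the largest logit (arg max decoding for one position)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def toy_diversity_score(token_id):
    """Hypothetical metric computed on the discrete token id."""
    return float(token_id % 3)


def metric_from_logits(logits):
    """Decode first, then score: the pattern used for loss2 in this thread."""
    return toy_diversity_score(greedy_token(logits))


def finite_difference(fn, logits, index, eps=1e-4):
    """Central-difference estimate of d fn / d logits[index]."""
    up = list(logits)
    down = list(logits)
    up[index] += eps
    down[index] -= eps
    return (fn(up) - fn(down)) / (2 * eps)


logits = [0.1, 2.0, -1.0, 0.5]
grads = [finite_difference(metric_from_logits, logits, i) for i in range(len(logits))]
```

Every entry of grads is 0.0 here, so an optimizer would receive no signal from this term no matter how the metric is weighted.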

Thanks for the reminder, @chrisdoyleIE! This is surprising to me! Could you suggest a solution if I want diversity_loss to influence my model?

You need a differentiable model to do the sampling for you :)

Let V be the set of words in the vocabulary. Some models define a reinforcement learning model with a state space vector x with dimension |V|, such that x_i can be any integer in V, and a discrete action space of all integers in V.

Someone linked a paper from Salesforce which follows this general idea but adds a few useful bells and whistles.

Thanks! I linked the paper. They just defined the loss on the sampling, but did not provide code, so I am not sure how they did it.

I do not know what ‘a differentiable model to do the sampling’ is. Could you give me more details on this?

It’s a whole field within itself and difficult to describe in a paragraph, but I’ll try to point you in the right direction.

Check out reinforcement learning first, then read that Salesforce paper with newfound vigour! The way to make sampling differentiable is to train a function to do this job, such that the input is your probability distribution, and the output is some index in the range [0, |V|).

Beyond this explanation, I’m afraid I can’t offer too much help. Check out some of the papers with reinforcement learning in them here
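
For readers looking for a concrete starting point, one widely used idea from the reinforcement learning literature is the score-function (REINFORCE) estimator. This is offered as an illustration of the general direction, not as the method the poster or the Salesforce paper necessarily uses. The sketch below is plain Python with a three-token vocabulary; the reward function is a made-up stand-in for a non-differentiable score such as a diversity metric. It compares the sampled gradient estimate with the exact gradient of the expected reward.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reward(token_id):
    """Hypothetical non-differentiable reward, e.g. a diversity score."""
    return 1.0 if token_id == 2 else 0.0


def exact_gradient(logits):
    """d E[reward] / d logits, computed by enumerating the tiny vocabulary."""
    p = softmax(logits)
    expected = sum(p[a] * reward(a) for a in range(len(p)))
    return [p[j] * (reward(j) - expected) for j in range(len(p))]


def reinforce_gradient(logits, num_samples, rng):
    """Monte Carlo estimate: average of reward * d log p(a) / d logits."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(p)), weights=p)[0]  # sample a token id
        r = reward(a)
        for j in range(len(p)):
            # For a softmax policy: d log p(a) / d logit_j = 1[a == j] - p_j
            grad[j] += r * ((1.0 if a == j else 0.0) - p[j])
    return [g / num_samples for g in grad]


rng = random.Random(0)
logits = [0.0, 0.5, 1.0]
estimate = reinforce_gradient(logits, 20000, rng)
exact = exact_gradient(logits)
```

The sampling step itself is still not differentiated. The gradient comes from the log-probability of the sampled token, weighted by the reward, which is why the reward can be any black-box score.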


Thanks! This guidance is very helpful!

Hi. Thanks for your explanation.
Do you have a paper, or some keywords to search for, on what you said about training a function to do the sampling?