Thank you for the suggested methods. Regarding the differences in output_logits between the two approaches, I believe they come from the randomness of the sampling process during inference. When comparing the outputs of model.generate (without gradient computation) against the unwrapped `model.generate.__wrapped__` (with grad), I found that the decoded results are almost identical. Interestingly, the token distributions are nearly the same for approximately the first half of the sequence, but gradually diverge as the sequence length increases. This is expected under stochastic sampling: as soon as the two runs sample a different token at some step, every subsequent distribution is conditioned on a different prefix, so the divergence compounds with length.
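For concreteness, here is a minimal sketch of the comparison I ran, assuming a transformers version where generate is decorated with @torch.no_grad() (so functools.wraps exposes `__wrapped__`) and supports output_logits; the model name, prompt, and generation arguments are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" and the prompt are placeholders for the actual model/input.
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
inputs = tokenizer("An example prompt", return_tensors="pt").to(device)

gen_kwargs = dict(
    max_new_tokens=64,
    do_sample=True,                # stochastic sampling, as in my runs
    output_logits=True,            # needs a recent transformers release
    return_dict_in_generate=True,
)

# Standard call: generate() is decorated with @torch.no_grad(), so the
# returned per-step logits carry no grad_fn.
out_no_grad = model.generate(**inputs, **gen_kwargs)

# Bypassing the decorator through __wrapped__ runs the same generation
# loop with autograd enabled; the underlying function is unbound, hence
# the explicit `model` argument.
out_with_grad = model.generate.__wrapped__(model, **inputs, **gen_kwargs)

# Stack the per-step logits into (steps, batch, vocab) tensors, which is
# what the dumps further below show.
logits_no_grad = torch.stack(out_no_grad.logits)
logits_with_grad = torch.stack(out_with_grad.logits)
print(logits_with_grad)  # grad_fn=<StackBackward0>
print(logits_no_grad)
```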
More specifically:
- The token distributions at the earliest steps are identical between the two methods (the first three steps in the dumps below match exactly)
- The divergence grows as the sequence length increases
- The final decoded outputs remain semantically similar despite these numerical differences
- This pattern matches the expected behavior of stochastic sampling, which the sanity check sketched below can confirm
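To rule out everything except the sampling step, one quick sanity check (reusing model, inputs, and gen_kwargs from the sketch above) is to seed the RNG identically before each call, or to drop sampling altogether; with a fixed seed or greedy decoding the two variants should produce matching logits end to end, modulo any non-deterministic CUDA kernels:

```python
import torch

# Same seed before each call: both runs draw the same samples, so the
# generated prefixes (and therefore the per-step logits) should match.
torch.manual_seed(0)
out_a = model.generate(**inputs, **gen_kwargs)
torch.manual_seed(0)
out_b = model.generate.__wrapped__(model, **inputs, **gen_kwargs)

# Or remove randomness entirely: greedy decoding is deterministic, so a
# mismatch here would point at something other than the sampling strategy.
greedy_kwargs = {**gen_kwargs, "do_sample": False}
out_greedy_a = model.generate(**inputs, **greedy_kwargs)
out_greedy_b = model.generate.__wrapped__(model, **inputs, **greedy_kwargs)
```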
```
logits_with_grad: tensor([[[-14.6484, -10.0078, -12.7109, ..., 11.3516, 11.3516, 11.3516]],
[[ -5.3594, -3.2188, -5.5977, ..., 7.5000, 7.5000, 7.5000]],
[[-21.5312, -16.6562, -18.5938, ..., 12.9766, 12.9688, 12.9766]],
...,
[[ 3.4531, 2.1445, -1.0059, ..., -0.3789, -0.3787, -0.3787]],
[[ 14.2656, 3.1895, 0.5615, ..., -1.7119, -1.7119, -1.7119]],
[[ 1.0234, -0.8833, -1.7861, ..., -0.6836, -0.6831, -0.6831]]],
device='cuda:0', grad_fn=<StackBackward0>)

logits_no_grad: tensor([[[-14.6484, -10.0078, -12.7109, ..., 11.3516, 11.3516, 11.3516]],
[[ -5.3594, -3.2188, -5.5977, ..., 7.5000, 7.5000, 7.5000]],
[[-21.5312, -16.6562, -18.5938, ..., 12.9766, 12.9688, 12.9766]],
...,
[[ 3.2891, 4.6953, 1.4385, ..., -1.8242, -1.8242, -1.8242]],
[[ 2.5020, 2.7227, -0.5269, ..., -1.3711, -1.3711, -1.3711]],
[[ 3.3125, 4.7188, 0.6821, ..., -1.4043, -1.4043, -1.4043]]],
device='cuda:0')
```
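For reference, this is roughly how the dumps above can be compared step by step (assuming the stacked tensors from the first sketch); the early steps compare equal and the later ones do not:

```python
# Walk the stacked logits step by step and report where they diverge.
for step, (a, b) in enumerate(zip(logits_with_grad, logits_no_grad)):
    max_diff = (a - b).abs().max().item()
    same = torch.allclose(a, b, atol=1e-4)
    print(f"step {step:3d}: max |a - b| = {max_diff:8.4f}  close={same}")
```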