How to output loss from model.generate()?

I need the probability distribution over generated tokens to compute my own loss function.
In particular, unlike a normal training loss, my loss requires the model to generate sentences the same way it does at test time.

Therefore, I have tried to calculate the loss using model.generate(), but this method does not keep the computational graph needed to compute gradients. Can this be solved simply by passing a special argument to the method? Is there another equally simple solution? Or do I have to implement my own generation function in a way that retains the computational graph?

I can’t answer many of your questions, but I did find this code snippet useful to get a computational graph with generate():

from undecorated import undecorated
from types import MethodType

# Strip the @torch.no_grad() decorator from generate() and rebind the
# undecorated function to the model as a new method
generate_with_grad = undecorated(model.generate)
model.generate_with_grad = MethodType(generate_with_grad, model)

The generate() function has a no_grad decorator that prevents the computational graph from being returned; this code simply removes the decorator and leaves the rest of the generate function unchanged.


Thanks for sharing your simple solution! :heart_eyes:

I’ll give this method a try!

Thanks, tomroth1001!
I got the scores with a computational graph!


Hello, I tried this method to retain the computational graph and it works. However, when I try to backpropagate a loss computed from the generation scores I get an error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Which is triggered somewhere in generation_logits_process.py.

Last few lines of the trace:

next_token_scores_processed = logits_processor(
  File "/home/halamvac/venvs/venv39/lib/python3.9/site-packages/transformers/generation_logits_process.py", line 92, in __call__
    scores = processor(input_ids, scores)
  File "/home/halamvac/venvs/venv39/lib/python3.9/site-packages/transformers/generation_logits_process.py", line 161, in __call__
    score = torch.gather(scores, 1, input_ids)

@mittu @tomroth1001 Did any of you make it work with backpropagating the gradient?

Not sure, sorry. It worked for my case.
The cliché advice is to make sure all your packages are up to date and then try again—might be a bug.

Thanks for the tip, but I already have the latest versions. May I ask what versions of PyTorch and transformers you are using?

I’m using transformers 4.19.2 and torch 1.11.0+cu113, and it still works without error.
Here is my minimal code:

from types import MethodType

from transformers import BartForConditionalGeneration, BartTokenizer
from undecorated import undecorated

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Remove the @torch.no_grad() decorator and rebind generate() to the model
generate_with_grad = undecorated(model.generate)
model.generate_with_grad = MethodType(generate_with_grad, model)

# Any tokenized input works here
input_ids = tokenizer("An example article to summarize.", return_tensors="pt").input_ids

output = model.generate_with_grad(
  input_ids=input_ids,
  output_scores=True,
  return_dict_in_generate=True,
  output_hidden_states=True,
  )

According to the error message, there is an in-place operation somewhere, so it would be a good idea to work backwards, commenting out each computation step line by line from the end to the beginning, to find which step causes the problem. Note that backward() can only be called directly on scalars, so for multi-dimensional tensors it is better to sum or average them as appropriate.
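
For example, a minimal sketch of reducing the scores to a scalar before calling backward(), assuming output comes from the generate_with_grad call above (the mean here is just a placeholder loss):

import torch

# output.scores is a tuple with one (batch, vocab_size) tensor per generated step;
# stack them into a single tensor of shape (steps, batch, vocab_size)
scores = torch.stack(output.scores, dim=0)

# backward() needs a scalar, so reduce the tensor first (placeholder loss)
loss = scores.mean()
loss.backward()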

I have a question.
Are there concerns that temporarily removing @torch.no_grad() could lead to learning during the test phase?

I got the following error while trying the un-decorating method.

TypeError: _DecoratorContextManager.clone() got an unexpected keyword argument 'input_ids'

Package versions

torch == 2.1.0+cu121
transformers == 4.33.2
undecorated == 0.3.0

@jhuang Hi Jiaji, I have the same problem as you. Did you fix it? Thank you~

Yes, this works for me:

output = model.generate.__wrapped__(model, input_ids, output_scores=True, return_dict_in_generate=True)

You should see output.scores[0].requires_grad=True.

Note: model.generate.__wrapped__ is the actual generate() function, with the @no_grad() decorator stripped off. But since the function is a method of a Python class, we have to pass something as its first argument self, so here I simply pass model as the first argument.

The undecorated method didn’t work for me, as it returns the @no_grad decorator itself rather than the actual generate() function.
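
A self-contained sketch of the __wrapped__ approach, assuming a causal LM such as GPT-2 (the model name and prompt are only examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

# Call the undecorated generate() directly, passing the model as `self`
output = model.generate.__wrapped__(
    model,
    input_ids,
    max_new_tokens=10,
    output_scores=True,
    return_dict_in_generate=True,
)

print(output.scores[0].requires_grad)  # True, i.e. the graph is kept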


Hi Jiaji, sorry for the late reply. I have been testing with the method you provided over the past few days, thank you very much!
Thanks again for your quick response.

@liyongkang No problem! However, I do observe some notable differences in gradients compared with the following approach (a sketch follows the list below):

  1. generate as usual with no_grad
  2. forward pass of prompt concatenated with generated tokens.
  3. calculate some loss function and call loss.backward()
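
A minimal sketch of this alternative, assuming a causal LM and a simple cross-entropy loss over the generated tokens (the model name, prompt, and loss are only placeholders):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

# 1. Generate as usual with no_grad (the default behaviour of generate())
with torch.no_grad():
    generated = model.generate(prompt_ids, max_new_tokens=10)

# 2. Forward pass over the prompt concatenated with the generated tokens,
#    this time building a computational graph
logits = model(generated).logits  # (batch, seq_len, vocab_size)

# 3. Some loss over the generated part, then backpropagate.
#    Logits at position i predict token i+1, hence the shift below.
prompt_len = prompt_ids.shape[1]
pred_logits = logits[:, prompt_len - 1:-1, :]
targets = generated[:, prompt_len:]
loss = F.cross_entropy(pred_logits.reshape(-1, pred_logits.size(-1)), targets.reshape(-1))
loss.backward()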

I am curious about your test outcome. Let me know what you find.

Hi Jiaji, maybe it is because output.scores are not the raw logits; you can check this. If you set output_logits=True, generate() will also return the raw logits.
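
For instance, something like the following (output_logits is available in recent transformers versions; the call below reuses the __wrapped__ trick and the variable names are only illustrative):

output = model.generate.__wrapped__(
    model,
    input_ids,
    output_scores=True,        # scores after logits processors/warpers
    output_logits=True,        # raw, unprocessed logits
    return_dict_in_generate=True,
)

raw_logits = output.logits        # tuple: one (batch, vocab_size) tensor per step
processed_scores = output.scores  # same layout, but after processing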

I originally hoped that this gradient would give me more control over the generated content. However, I found that I couldn’t even control the length of the generated text (I can only specify the maximum length, not force the output to an exact length, and if the length is uncertain I cannot calculate my loss), so I switched to other strategies.

Thanks again for your help.

I’m pretty sure they are raw logits in both cases. In fact, the differences between the logits in the two cases are tiny, but backward() produces a notably different gradient. I’d say the steps 1-3 listed above are the safer choice.

Thank you for the suggested methods. Regarding the differences in output_logits between the two approaches, I believe it’s due to randomness in the inference process. When comparing the outputs of model.generate (without gradient computation) and the unwrapped model.generate.__wrapped__ (with grad), I found that their decoded results are almost identical. Interestingly, the token distributions are nearly the same for approximately the first half of the sequence, but gradually diverge as the sequence length increases. This behavior is expected and can be attributed to the randomness introduced by the sampling strategy during generation.

More specifically:

  • The initial token distributions are highly consistent between both methods
  • Divergence increases with sequence length
  • Final outputs remain semantically similar despite minor differences
  • This pattern aligns with the expected behavior of stochastic sampling methods
logits_with_grad: tensor([[[-14.6484, -10.0078, -12.7109,  ...,  11.3516,  11.3516,  11.3516]],

        [[ -5.3594,  -3.2188,  -5.5977,  ...,   7.5000,   7.5000,   7.5000]],

        [[-21.5312, -16.6562, -18.5938,  ...,  12.9766,  12.9688,  12.9766]],

        ...,

        [[  3.4531,   2.1445,  -1.0059,  ...,  -0.3789,  -0.3787,  -0.3787]],

        [[ 14.2656,   3.1895,   0.5615,  ...,  -1.7119,  -1.7119,  -1.7119]],

        [[  1.0234,  -0.8833,  -1.7861,  ...,  -0.6836,  -0.6831,  -0.6831]]],
       device='cuda:0', grad_fn=<StackBackward0>)

logits_no_grad: tensor([[[-14.6484, -10.0078, -12.7109,  ...,  11.3516,  11.3516,  11.3516]],

        [[ -5.3594,  -3.2188,  -5.5977,  ...,   7.5000,   7.5000,   7.5000]],

        [[-21.5312, -16.6562, -18.5938,  ...,  12.9766,  12.9688,  12.9766]],

        ...,

        [[  3.2891,   4.6953,   1.4385,  ...,  -1.8242,  -1.8242,  -1.8242]],

        [[  2.5020,   2.7227,  -0.5269,  ...,  -1.3711,  -1.3711,  -1.3711]],

        [[  3.3125,   4.7188,   0.6821,  ...,  -1.4043,  -1.4043,  -1.4043]]],
       device='cuda:0')