Thank you for the suggested methods. Regarding the differences in output_logits between the two approaches, I believe they come from the randomness of the sampling process during inference. When comparing the outputs of model.generate (without gradient computation) against the unwrapped `model.generate.__wrapped__` (with grad), I found that the decoded results are almost identical. Interestingly, the token distributions are nearly the same for approximately the first half of the sequence, but gradually diverge as the sequence length increases. This is expected under stochastic sampling: as soon as the two runs sample a different token at some step, every subsequent distribution is conditioned on a different prefix, so the divergence compounds with length.
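For concreteness, here is a minimal sketch of the comparison I ran, assuming a transformers version where generate is decorated with @torch.no_grad() (so functools.wraps exposes `__wrapped__`) and supports output_logits; the model name, prompt, and generation arguments are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" and the prompt are placeholders for the actual model/input.
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
inputs = tokenizer("An example prompt", return_tensors="pt").to(device)

gen_kwargs = dict(
    max_new_tokens=64,
    do_sample=True,                # stochastic sampling, as in my runs
    output_logits=True,            # needs a recent transformers release
    return_dict_in_generate=True,
)

# Standard call: generate() is decorated with @torch.no_grad(), so the
# returned per-step logits carry no grad_fn.
out_no_grad = model.generate(**inputs, **gen_kwargs)

# Bypassing the decorator through __wrapped__ runs the same generation
# loop with autograd enabled; the underlying function is unbound, hence
# the explicit `model` argument.
out_with_grad = model.generate.__wrapped__(model, **inputs, **gen_kwargs)

# Stack the per-step logits into (steps, batch, vocab) tensors, which is
# what the dumps further below show.
logits_no_grad = torch.stack(out_no_grad.logits)
logits_with_grad = torch.stack(out_with_grad.logits)
print(logits_with_grad)  # grad_fn=<StackBackward0>
print(logits_no_grad)
```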
More specifically:
- The token distributions at the earliest steps are identical between the two methods (the first three steps in the dumps below match exactly)
- The divergence grows as the sequence length increases
- The final decoded outputs remain semantically similar despite these numerical differences
- This pattern matches the expected behavior of stochastic sampling, which the sanity check sketched below can confirm
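To rule out everything except the sampling step, one quick sanity check (reusing model, inputs, and gen_kwargs from the sketch above) is to seed the RNG identically before each call, or to drop sampling altogether; with a fixed seed or greedy decoding the two variants should produce matching logits end to end, modulo any non-deterministic CUDA kernels:

```python
import torch

# Same seed before each call: both runs draw the same samples, so the
# generated prefixes (and therefore the per-step logits) should match.
torch.manual_seed(0)
out_a = model.generate(**inputs, **gen_kwargs)
torch.manual_seed(0)
out_b = model.generate.__wrapped__(model, **inputs, **gen_kwargs)

# Or remove randomness entirely: greedy decoding is deterministic, so a
# mismatch here would point at something other than the sampling strategy.
greedy_kwargs = {**gen_kwargs, "do_sample": False}
out_greedy_a = model.generate(**inputs, **greedy_kwargs)
out_greedy_b = model.generate.__wrapped__(model, **inputs, **greedy_kwargs)
```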
```
logits_with_grad: tensor([[[-14.6484, -10.0078, -12.7109, ..., 11.3516, 11.3516, 11.3516]],
[[ -5.3594, -3.2188, -5.5977, ..., 7.5000, 7.5000, 7.5000]],
[[-21.5312, -16.6562, -18.5938, ..., 12.9766, 12.9688, 12.9766]],
...,
[[ 3.4531, 2.1445, -1.0059, ..., -0.3789, -0.3787, -0.3787]],
[[ 14.2656, 3.1895, 0.5615, ..., -1.7119, -1.7119, -1.7119]],
[[ 1.0234, -0.8833, -1.7861, ..., -0.6836, -0.6831, -0.6831]]],
device='cuda:0', grad_fn=<StackBackward0>)

logits_no_grad: tensor([[[-14.6484, -10.0078, -12.7109, ..., 11.3516, 11.3516, 11.3516]],
[[ -5.3594, -3.2188, -5.5977, ..., 7.5000, 7.5000, 7.5000]],
[[-21.5312, -16.6562, -18.5938, ..., 12.9766, 12.9688, 12.9766]],
...,
[[ 3.2891, 4.6953, 1.4385, ..., -1.8242, -1.8242, -1.8242]],
[[ 2.5020, 2.7227, -0.5269, ..., -1.3711, -1.3711, -1.3711]],
[[ 3.3125, 4.7188, 0.6821, ..., -1.4043, -1.4043, -1.4043]]],
device='cuda:0')
```
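For reference, this is roughly how the dumps above can be compared step by step (assuming the stacked tensors from the first sketch); the early steps compare equal and the later ones do not:

```python
# Walk the stacked logits step by step and report where they diverge.
for step, (a, b) in enumerate(zip(logits_with_grad, logits_no_grad)):
    max_diff = (a - b).abs().max().item()
    same = torch.allclose(a, b, atol=1e-4)
    print(f"step {step:3d}: max |a - b| = {max_diff:8.4f}  close={same}")
```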