Generate without using the generate method

That might be because your loop doesn’t cache anything between steps, if I understand correctly. You would need to keep the past_key_values returned by each forward call and pass them back in on the next one, making sure use_cache is True (either in your model config or as an argument to the forward call).

Otherwise, in the above snippet you’re re-computing the entire past sequence every time you want the next token, even though causal attention means the hidden states (and so the attention keys and values) of all past positions are constant.
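For what it's worth, here is a rough sketch of what a manual greedy loop with caching could look like. It is not the original poster's code: the tiny randomly initialised GPT-2, the `greedy_decode` helper, and the prompt token ids are all made up for illustration, so that the example runs without downloading any weights. It also assumes batch size 1 with no padding, so no attention mask is passed.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Assumption: a tiny random GPT-2 stands in for your real model.
torch.manual_seed(0)
config = GPT2Config(
    vocab_size=100, n_positions=64, n_embd=32, n_layer=2, n_head=2,
    bos_token_id=0, eos_token_id=0,
)
model = GPT2LMHeadModel(config).eval()


@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens, use_cache=True):
    """Hypothetical helper: greedy decoding without calling model.generate()."""
    generated = input_ids
    next_input = input_ids      # first step: feed the whole prompt
    past_key_values = None      # nothing cached yet
    for _ in range(max_new_tokens):
        out = model(
            input_ids=next_input,
            past_key_values=past_key_values,
            use_cache=use_cache,
        )
        # Greedy choice: most likely token at the last position.
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        if use_cache:
            # Reuse the cached keys/values and feed only the new token.
            past_key_values = out.past_key_values
            next_input = next_token
        else:
            # No cache: the full sequence is re-encoded on every step.
            next_input = generated
    return generated


prompt = torch.tensor([[1, 2, 3, 4]])  # made-up token ids
cached = greedy_decode(model, prompt, max_new_tokens=5, use_cache=True)
print(cached)
```

Calling `greedy_decode(..., use_cache=False)` runs the slow path described above. Both variants should pick the same tokens up to floating-point effects, but the cached one only pushes a single new token through the model per step.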