How to compute per-token loss in language modeling?

I’m fine-tuning GPT-2 on a language modeling task. Given a sequence, I’d like to compute the per-token loss for each token (instead of the averaged loss over the sequence). How can I do it?
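In case it helps frame the question: the model's default loss averages the cross-entropy over all positions, while what's wanted here is the unreduced version. In PyTorch this is what `torch.nn.CrossEntropyLoss(reduction="none")` gives you, applied to the logits and labels shifted by one position (position t's logits predict token t+1). Here is a minimal NumPy sketch of that shift-and-unreduced-cross-entropy logic, with dummy logits standing in for the GPT-2 output (`per_token_loss` is a hypothetical helper name, not a library function):

```python
import numpy as np

def per_token_loss(logits, labels):
    """Per-token cross-entropy for a causal LM.

    logits: (seq_len, vocab_size) array of unnormalized scores
    labels: (seq_len,) array of token ids
    Returns an array of length seq_len - 1 (one loss per predicted token).
    """
    # Position t predicts token t+1, so drop the last logit row
    # and the first label.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]
    # Numerically stable log-softmax over the vocab dimension.
    m = shift_logits.max(axis=-1, keepdims=True)
    log_z = m + np.log(np.exp(shift_logits - m).sum(axis=-1, keepdims=True))
    log_probs = shift_logits - log_z
    # Negative log-probability of each target token.
    return -log_probs[np.arange(len(shift_labels)), shift_labels]
```

With GPT-2 from `transformers`, the same idea applies to `model(input_ids).logits`: shift logits/labels as above and call `CrossEntropyLoss(reduction="none")` instead of passing `labels=` to the model (which returns only the averaged scalar loss). Averaging the per-token values should recover that scalar.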
