How to compute per-token loss when doing language modeling?

I’m fine-tuning GPT-2 on a language modeling task. Given a sequence, I’d like to compute the loss for each individual token (instead of the loss averaged over the whole sequence). How can I do it?


Hey, it’s been some time since you posted this. Were you able to do it? If yes, would you be able to share code for it?

In PyTorch, you can use the CrossEntropyLoss function with the reduction parameter set to "none" ( CrossEntropyLoss — PyTorch 2.0 documentation ).
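For example, something along these lines (a minimal sketch; the model name, example sentence, and variable names are just placeholders):

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
input_ids = inputs["input_ids"]

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Shift so that each position predicts the *next* token
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]

# reduction="none" keeps one loss value per token instead of averaging
loss_fct = CrossEntropyLoss(reduction="none")
per_token_loss = loss_fct(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
).view(shift_labels.size())  # (batch, seq_len - 1)

print(per_token_loss)
```

If you only want the loss on real tokens, you can additionally mask out padding positions before averaging or inspecting the values.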

You can find an example in a HF course: Training a causal language model from scratch - Hugging Face NLP Course (search for keytoken_weighted_loss to find it)
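The idea in that example is to compute the per-token loss first and then weight it before averaging. A paraphrased sketch of that pattern (not the exact course code; the function signature and names here are just illustrative):

```python
import torch
from torch.nn import CrossEntropyLoss

def keytoken_weighted_loss(input_ids, logits, keytoken_ids, alpha=1.0):
    # Shift so that tokens < n predict token n
    shift_labels = input_ids[:, 1:]
    shift_logits = logits[:, :-1, :]

    # Per-token loss via reduction="none"
    loss_fct = CrossEntropyLoss(reduction="none")
    loss = loss_fct(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    ).view(shift_labels.size())           # (batch, seq_len - 1)

    # Average per sample, then up-weight samples containing the key tokens
    loss_per_sample = loss.mean(dim=1)
    weights = torch.stack(
        [(input_ids == kt).float().sum(dim=1) for kt in keytoken_ids]
    ).sum(dim=0)
    weights = alpha * (1.0 + weights)
    return (loss_per_sample * weights).mean()
```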


Thanks for your reply @chrisociepa. Yes, I was able to make it work with reduction='none'.