I’m fine-tuning GPT-2 on a language modeling task. Given a sequence, I’d like to compute the per-token loss for each token (instead of the averaged loss over the sequence). How can I do it?
Hey, it’s been some time since you posted this. Were you able to do it? If yes, would you be able to share code for it?
In PyTorch, you can use CrossEntropyLoss with the reduction parameter set to 'none' ( CrossEntropyLoss — PyTorch 2.0 documentation ). That returns one loss value per token instead of a single averaged value.
You can find an example in the HF course: Training a causal language model from scratch - Hugging Face NLP Course (search for keytoken_weighted_loss to find it).
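For anyone landing here later, here is a minimal sketch of the idea. It is not the course's code: the helper name per_token_loss is made up for illustration, and random logits stand in for the model output so the snippet runs without downloading GPT-2. With a real model you would presumably pass model(input_ids).logits instead.

```python
import torch
import torch.nn.functional as F


def per_token_loss(logits, input_ids, attention_mask=None):
    """Hypothetical helper: cross-entropy for each predicted token.

    logits:    (batch, seq_len, vocab) scores from a causal LM
    input_ids: (batch, seq_len) token ids that were fed to the model
    Returns a (batch, seq_len - 1) tensor of unreduced losses.
    """
    # In a causal LM, position t predicts token t + 1, so drop the last
    # logit and the first label to line the two up.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]

    # reduction="none" keeps one loss per token instead of averaging.
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    loss = loss.reshape(shift_labels.shape)

    # Optionally zero out padding positions (assumes a 0/1 mask).
    if attention_mask is not None:
        loss = loss * attention_mask[:, 1:].to(loss.dtype)
    return loss


# Stand-in data: random logits in place of a real model's output.
torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 11
logits = torch.randn(batch, seq_len, vocab)
input_ids = torch.randint(0, vocab, (batch, seq_len))

token_losses = per_token_loss(logits, input_ids)
print(token_losses.shape)  # one loss per predicted token: (batch, seq_len - 1)
```

Without a mask, taking token_losses.mean() gives the same number as the default reduction='mean', so you can sanity-check the per-token values against the usual averaged loss.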
Thanks for your reply @chrisociepa. Yes, I was able to get it working with reduction='none'.