Fine-tuning DistilGPT2 on custom data, training accuracy 100%, output is garbage

I more or less know what I'm doing with ML and NLP, but I'm new to Hugging Face Transformers models.

I'm trying to fine-tune a DistilGPT2 model on some custom data. I'm getting 100% training and validation accuracy, but when I try to generate output from new prompts (causal language modeling), the output is just a bunch of blank spaces.

This is for a portfolio project, but I've lost five days on it and feel like I'm going to lose my mind :cry:. Can anyone please take a look and tell me what I'm doing wrong? Hopefully it's just some dumb mistake.

I made the notebook public at: Generate Book Reviews with Transformers | Kaggle

Thank you.

You got a warning on the last cell:

“The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input’s attention_mask to obtain reliable results.”

Thank you so much for taking the time to reply, but unfortunately, that's not the problem.

I researched that warning before, and it's benign here, because HF automatically fills in the attention mask and pad token id for you if they're missing (if anyone is curious, the code that handles this is transformers/src/transformers/generation/utils.py at 8e164c5400b7b413c7b8fb32e35132001effc970 · huggingface/transformers · GitHub).

However, I still wanted to fix the warning, which I do now in the updated notebook, but the problem remains: the output still just appends a ton of spaces to the end.
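For anyone who wants to silence the warning too, this is roughly what the updated generation cell does (a sketch, not the exact notebook code; the prompt and length are placeholders):

```python
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("This book was", return_tensors="tf")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # pass the mask explicitly
    pad_token_id=tokenizer.eos_token_id,      # set the pad id explicitly
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```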

How can this be achieving 100% val accuracy if it's just adding spaces?

And what does “accuracy” even mean in the context of causal language modeling? I figured it meant the model learned to predict the next token with 100% accuracy, even on the validation set, but now that I think about it, that by itself seems impossible.

I figured it out!!!

The issue was that I specified my own loss, using Keras' SparseCategoricalCrossentropy(from_logits=True). I had read that the Hugging Face models don't require you to specify a loss function, but that you still could. I guess I was testing my understanding that it should be cross-entropy loss when I decided to define it anyway.
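For context, this is roughly what my compile call looked like (a sketch; the optimizer settings are placeholders, not the notebook's exact values):

```python
import tensorflow as tf
from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

# The problematic version: a plain Keras loss that scores every token
# position, padding included.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```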

Well, after digging into the source code, I found that the HF TF models compute their own loss via a TFCausalLanguageModelingLoss mixin instead. It is basically the SparseCategoricalCrossentropy(from_logits=True) loss, but with one crucial addition: positions labeled -100, which as far as I can tell mark the padding, are masked out before the loss is computed. My plain Keras loss wasn't masking them, so the model was also being graded on predicting padding, and since padding dominates my sequences, always predicting the pad token earns near-perfect token accuracy. That explains both the 100% metric and the all-spaces generations.
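If it helps anyone, the masking is roughly equivalent to this (my paraphrase, not the exact source; the function name is mine):

```python
import tensorflow as tf

def masked_clm_loss(labels, logits):
    # Per-token cross-entropy, with positions labeled -100 (padding)
    # dropped before the loss is computed.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE
    )
    flat_labels = tf.reshape(labels, [-1])
    flat_logits = tf.reshape(logits, [-1, tf.shape(logits)[-1]])
    active = tf.not_equal(flat_labels, -100)  # keep only real tokens
    return loss_fn(
        tf.boolean_mask(flat_labels, active),
        tf.boolean_mask(flat_logits, active),
    )
```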

Simply not specifying a loss function let it train properly, and it's generating text now! Thank you for looking at this with me. Even though fixing that earlier warning wasn't the issue, starting down that path is how I eventually got to thinking about the loss function being computed, which led to the solution. :grin:
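Concretely, the fix was just dropping the loss= argument (again a sketch; the optimizer settings are placeholders):

```python
import tensorflow as tf
from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

# No `loss=` argument: the model falls back to its internal loss
# computation, which masks out the -100 (padding) positions.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
```

After compiling like this, model.fit() on the tokenized dataset trains with the masked loss, and generation produces actual text instead of spaces.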
