Fine-tuning DistilGPT2 on custom data: training accuracy 100%, output is garbage

I generally know what I’m doing with ML and NLP, but I’m new to Hugging Face Transformers models.

I’m trying to fine-tune a DistilGPT2 model on some custom data. I’m getting 100% training and validation accuracy, but when I try to generate output from new prompts (causal language modeling), the output is just a bunch of blank spaces.

This is for a portfolio project, but I’ve lost 5 days on it and feel like I’m gonna lose it :cry:. Can anyone please take a look and tell me what I’m doing wrong? Hopefully it’s just some dumb mistake.

I made the notebook public at: Generate Book Reviews with Transformers | Kaggle

Thank you.

You got a warning on the last cell:

“The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input’s attention_mask to obtain reliable results.”

Thank you so much for taking the time to reply, but unfortunately, that’s not the problem.

I researched that warning before and it’s benign, because HF automatically fixes it for you if these are missing (if anyone is curious, the code where this is handled is in transformers/src/transformers/generation/utils.py at commit 8e164c5400b7b413c7b8fb32e35132001effc970 of huggingface/transformers on GitHub).

However, I still wanted to fix the warning, which I now do in the updated notebook, but the problem remains: the output just adds a ton of spaces to the end.

How can this be achieving 100% val accuracy if it’s just adding spaces?

And what does “accuracy” even mean in the context of causal language modeling? I figured it meant the model learned to predict the next word 100% accurately even on the validation set, but now that I think about it, that itself seems impossible.
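For anyone reading along, here is a minimal sketch of one common reading of token-level accuracy in causal language modeling: the prediction made at position t is scored against the token at position t+1. The function names and token ids below are hypothetical, and this is not taken from the notebook or from the library. It also shows one way a suspiciously perfect score can arise: if predictions are compared against unshifted targets, a model that simply echoes its input scores 100%.

```python
def next_token_accuracy(predicted_ids, input_ids):
    """Fraction of positions t where the prediction made at t equals token t+1."""
    # Shift by one: the last prediction has no target, so it is dropped,
    # and the first token is never a target.
    preds = predicted_ids[:-1]
    targets = input_ids[1:]
    correct = sum(p == t for p, t in zip(preds, targets))
    return correct / len(targets)


def unshifted_accuracy(predicted_ids, input_ids):
    """Scores the prediction at position t against token t itself (no shift)."""
    correct = sum(p == t for p, t in zip(predicted_ids, input_ids))
    return correct / len(input_ids)


tokens = [11, 42, 7, 99, 5]   # hypothetical token ids
copy_model = list(tokens)     # a "model" that just echoes its input

print(unshifted_accuracy(copy_model, tokens))   # 1.0
print(next_token_accuracy(copy_model, tokens))  # 0.0
```

A perfect score under the unshifted comparison says nothing about whether the model can continue a prompt, which is why it is worth checking how the metric lines up predictions and targets.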

I figured it out!!!

The issue was that I specified my own loss, using Keras’ SparseCategoricalCrossentropy(from_logits=True). I had read that the Hugging Face models did not need you to specify a loss function, but that you still could. I guess I was testing my understanding that it should be cross-entropy loss when I decided to define it anyway.

Well, after digging into the source code, I found that the HF model uses its own loss, defined in the class TFCausalLanguageModelingLoss, instead. It is basically the SparseCategoricalCrossentropy(from_logits=True) loss, but with additional masking of labels set to -100, which I believe are used for padding. Those positions need to be left out when calculating the loss, and my plain loss function was not doing that.
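To make the masking idea concrete, here is a minimal pure-Python sketch of a cross-entropy that skips positions labeled -100. It only illustrates the arithmetic under that assumption; it is not the library's implementation, and the helper names and toy logits are hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def masked_cross_entropy(logits_per_token, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for logits, label in zip(logits_per_token, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        total += -log_softmax(logits)[label]
        count += 1
    return total / count if count else 0.0


# Toy example: 3 positions, vocabulary of size 3; the last position is masked.
logits = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [5.0, 0.0, 0.0],  # masked position, its logits are never scored
]
labels = [0, 1, IGNORE_INDEX]

loss = masked_cross_entropy(logits, labels)
print(loss)
```

Changing the logits at the masked position leaves the loss unchanged, which is the property a plain, unmasked cross-entropy does not have.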

Simply not specifying a loss function let it train properly, and it is now generating text! Thank you for looking at this with me. Even though fixing that earlier warning wasn’t the issue, starting down that path is how I eventually got to thinking about the loss function it was computing, which led to the solution. :grin:
