I generally know what I'm doing with ML and NLP, but I'm new to Hugging Face Transformer models.
I'm trying to fine-tune a DistilGPT2 model on some custom data. I'm getting 100% training and validation accuracy, but when I try to generate output from new prompts (causal language modeling), the output is just a long run of blank spaces.
This is for a portfolio project, but I've lost five days on it and feel like I'm going to lose it. Can anyone please take a look and tell me what I'm doing wrong? Hopefully it's just some dumb mistake.
During generation I was seeing this warning: "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results."
I still wanted to fix the warning, which I do now in the updated notebook, but the problem remains: the output just adds a ton of spaces to the end.
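For anyone hitting the same warning: conceptually, the attention mask is just a 0/1 array marking which positions are real tokens versus padding. A minimal dependency-free sketch of that convention (the token ids and the choice of pad id here are illustrative assumptions; GPT-2 has no pad token by default, and a common workaround is to reuse the EOS token as the pad token):

```python
def build_attention_mask(input_ids, pad_token_id):
    """1 for real tokens, 0 for padding, per the Hugging Face convention."""
    return [[0 if tok == pad_token_id else 1 for tok in seq] for seq in input_ids]

# Illustrative batch: two sequences right-padded with a made-up pad id (50256,
# i.e. GPT-2's EOS id, reused as padding).
batch = [[15496, 995, 50256, 50256],
         [40, 1842, 299, 32183, 50256]]
mask = build_attention_mask(batch, pad_token_id=50256)
```

In practice the tokenizer builds this for you (`tokenizer(..., padding=True)` returns an `attention_mask` you can pass to `generate`), but it helps to know that this is all the mask is.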
How can it be achieving 100% validation accuracy if it's just adding spaces?
And what does "accuracy" even mean in the context of causal language modeling? I assumed it meant the model had learned to predict the next token with 100% accuracy, even on the validation set, but now that I think about it, that seems impossible.
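One way padding can inflate a token-level accuracy metric: if padded positions are counted as ordinary targets, a model that only learns to emit the pad token still scores highly. A toy illustration with made-up numbers (not my actual data):

```python
# Suppose a 10-token sequence: 3 real tokens, then 7 pad positions,
# and the model has learned to predict the pad token everywhere.
pad_id = 0
labels      = [5, 9, 2] + [pad_id] * 7
predictions = [pad_id] * 10

# Naive accuracy counts the pad positions as correct predictions.
naive_acc = sum(p == l for p, l in zip(predictions, labels)) / len(labels)

# Masked accuracy only scores the real tokens.
real = [(p, l) for p, l in zip(predictions, labels) if l != pad_id]
masked_acc = sum(p == l for p, l in real) / len(real)
```

Here `naive_acc` is 0.7 even though the model got every real token wrong (`masked_acc` is 0.0), which is the same failure mode in miniature.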
The issue was that I specified my own loss: Keras's SparseCategoricalCrossentropy(from_logits=True). I had read that the Hugging Face models don't require you to specify a loss function, but that you still could. I guess I was testing my understanding that it should be cross-entropy loss when I decided to define one anyway.
Well, after digging into the source code, I found that the HF model computes its own loss (via the TFCausalLanguageModelingLoss class). It is basically SparseCategoricalCrossentropy(from_logits=True), but with an additional step that masks out positions labeled -100, which are used for padding and other tokens that shouldn't contribute to the loss. My plain Keras loss was treating those -100 positions as real targets, so every padded position was being trained on garbage labels.
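To make the difference concrete, here is a dependency-free sketch of what that -100 masking does (this is my own illustration of the idea, not the actual HF implementation, and the probabilities are made up):

```python
import math

def masked_sparse_ce(labels, log_probs):
    """Sparse cross-entropy that drops positions labeled -100 before
    averaging -- the extra step a plain Keras loss does not do."""
    kept = [(lbl, lp) for lbl, lp in zip(labels, log_probs) if lbl != -100]
    return -sum(lp[lbl] for lbl, lp in kept) / len(kept)

# Two positions: one real token (label 1), one padded position (-100).
log_probs = [[math.log(0.2), math.log(0.8)],
             [math.log(0.5), math.log(0.5)]]
loss = masked_sparse_ce([1, -100], log_probs)  # only the first position counts
```

A plain SparseCategoricalCrossentropy has no concept of -100 as "ignore this position", so it either crashes on the out-of-range label or silently trains on it, depending on how the labels are prepared.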
Simply not specifying a loss function let it train properly, and it's now generating text! Thank you for looking at this with me. Even though fixing that earlier warning wasn't the issue, starting down that path is how I eventually got to thinking about the loss function being computed, which led to the solution.