I’m trying to find the existing evaluation results for training GPT-2 on WikiText-2. In the GPT-2 model card, the perplexity is reported as 29.41, whereas this blog post by OpenAI reports a perplexity of 18.34 for the same task.
I was wondering whether this difference is due to a different loss (Hugging Face uses the causal language modeling loss)?
No, the difference is in which model is evaluated. The model card takes the results reported in the paper for the smallest GPT-2 model; the PPL of 18.34 is for the largest one, which is gpt2-xl on the Hub.
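For reference, here is a minimal sketch of how causal-LM perplexity can be computed with transformers; the sample text is a placeholder (a proper WikiText-2 evaluation would use a sliding window over the full test set), and you would swap `"gpt2"` for `"gpt2-xl"` to evaluate the largest variant:

```python
# Hedged sketch: causal-LM perplexity for GPT-2 via Hugging Face transformers.
# The sample text is a placeholder, not the WikiText-2 test set.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # smallest variant; use "gpt2-xl" for the largest
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # causal-LM cross-entropy loss; perplexity is exp(loss).
    out = model(**enc, labels=enc["input_ids"])

ppl = math.exp(out.loss.item())
print(f"{model_name} perplexity on sample text: {ppl:.2f}")
```

On the full WikiText-2 test set you would concatenate the text, evaluate in strided windows, and average the token-level losses before exponentiating.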
Thanks a lot for the reply.
I was also wondering how many epochs you would suggest for training GPT-2 from scratch so that it reaches a PPL of 29.41?
You won’t reach that PPL without training on a much larger dataset, as OpenAI did.