Evaluation results for GPT-2 on WikiText-2

Hello everyone,

I’m trying to find the existing evaluation results for GPT-2 on WikiText-2. The GPT-2 model card says the perplexity is 29.41, whereas this blog post by OpenAI reports 18.34 for the same task.

I was wondering whether this difference is due to a different loss (Hugging Face used the causal language modeling loss)?

No, the difference is in which model is evaluated. The model card takes the results reported in the paper for the smallest GPT-2 model, while the PPL of 18.34 is for the largest one, which is gpt2-xl on the Hub.
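
For reference, here is a rough sketch of how one could check both numbers, using the sliding-window perplexity evaluation from the Transformers docs. The stride of 512 and the use of the raw WikiText-2 test split are assumptions on my part, and the token accounting is approximate, so the result may differ slightly from either reported figure.

```python
import math
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "gpt2"  # swap in "gpt2-xl" to get closer to the 18.34 figure

model = GPT2LMHeadModel.from_pretrained(model_id).to(device).eval()
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = model.config.n_positions  # 1024 for GPT-2
stride = 512                           # assumed; not necessarily OpenAI's protocol
seq_len = encodings.input_ids.size(1)

nll_sum, n_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                  # only score tokens not seen before
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100           # mask the overlapping context
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss  # mean NLL over scored tokens
    nll_sum += loss.item() * trg_len
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print(f"{model_id} perplexity on WikiText-2 test: {math.exp(nll_sum / n_tokens):.2f}")
```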

Thanks a lot for the reply.

I was also wondering how many epochs you would suggest for training GPT-2 from scratch so that it reaches a PPL of 29.41?

You won’t reach that PPL without training on a larger dataset, like OpenAI did.
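
Just to illustrate the mechanics (not the epoch count), here is a minimal sketch of training a GPT-2-sized model from scratch on WikiText-2 with the Trainer, assuming the pretrained GPT-2 tokenizer is reused. The block size, batch size, and epoch count are placeholders, and as noted above, this alone will not get you to 29.41.

```python
import math
from datasets import load_dataset
from transformers import (
    GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
    Trainer, TrainingArguments, default_data_collator,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel(GPT2Config())  # randomly initialized, not the pretrained weights

block_size = 512
raw = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate everything and cut into fixed-length blocks for causal LM training.
    ids = sum(examples["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    blocks = [ids[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)
lm_data = tokenized.map(group_texts, batched=True, remove_columns=tokenized["train"].column_names)

args = TrainingArguments(
    output_dir="gpt2-wikitext2-scratch",
    num_train_epochs=10,              # arbitrary; watch the eval loss instead
    per_device_train_batch_size=4,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_data["train"],
    eval_dataset=lm_data["validation"],
    data_collator=default_data_collator,
)
trainer.train()
eval_loss = trainer.evaluate()["eval_loss"]
print(f"validation perplexity: {math.exp(eval_loss):.2f}")
```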

I see. Thanks! 🙂