Continue pre-training GPT2

Hi guys,

Since OpenAI introduced GPT2 in 2019, a lot has changed and new methods/optimization schemes have emerged.
I believe GPT2 is sub-optimal considering the jump NLP has made since then.

Therefore, I’m trying to continue pre-training GPT2 (small, medium, large), and would love to hear about your experience!

  • I’m using the openwebtext dataset (see the training sketch after this list for my current setup); do any of you recommend a better/richer one?
  • Did any of you try distillation to continue pre-training GPT2? (I sketch what I mean by that below.)
  • Any other SOTA tricks/optimization methods you’d recommend?
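
For context, here’s roughly what my continued pre-training setup looks like. It’s a minimal sketch using the Hugging Face transformers/datasets stack; the hyperparameters, dataset slice, and output path are just placeholders I picked, not tuned values.

```python
# Minimal sketch: continue pre-training GPT-2 on openwebtext with the
# Hugging Face Trainer. Hyperparameters below are placeholders, not tuned.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")  # start from the released weights

# Small slice for a smoke test; swap in the full split for a real run.
raw = load_dataset("openwebtext", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives standard causal-LM labels (shifted inputs).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-continued",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,      # lower than a from-scratch schedule, since we resume from trained weights
    warmup_steps=1000,
    num_train_epochs=1,
    fp16=True,
    logging_steps=100,
    save_steps=5000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```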
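
And to clarify the distillation question: what I have in mind is the usual soft-target setup (as in DistilGPT2), where a frozen teacher’s logits guide the student alongside the normal LM loss. This is just a sketch of that loss; the temperature and mixing weight are arbitrary placeholders.

```python
# Sketch of a distillation loss: KL between temperature-scaled teacher and
# student distributions, mixed with the standard next-token cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL(teacher || student) on temperature-scaled distributions,
    # rescaled by T^2 so gradients keep a comparable magnitude.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: ordinary causal-LM cross-entropy on the real data.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd + (1.0 - alpha) * ce
```

Curious whether anyone has tried something like this with a larger model as the teacher and GPT2 as the student, rather than the other way around.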