Hi guys,
Since OpenAI introduced GPT-2 in 2019, a lot has changed and new training methods and optimization schemes have emerged.
I believe GPT-2's original training recipe is sub-optimal given the progress NLP has made since then.
Therefore, I'm trying to continue pre-training GPT-2 (small, medium, large), and would love to hear about your experience!
- I'm using the openwebtext dataset; does anyone recommend a better/richer one?
- Has anyone tried distillation while continuing to pre-train GPT-2?
- Are there any other SOTA tricks/optimization methods you'd recommend?
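For context, here's the kind of continued pre-training step I have in mind, sketched with Hugging Face `transformers` (my library choice, not something prescribed above). To keep the snippet self-contained I build a tiny randomly initialized GPT-2 from a config; for actual continued pre-training you'd load the released checkpoint with `GPT2LMHeadModel.from_pretrained("gpt2")` instead and iterate over real data.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny config so the sketch runs anywhere; swap for
# GPT2LMHeadModel.from_pretrained("gpt2") to continue from the real checkpoint.
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=1000)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Dummy batch standing in for tokenized openwebtext text.
input_ids = torch.randint(0, config.vocab_size, (2, 32))

# For causal LM pre-training, the labels are the input ids themselves;
# the model shifts them internally to compute next-token loss.
outputs = model(input_ids, labels=input_ids)
loss = outputs.loss

loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In a real run you'd wrap this in a `Trainer` (or your own loop with gradient accumulation and a warmup/decay schedule), since those details matter a lot when resuming from a converged checkpoint.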